Suggestions on Using the
Advance Reservation Service (ARS)
Observations about ARS and PBS and how to increase
the chances of successfully running jobs under ARS
The Advance Reservation Service (ARS) was written by Lockheed Martin's Enterprise team as a way to make the HPCMP systems easier to use. ARS is intended to ease the bottleneck of software licenses and to use resources effectively, based on what those systems report to it. ARS runs on a server at the ARL DSRC in Aberdeen, Maryland. To request nodes on Lightning, Predator, or Spirit, the user interacts with ARS.
The Portable Batch System (PBS) is the batch scheduler written by Altair that runs on Lightning, Predator, and Spirit. It decides which jobs begin execution, and it manages and monitors their execution. PBS receives requests from ARS to reserve resources and responds to ARS confirming that those resources have been reserved.
Because there is no direct path between ARS and PBS, the two may at some point stop talking to each other. As a result, ARS will occasionally become unaware of the rapidly changing resource environment. Normally, ARS receives a request from a user for N nodes at some future time X. ARS then asks PBS to set aside N nodes for that job at time X and to confirm the reservation. PBS assembles the requested nodes and assigns them to the request from ARS. When the reservation time X arrives, PBS attempts to start the job. If something bad happens to one of the N reserved nodes before time X arrives, however, ARS has no way of knowing. PBS does know, and it will not attempt to run the job using the broken node. The user is then confronted with a job that won't run despite ARS claiming that all is well.
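The mismatch described above can be illustrated with a small toy model. This is only a sketch: the class names (Reservation, ToyScheduler) and node names are hypothetical, and nothing here reflects the real ARS or PBS interfaces. It shows how a confirmation recorded once can go stale when only the scheduler learns of a node failure.

```python
from dataclasses import dataclass, field


@dataclass
class Reservation:
    """Toy reservation: the nodes set aside and the confirmation ARS recorded."""
    nodes: set[str]
    confirmed: bool = False


@dataclass
class ToyScheduler:
    """Hypothetical stand-in for PBS: it knows which nodes are healthy right now."""
    healthy: set[str] = field(default_factory=set)

    def reserve(self, count: int) -> Reservation:
        # Pick `count` healthy nodes and report the reservation as confirmed.
        if count > len(self.healthy):
            raise ValueError("not enough healthy nodes")
        chosen = set(sorted(self.healthy)[:count])
        return Reservation(nodes=chosen, confirmed=True)

    def node_failed(self, name: str) -> None:
        # Only the scheduler learns about the failure; the reservation
        # record held on the "ARS" side is never updated.
        self.healthy.discard(name)

    def can_start(self, reservation: Reservation) -> bool:
        # The job starts only if every reserved node is still healthy.
        return reservation.nodes <= self.healthy


scheduler = ToyScheduler(healthy={"n1", "n2", "n3", "n4"})
reservation = scheduler.reserve(3)   # reserves n1, n2, n3
scheduler.node_failed("n2")          # a reserved node breaks before time X

ars_view = reservation.confirmed             # still True: nothing refreshed it
pbs_view = scheduler.can_start(reservation)  # False: n2 is no longer healthy

print(f"ARS believes the reservation is good: {ars_view}")
print(f"PBS is able to start the job: {pbs_view}")
```

In this toy model the two views disagree after the failure, which is the situation the user sees as a job that will not start.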
We recommend that you request at least two more nodes than you need. If you want more spare nodes than that, increase the number in multiples of two. You'll be taking a hit on your project's allocation, but the extra nodes act as a kind of job insurance.
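The arithmetic behind this recommendation can be sketched as follows. The function names are hypothetical, and the cost estimate assumes the allocation is charged per node-hour, which may not match how your project is actually charged.

```python
def padded_node_request(needed: int, spare: int = 2) -> int:
    """Return the node count to request: the nodes needed plus spare nodes.

    Follows the guidance above: at least two spare nodes, added in
    multiples of two.
    """
    if needed < 1:
        raise ValueError("needed must be at least 1")
    if spare < 2 or spare % 2 != 0:
        raise ValueError("spare must be a multiple of two, and at least 2")
    return needed + spare


def extra_node_hours(spare: int, walltime_hours: float) -> float:
    """Estimate the added allocation cost of the spare nodes.

    Assumption: the allocation is charged per node-hour.
    """
    return spare * walltime_hours


request = padded_node_request(16)       # 16 nodes needed, 2 spare
cost = extra_node_hours(2, 12.0)        # 2 spare nodes for a 12-hour job

print(f"Request {request} nodes; spare nodes cost about {cost} node-hours")
```

For a job that needs 16 nodes for 12 hours, this gives a request of 18 nodes and roughly 24 extra node-hours under the stated charging assumption.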