What is Condor?
Condor is a job scheduling system that accepts job submissions, finds resources to run them, and reports back on the results once each job has completed. It scavenges resources (hence the name) from all associated machines during idle periods, so as not to disturb the users of those machines. If you need to run many independent instances of a program, Condor is well suited to the task.
What Condor is not
Condor does not provide multithreaded computing. It divides a job up by thread and treats each requested thread as a separate task to hand out to an available processor. Because each thread runs on its own as an independent task on its own designated core, the job as a whole is not multithreaded.
Benefits of Using Condor
- Single location to run all your jobs: Instead of going from machine to machine, running as many threads as you can safely start on each, you go to submit.stat.duke.edu and tell Condor to run all your needed jobs from there. It finds the resources on its own.
- No more crippling other people's machines: Since Condor finds, allocates, moves and removes tasks on users' computers automatically, you won't have to worry about misjudging the resources a job needs on someone else's computer and making it unusable for them. Condor watches for idle systems to use and moves jobs away when users return to their machines.
Disadvantages of Condor
- Need to recompile compiled code for long-running jobs: Recompiling is not strictly necessary, but it is a good idea. When a program is compiled with condor_compile, Condor checkpoints the long-running process as it runs, so if the process needs to move to another machine, the job does not have to start over from the beginning.
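As a rough illustration of the recompile step, condor_compile is placed in front of the compile or link command you would normally run. The sketch below only builds and prints such a command line; the compiler flags and the file names simulate and simulate.c are hypothetical placeholders, not part of this site's setup.

```python
import shlex


def condor_compile_command(compiler_command):
    """Prefix an ordinary compile/link command with condor_compile.

    compiler_command is the command you would normally type, given as a
    single string (hypothetical example: "gcc -O2 -o simulate simulate.c").
    """
    return ["condor_compile"] + shlex.split(compiler_command)


# Print the command line that would be run on a machine with Condor installed.
command = condor_compile_command("gcc -O2 -o simulate simulate.c")
print(shlex.join(command))
```

The printed line can be pasted into a shell on a machine where Condor is installed.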
Brief Overview of Usage
Submitting a Job
Condor jobs are started by writing a file that describes the job and then submitting that file to the queue on the Condor submit server (submit.stat.duke.edu).
- Write a submission file with all the details of the job you wish to run.
- SSH to the submit server (submit.stat.duke.edu).
- Run 'condor_submit submitfilename'.
- Wait for emails indicating completion of each thread.
More complete instructions are on the submission file instructions page.
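The steps above can be sketched as a small script that writes a submission file and hands it to condor_submit. This is a minimal sketch under assumptions: the program name simulate.sh, its --seed argument, the output file names, and the choice of the vanilla universe are all hypothetical, and the settings your job needs may differ from the ones shown.

```python
import subprocess
from pathlib import Path


def build_submit_description(executable, arguments="", instances=1,
                             universe="vanilla"):
    """Return the text of a Condor submission file.

    All values are illustrative assumptions. $(Process) is expanded by
    Condor to 0, 1, 2, ... so each queued instance gets its own files.
    """
    lines = [
        f"universe = {universe}",
        f"executable = {executable}",
        f"arguments = {arguments}",
        "output = job.$(Process).out",
        "error = job.$(Process).err",
        "log = job.log",
        # Ask Condor to send an email when each instance completes.
        "notification = Complete",
        # Queue the requested number of independent instances.
        f"queue {instances}",
    ]
    return "\n".join(lines) + "\n"


def write_and_submit(path, text):
    """Write the submission file and run condor_submit on it.

    Only works on a machine where condor_submit is installed,
    for example the submit server.
    """
    Path(path).write_text(text)
    return subprocess.run(["condor_submit", str(path)], check=True)


if __name__ == "__main__":
    # Preview the file contents without submitting anything.
    print(build_submit_description("simulate.sh", "--seed $(Process)",
                                   instances=4))
```

Calling write_and_submit("simulate.submit", text) on the submit server would perform the last two steps in one go; here the script only prints the file so it can be checked first.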
Monitoring and Management Commands
- This command lists all the jobs you're currently running, giving condor process IDs, state, activity and activity time.
- condor_q -analyze <condor process id>
- Gives more information about a specific process.
- Lists all the jobs running in the Condor pool.
- condor_release <condor process id>
- Releases a job that was put into a held state because of problems such as authentication failures or hangs.
- condor_rm <condor process id>
- Cancels the given Condor process and removes it from the queue.
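If you manage jobs from scripts, the commands named above can be wrapped in a small helper. The sketch below is one possible way to do it, not an official interface: the labels "analyze", "release" and "remove" and the function names are hypothetical, and it assumes process IDs look like 42 or 42.0.

```python
import re
import subprocess

# Commands named in the list above, keyed by hypothetical short labels.
COMMANDS = {
    "analyze": ["condor_q", "-analyze"],
    "release": ["condor_release"],
    "remove": ["condor_rm"],
}

# Assumed ID format: a number, optionally followed by ".number".
JOB_ID = re.compile(r"^\d+(\.\d+)?$")


def build_command(action, job_id):
    """Return the argument list for one management command."""
    if action not in COMMANDS:
        raise ValueError(f"unknown action: {action!r}")
    job_id = str(job_id)
    if not JOB_ID.match(job_id):
        raise ValueError(f"unexpected process id: {job_id!r}")
    return COMMANDS[action] + [job_id]


def run(action, job_id):
    """Run the command and return its output; requires Condor to be installed."""
    result = subprocess.run(build_command(action, job_id),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

For example, build_command("release", "42.0") produces the argument list for condor_release 42.0, and run("release", "42.0") would execute it on the submit server.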