Condor Cluster

How to use condor:

First you need a job you want to run. Here is one, to be stored in a file called batch.R:

Sys.sleep(floor(runif(1,1,15))) #just to have this sample job take some time
cat(date(),file=paste("datafile",floor(runif(1,0,100000)),".txt",sep=""),append=FALSE)

Then you need a shell script to run the job. Call it myprog. As you can see, it can run R as well as other commands, such as echo:


#! /bin/sh
R CMD BATCH batch.R
echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished. Exiting"
exit 42

Make the file executable: chmod u+x myprog

Now make a script with commands for condor, call it myRjob.submit:


executable=myprog
universe=vanilla
arguments=Example.$(Cluster).$(Process) 5
output=results.output.$(Process)
error=results.error.$(Process)
transfer_input_files=batch.R
log=results.log
notification=never
should_transfer_files=YES
when_to_transfer_output = ON_EXIT
queue 17

List any necessary input files in the transfer_input_files line, comma-delimited, with NO spaces on either side of each comma. Also note the queue 17 line: it queues 17 copies of this job, which $(Process) numbers 0 through 16.
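A stray space in that list is easy to introduce by hand, so one option is to build the line with a small shell helper. This is only a sketch: join_input_files is a hypothetical helper name, and data.csv and params.txt are made-up file names standing in for whatever your job needs.

```shell
#!/bin/sh
# Hypothetical helper: join its arguments with commas and no spaces.
# The body runs in a subshell so the IFS change does not leak out.
join_input_files() (
    IFS=,
    printf '%s\n' "$*"
)

# Build the submit-file line (data.csv and params.txt are placeholders).
printf 'transfer_input_files=%s\n' "$(join_input_files batch.R data.csv params.txt)"
```

Paste the printed line into myRjob.submit in place of the existing transfer_input_files line.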

Now to run: in terminal, do
condor_submit myRjob.submit
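
With the submit file above, each of the 17 jobs sends its standard output to results.output.N, where N is the $(Process) number, starting at 0. Here is a minimal sketch for seeing which of those files have not shown up yet. missing_outputs is a hypothetical helper, and it assumes you run it in the directory you submitted from and kept the output= pattern shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the results.output.N files (N = 0 .. count-1)
# that do not exist yet in the current directory.
missing_outputs() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        f="results.output.$i"
        [ -e "$f" ] || printf '%s\n' "$f"
        i=$((i + 1))
    done
}

# 17 matches the "queue 17" line in myRjob.submit.
missing_outputs 17
```

An empty listing means every job has returned its output file.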

To get help with a command, do command_name -h

condor_submit: followed by your submission file, sends off your jobs (like qsub)

condor_q: tells you what is running (like qstat)
Some of your jobs may be held. This means that at least one of them had an error. Condor will not run held jobs again until you release them, which you should do once you think the error is fixed. You can then start them up again (even the ones that failed) using condor_release username, substituting your Condor username for username. It could be that only one node is having problems, so you can just release the jobs and hope that they’ll end up on other nodes. Better is to find the error, of course.
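
To find the error before releasing anything, condor_q -hold lists held jobs along with the reason each was held, and condor_release is the command that puts held jobs back in the queue. A hedged sketch follows: held_report is a hypothetical wrapper name, and using id -un assumes your Condor username is the same as your login name.

```shell
#!/bin/sh
# Hypothetical wrapper: show held jobs and their hold reasons for one user.
held_report() {
    user=$1
    if ! command -v condor_q >/dev/null 2>&1; then
        echo "condor_q not found; is the Condor bin directory on your PATH?" >&2
        return 1
    fi
    condor_q -hold "$user"
}

# Assumption: the Condor username matches the login name.
held_report "$(id -un)" || echo "could not list held jobs" >&2

# Once the cause is fixed, release the held jobs by hand:
#   condor_release "$(id -un)"
```

The release step is left as a comment on purpose, so that nothing is restarted before you have read the hold reasons.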

condor_status: tells you the status of nodes

condor_rm: followed by user name, removes that user’s jobs (no, you can’t delete someone else’s). You can also specify individual job numbers or cluster numbers (a set of related jobs). See condor_rm -h for more options.

Error messages:

“About to exec /condor/condor-install/var/execute/dir_7640/condor_exec.exe” followed by “Create_Process: child failed with errno 2 (No such file or directory) before exec()”: Probably means that something you’re trying to run is not installed on that machine.
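
One way to make this failure easier to read is to check for the programs the job needs at the top of myprog, so the error file names the missing tool and the machine. This is a sketch under assumptions: check_commands is a hypothetical helper, and the list of tools (here sh and R) is whatever your own script calls.

```shell
#!/bin/sh
# Hypothetical helper: report any command that is not on PATH on this machine.
# Returns 0 if all commands are found, 1 otherwise.
check_commands() {
    rc=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd on $(uname -n)" >&2
            rc=1
        fi
    done
    return "$rc"
}

# In myprog this would go before "R CMD BATCH batch.R".
check_commands sh R || echo "a required program is not installed on this node" >&2
```

The message ends up in results.error.N for the affected job, which tells you which node to fix or avoid.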

How to set up a new Condor installation

Our lab uses both UTK’s shared computing cluster, Newton, which uses Sun Grid Engine for scheduling, and a set of Mac Pros using Condor for scheduling. Condor is nice as you can take advantage of unused CPU cycles on existing machines. It can be configured to only run when the computer is not otherwise being used (as measured by mouse/keyboard activity), though we run it so that jobs can keep running while people are working. Here is how we currently install Condor:

  • Made a condor user using system preferences
  • Enabled root user on OS X
  • Used instructions at http://switchingtolinux.blogspot.com/2008/12/installing-condor-in-osx-105.html
  • su to be superuser
  • chmod a+rx ~condor
  • cd /
  • mkdir condor
  • cd /condor
  • put *tar.gz condor distro into /condor (will probably need to do this from another terminal window: sudo mv condor*.tar.gz /condor)
  • tar -xvf *tar.gz
  • cd condor-7*
  • If manager: ./condor_configure --install --install-dir=/condor/condor-installed --local-dir=/condor/condor-installed/var --owner=condor --type=manager,submit,execute
    If execution/submission node: ./condor_configure --install --install-dir=/condor/condor-installed --local-dir=/condor/condor-installed/var --owner=condor --type=execute,submit --central-manager=omearalab13.bio.utk.edu
  • PATH=$PATH:/condor/condor-installed/bin/:/condor/condor-installed/sbin/
  • cp /PATH/TO/condor_config_TEMPLATE /condor/condor-installed/etc/condor_config (condor_config_TEMPLATE is my modification of the condor_config file)
  • cp /PATH/TO/condor_config.local_TEMPLATE /condor/condor-installed/var/condor_config.local
  • You may have to change CONDOR_IDS in condor_config.local to match the condor user. The setting takes the form uid.gid, which you can get from “id -u condor” and “id -g condor”
  • ln -s /condor/condor-installed/etc/condor_config ~condor/
  • condor_master
  • OS X will ask if you want to allow various condor daemons to have network access. You do.
  • ps -ef|grep condor
  • condor_status to see that it is running (might take a literal minute for polling the nodes)
  • Edit root’s crontab: while still in su, run crontab -e, then add the line @reboot /condor/condor-installed/sbin/condor_master