Distributed Computation in the Labs
Background
These scripts make it easy to run and monitor the execution of processes across all the computers in the CS labs at York.
As part of my final year project, I needed to run a vast number of benchmarks in a short space of time. By distributing the benchmarks over 70 unused computers, I could get around three weeks of solid computation done in one night.
Disclaimer: Be careful what you do with these scripts. I imagine the administrators allow this kind of computation at their discretion, so please don’t abuse it.
Steps for use
These tools run only on Linux, and the worker machines must also be booted into Linux, so first reboot as many machines as you need into Linux.
Firstly, log on to a lab PC. Get a list of all lab PCs which you want to run computation on and put them in a file called ‘.labpcs’ in your home directory. This can be done manually or automatically:
- Manually: look at the stickers on the machines and note the name for each one – the name will be ‘pcXs’, where X is a three-digit number, e.g. pc172s. Put these names in the .labpcs file, one on each line.
- Automatically: run the command
  labpcs > ~/.labpcs
  The labpcs command scans all of the lab PCs and lists those which are booted into Linux and not currently in use.
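If you fill in the file by hand, a quick sanity check can catch mistyped names. The following is only a sketch: it assumes the 'pcXs' naming scheme described above (three digits) and one name per line, and the function name read_labpcs is made up for illustration; it is not part of labpcs or labrun.

```python
import re
from pathlib import Path

# Assumed from the naming scheme described above: "pc", three digits, "s".
PC_NAME = re.compile(r"pc\d{3}s")


def read_labpcs(path):
    """Return the well-formed machine names listed one per line in `path`."""
    names = []
    for line in Path(path).read_text().splitlines():
        name = line.strip()
        if not name:
            continue  # skip blank lines
        if PC_NAME.fullmatch(name):
            names.append(name)
        else:
            print("ignoring unexpected entry:", name)
    return names
```

Calling read_labpcs(Path.home() / ".labpcs") would then return the usable names in file order.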
With the list of worker machines in place, you can now use the labrun command to execute commands across all of the machines and track their progress. labrun works by allowing you to specify (or generate) a list of commands to be executed. It takes care of allocating commands to each free computer and restarting the command if the computer goes down mid-execution. Use is illustrated by example.
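To make the restart behaviour concrete, here is a rough Python sketch of a requeue-on-failure loop. It is not labrun's actual code: run_once is a hypothetical callable standing in for "run this command on some worker", returning False if the machine went down before the command finished, and max_attempts is an extra safeguard added for the sketch.

```python
from collections import deque


def run_all(commands, run_once, max_attempts=5):
    """Run every command, putting failed ones back on the queue.

    `run_once(command)` is a hypothetical callable that returns True on
    success and False if the worker died mid-execution.
    Returns a dict mapping each command to the number of attempts it took.
    """
    pending = deque(commands)
    attempts = {}
    while pending:
        command = pending.popleft()
        attempts[command] = attempts.get(command, 0) + 1
        if run_once(command):
            continue  # finished, nothing more to do for this command
        if attempts[command] >= max_attempts:
            raise RuntimeError("giving up on: " + command)
        pending.append(command)  # retry later, possibly on another machine
    return attempts
```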
Examples
Suppose there were three worker machines listed in ~/.labpcs, and you wanted to run the standard sleep command 5 times, sleeping for different amounts of time. You would use the following to do so:
The first worker will execute sleep 20, the second will execute sleep 21, and the third will execute sleep 22. When any worker finishes running its command, it will be allocated the next command (in this case, sleep 23). labrun will terminate when all commands have been completed.
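The allocation order described here can be checked with a small simulation. This is a sketch of the scheduling idea only, not how labrun is implemented: it assumes commands are handed out in order to whichever worker becomes free first, and that the fifth command is sleep 24.

```python
import heapq


def simulate(durations, n_workers):
    """Simulate handing out commands in order to the first free worker.

    Returns (schedule, finish_time), where schedule is a list of
    (worker_index, start_time, duration) tuples in allocation order.
    """
    # Each heap entry is (time the worker becomes free, worker index).
    free_at = [(0, worker) for worker in range(n_workers)]
    heapq.heapify(free_at)
    schedule = []
    finish_time = 0
    for duration in durations:
        start, worker = heapq.heappop(free_at)  # earliest free worker
        schedule.append((worker, start, duration))
        end = start + duration
        finish_time = max(finish_time, end)
        heapq.heappush(free_at, (end, worker))
    return schedule, finish_time


schedule, finish_time = simulate([20, 21, 22, 23, 24], n_workers=3)
print(schedule)     # worker 0 picks up the 23-second sleep at t=20
print(finish_time)  # 45
```

Under these assumptions the whole run takes 45 seconds, against 110 seconds if the five sleeps ran one after another on a single machine.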
Another example: suppose you wanted to build a word Markov chain by training on a corpus. Your corpus consists of a large number of text files in a folder, and you have a program, markovtrain, which takes as its sole argument the path to a text file to be trained on. If your corpus folder is called ‘corpus’ in your home directory and the path to the executable markovtrain is ~/foo/markovtrain, then you can use:
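As a rough illustration of the one-command-per-file list this setup implies, here is a Python sketch. It does not show labrun's own syntax; the function name commands_for_corpus is made up, and it simply pairs the executable with every regular file in the folder.

```python
import os
import shlex


def commands_for_corpus(executable, corpus_dir):
    """Return one shell-quoted command string per regular file in corpus_dir."""
    commands = []
    for name in sorted(os.listdir(corpus_dir)):
        path = os.path.join(corpus_dir, name)
        if os.path.isfile(path):  # skip sub-directories
            commands.append(shlex.quote(executable) + " " + shlex.quote(path))
    return commands
```

For the example above you might call commands_for_corpus(os.path.expanduser("~/foo/markovtrain"), os.path.expanduser("~/corpus")). The paths are expanded first because a quoted ~ is not expanded by the shell.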
Complex arguments with Python
Clearly, listing every argument by hand becomes impractical when there are hundreds of commands to execute. labrun allows you to specify the arguments using a Python expression instead. The sleep example above could be executed with:
If the first argument starts with ‘py:’, then the following text is evaluated as Python code. The code must produce a list, and that list is taken to be the distributed command arguments.
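The rule above might be sketched as follows. This is an illustration of the idea rather than labrun's source: the function name expand_argument is made up, and passing non-prefixed arguments through unchanged is an assumption.

```python
def expand_argument(arg):
    """Expand a 'py:'-prefixed argument into a list of argument strings."""
    prefix = "py:"
    if not arg.startswith(prefix):
        return [arg]  # assumption: other arguments are passed through as-is
    # eval runs arbitrary code, so only use it on expressions you wrote yourself.
    value = eval(arg[len(prefix):])
    if not isinstance(value, list):
        raise TypeError("the Python expression must produce a list")
    return [str(item) for item in value]


print(expand_argument("py:[20 + i for i in range(5)]"))
# ['20', '21', '22', '23', '24']
```

Note that under Python 3 a bare range(...) is not a list, so in this sketch it would need wrapping in list(...) or a list comprehension.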
For my project, I used a list comprehension to describe the different combinations of arguments to the program which I wanted to run. It looked something like:
This executed each of the 1000 benchmarks for a few different instance sizes.
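To give a feel for the shape of such an expression, here is a hypothetical comprehension pairing every benchmark with every instance size. The argument format ("benchmark size") and the sizes 10, 20 and 30 are invented for illustration and are not the values used in the project.

```python
def benchmark_arguments(n_benchmarks, sizes):
    """Return one 'benchmark size' argument string per combination."""
    return [
        "%d %d" % (benchmark, size)
        for benchmark in range(n_benchmarks)
        for size in sizes
    ]


arguments = benchmark_arguments(1000, [10, 20, 30])
print(len(arguments))  # 3000
print(arguments[:4])   # ['0 10', '0 20', '0 30', '1 10']
```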
Download
Instructions
Download labrun and labpcs, make them executable, and place them somewhere in your $PATH (or specify the full path to them when running).
Possible further improvements
When someone else logs on to a lab computer it would be polite to stop using it for computation; as labrun stands, the new user can already restart the computer to stop it being used, but that’s not quite good enough.
I heard a rumour that it’s possible to remotely start machines in the labs. If that’s the case, then it might be a useful feature to add to labpcs.
Working from outside the labs
If you’d like to start the computation from a computer outside of the labs, a little extra work is needed because of the authentication system used in the department. You must first ensure your user has a passphrase-less SSH key created using ssh-keygen. Now, add your public key to the authorized keys file:
This allows you to SSH into each lab computer once inside the departmental network (but not inside the labs) without entering a password, thus allowing labrun to function. You can now SSH to your-user-name@milan.cs.york.ac.uk and proceed as normal with labpcs/labrun.
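For reference, the key-authorising step can be sketched in Python. This is an illustration only: the function name authorise_key is made up, and the paths in the usage note assume the usual OpenSSH defaults, which may differ on your account.

```python
import os


def authorise_key(public_key_path, authorized_keys_path):
    """Append the public key to authorized_keys unless it is already listed.

    Returns True if the key was added and False if it was already present.
    """
    with open(public_key_path) as f:
        key = f.read().strip()
    content = ""
    if os.path.exists(authorized_keys_path):
        with open(authorized_keys_path) as f:
            content = f.read()
    if key in (line.strip() for line in content.splitlines()):
        return False
    with open(authorized_keys_path, "a") as f:
        if content and not content.endswith("\n"):
            f.write("\n")  # avoid gluing the key onto an unterminated line
        f.write(key + "\n")
    os.chmod(authorized_keys_path, 0o600)  # keep the file private to its owner
    return True
```

Assuming an RSA key saved under the default name, the call would be authorise_key(os.path.expanduser("~/.ssh/id_rsa.pub"), os.path.expanduser("~/.ssh/authorized_keys")).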