Paul Giannaros

Distributed Computation in the Labs

Background

These scripts make it easy to run and monitor the execution of processes across all the computers in the CS labs at York.

As part of my final year project, I needed to run a vast number of benchmarks in a short space of time. By distributing the benchmarks over 70 unused computers, I could get around three weeks of solid computation done in one night.

Disclaimer: It’s probably best to be careful with what you do with these scripts. I imagine the administrators allow this computation to happen at their discretion, so it’s best to not abuse it.

Steps for use

These tools only run on Linux and need the worker machines to be booted into Linux, so restart as many machines as you require into Linux.

Firstly, log on to a lab PC. Get a list of all lab PCs which you want to run computation on and put them in a file called ‘.labpcs’ in your home directory. This can be done manually or automatically:

  • Manually: look at the stickers on the machines and note the name for each one – the name will be ‘pcXs’, where X is a three-digit number, e.g. pc172s. Put these names in the .labpcs file, one on each line.
  • Automatically: run the command
    labpcs > ~/.labpcs
    The labpcs command scans all of the lab PCs and lists those which are booted into Linux and not currently in use.
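For the manual route, the names can be generated rather than typed out. This is a minimal sketch, assuming you have noted the sticker numbers yourself; the function name labpc_names and the numbers shown are purely illustrative and are not part of labpcs:

```python
def labpc_names(numbers):
    """Turn sticker numbers into host names of the form pcXs, e.g. 172 -> pc172s."""
    return ["pc%ds" % n for n in numbers]

# Hypothetical sticker numbers; replace with the machines you actually noted.
names = labpc_names([172, 173, 174])
print("\n".join(names))
```

Redirecting the output of such a script into ~/.labpcs gives one name per line, which is the layout described above.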

With the list of worker machines in place, you can now use the labrun command to execute commands across all of the machines and track their progress. labrun works by allowing you to specify (or generate) a list of commands to be executed. It takes care of allocating commands to each free computer and restarting the command if the computer goes down mid-execution. Use is illustrated by example.

Examples

Suppose there were three worker machines listed in ~/.labpcs, and you wanted to run the standard sleep command 5 times, sleeping for different amounts of time. You would use the following to do so:

labrun sleep 20 21 22 23 24

The first worker will execute sleep 20, the second will execute sleep 21, and the third will execute sleep 22. When any worker finishes running its command, it will be allocated the next command (in this case, sleep 23). labrun will terminate when all commands have been completed.
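The allocation order described above can be reproduced with a small simulation. This is only a sketch of the behaviour, not labrun's actual code: it assumes each command takes exactly its sleep time, ignores SSH and machine failures, and the function name simulate is made up for illustration.

```python
import heapq

def simulate(durations, num_workers):
    """Return (worker, duration, start_time) for each command, in hand-out order."""
    # Each heap entry is (time at which the worker becomes free, worker index).
    free_at = [(0, w) for w in range(num_workers)]
    heapq.heapify(free_at)
    schedule = []
    for d in durations:
        t, w = heapq.heappop(free_at)        # the earliest-free worker takes the next command
        schedule.append((w, d, t))
        heapq.heappush(free_at, (t + d, w))  # it is busy until t + d
    return schedule

for worker, duration, start in simulate([20, 21, 22, 23, 24], 3):
    print("worker %d runs 'sleep %d' starting at t=%d" % (worker, duration, start))
```

Workers are numbered from 0 here, so worker 0 is the first machine listed in ~/.labpcs.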

Another example: suppose you wanted to build a word Markov chain by training on a corpus. Your corpus consists of a large number of text files in a folder, and you have a program, markovtrain, which takes as its sole argument the path to a text file to be trained on. If your corpus folder is called ‘corpus’ in your home directory and the path to the executable markovtrain is ~/foo/markovtrain, then you can use:

labrun ~/foo/markovtrain ~/corpus/*.txt

Complex arguments with Python

Clearly it becomes impractical to list all different arguments if there are hundreds of commands to be executed. labrun allows you to specify the arguments using a Python expression. The sleep example above could be executed with:

labrun sleep 'py:range(20, 25)'

If the argument after the command starts with ‘py:’, then the text following that prefix is evaluated as Python code. The code must produce a list, and each element of that list is taken to be the argument for one distributed command.
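To make the expansion concrete, here is a rough sketch of how such an argument could be turned into individual command lines. The helper expand_arguments is hypothetical, and the use of eval is an assumption about how the evaluation might be done; labrun's real implementation may differ.

```python
def expand_arguments(command, argument):
    """Hypothetical helper: build one command line per distributed argument."""
    if argument.startswith("py:"):
        # Evaluate the text after the prefix as a Python expression.
        # list() is used so that a Python 3 range object is accepted too.
        values = list(eval(argument[len("py:"):]))
    else:
        values = [argument]
    return ["%s %s" % (command, v) for v in values]

print(expand_arguments("sleep", "py:range(20, 25)"))
```

Because the expression is evaluated as code, it should only ever come from your own command line.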

For my project, I used a list comprehension to describe the different combinations of arguments to the program which I wanted to run. It looked something like:

labrun ~/project/benchmark 'py:["%s %s" % (bench_num, problem_size) for bench_num in range(1, 1001) for problem_size in range(3, 25, 3)]'

This executed each of the 1000 benchmarks at eight instance sizes (3, 6, ..., 24), giving 8000 commands in total.
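Before launching a run of this size, it may be worth checking the expression in an ordinary Python interpreter first. The snippet below evaluates the same list comprehension on its own; the variable name args is just for illustration.

```python
# The same expression as in the labrun command above, without the 'py:' prefix.
args = ["%s %s" % (bench_num, problem_size)
        for bench_num in range(1, 1001)
        for problem_size in range(3, 25, 3)]

print(len(args))   # number of commands labrun would be given
print(args[:3])    # the first few argument strings
```

A mistake in the ranges is much cheaper to spot here than after the commands have been handed out to the workers.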

Download

Instructions

Download labrun and labpcs, make them executable, and place them somewhere in your $PATH (or specify the full path to them when running).

Possible further improvements

When someone else logs on to a lab computer it would be polite to stop using it for computation; as labrun stands, the new user can already restart the computer to stop it being used, but that’s not quite good enough.

I heard a rumour that it’s possible to remotely start machines in the labs. If that’s the case, then it might be a useful feature to add to labpcs.

Working from outside the labs

If you’d like to start the computation from a computer outside of the labs, a little extra work is needed because of the authentication system used in the department. You must first ensure your user has a passphrase-less SSH key created using ssh-keygen. Now, add your public key to the authorized keys file:

cat ~/.ssh/*.pub >> ~/.ssh/authorized_keys

This allows you to SSH into each lab computer once inside the departmental network (but not inside the labs) without entering a password, thus allowing labrun to function. You can now SSH to your-user-name@milan.cs.york.ac.uk and proceed as normal with labpcs/labrun.