Friday, April 26, 2013

Benchmarking remote data access

A while back I mentioned SkeletonKey and how it uses Parrot and Chirp to allow users to remotely access software and data.  This blog post will go through some of the benchmarking I did to see what sort of performance trade-offs come with accessing data remotely.

In order to do this, I conducted some benchmarks to see how performance differs between using local file access and remotely accessing the same filesystem using Chirp.  The tests all ran against a system with a 3-drive RAID-0 array that provided raw speeds of about 300MB/s.  Then 64 clients were set to read or write 10GB files on the array, either locally or remotely using Chirp from a nearby cluster that had a 10Gbit network connection to the server.  This was done several times, the times required to read or write the files were recorded, and a histogram of the times was created.
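The setup above boils down to timing big sequential reads and writes.  A minimal sketch of the read side (not the actual benchmark harness, and using a small temporary file instead of a 10GB one) might look like:

```python
import os
import tempfile
import time

def time_read(path, chunk_size=1 << 20):
    """Time a sequential read of a file; return (seconds, bytes_read)."""
    start = time.time()
    total = 0
    with open(path, "rb") as fh:
        while True:
            data = fh.read(chunk_size)
            if not data:
                break
            total += len(data)
    return time.time() - start, total

# demo on a small temporary file (the real benchmark used 10GB files)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(4 * 1024 * 1024))  # 4MB of random data
elapsed, nbytes = time_read(tmp.name)
os.unlink(tmp.name)
print(nbytes)  # 4194304
```

Running many of these clients concurrently against the same array (locally or through a Chirp mount point) and histogramming the elapsed times gives plots like the ones below.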

The two plots below show the time it takes to read a 10GB file from the RAID array, either locally or remotely using Chirp.  The clients accessing the files locally had some interesting behaviour: the mean time for completion was about 3025s, but one group of clients completed just under 3000s and a similar number of clients completed much faster, in about 2700s.  The clients reading the files remotely had a mean completion time of about 3050s, and completion times were mostly clustered around this value.


The plots for write performance when writing 10GB to separate files are shown below.  Here the clients writing locally cluster around completion times of ~2575s.  The completion times for clients writing remotely have a mean value of ~2900s, although there's a much wider spread in completion times in this case.
Looking at the plots, it turns out that the overhead for remotely accessing data through Chirp is fairly low: roughly 10% for reading and roughly 18% for writing.  All in all, that's a fairly reasonable tradeoff for being able to access your data from other systems!
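For reference, those overhead percentages are just the relative difference between completion times, something like this (with made-up round numbers rather than the measured ones):

```python
def overhead_pct(t_local, t_remote):
    """Relative slowdown of remote access, as a percentage of local time."""
    return 100.0 * (t_remote - t_local) / t_local

# illustrative numbers only, not the benchmark results
print(overhead_pct(100.0, 110.0))  # 10.0
```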

Monday, April 1, 2013

Installers and more

Before talking about performance, I'd like to talk a bit about installers.  From a development point of view, installers are for the most part not that exciting.  The bulk of an installer's code involves copying files around and possibly changing system settings.  This is all fairly standard and relatively well understood, but it has a lot of corner cases that can cause bad things to happen™ (data loss, opening up security holes, etc.).  Fortunately, there are standard tools for this like Wise on Windows and rpms, debs, etc. on Linux.  However, SkeletonKey is targeted at ordinary users and therefore can't use these, so I needed to roll something myself.

It turns out that Python's batteries-included philosophy makes for a nice ad hoc installer.  SkeletonKey's installer needed to do a few things: figure out and download the latest version of the CCTools tarball; install it in a user-specified location; and finally download and install the SkeletonKey scripts.  Using the urllib2 module, the installer can download pages and tarballs from the web using
urllib2.urlopen(url).read()
to get the HTML source for a page for scraping.  Downloading a tarball is a bit trickier, but not much so:
# assumes os, tempfile, and urllib2 have been imported
(fhandle, download_file) = tempfile.mkstemp(dir=path)
url_handle = urllib2.urlopen(url)
# copy the response into the temp file in 2KB chunks
url_data = url_handle.read(2048)
while url_data:
  os.write(fhandle, url_data)
  url_data = url_handle.read(2048)
os.close(fhandle)
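For what it's worth, the same chunked-download pattern carries over almost unchanged to Python 3's urllib.request (the successor to urllib2).  Here's a sketch wrapped in a function, demonstrated with a data: URL so no network access is needed:

```python
import os
import tempfile
import urllib.request  # Python 3 replacement for urllib2

def download_to_tempfile(url, path, chunk_size=2048):
    """Stream a URL into a temp file under `path`; return the file name."""
    fhandle, download_file = tempfile.mkstemp(dir=path)
    url_handle = urllib.request.urlopen(url)
    data = url_handle.read(chunk_size)
    while data:
        os.write(fhandle, data)
        data = url_handle.read(chunk_size)
    os.close(fhandle)
    return download_file

# demo with a data: URL (no network needed); a real run would pass the
# tarball's http(s) URL instead
tmpdir = tempfile.mkdtemp()
fname = download_to_tempfile("data:text/plain,hello", tmpdir)
with open(fname) as fh:
    contents = fh.read()
print(contents)  # hello
```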
The tarfile module also comes in handy when it's time to untar packages and examine the contents of downloaded tarballs.  The first thing to do is some inspection of the tarball.  Luckily, the installer only deals with tarballs that put everything into a single parent directory, so the following code gives us the directory the files will be extracted to:
downloaded_tarfile = tarfile.open(download_file)
extract_path = os.path.join(path, downloaded_tarfile.getmembers()[0].name)
Once we have this, we can use the extractall method to untar everything.  The tarfile module also provides some convenient tools for repackaging things into a new tarball that users can subsequently use.  There are some other things that need to be checked (in particular, a tarball's members need sanity checking before calling extractall, otherwise someone can put an absolute path or something like ../../../../../../etc/passwd in there to do bad things), but the python modules provide a lot of help with the more complicated tasks of an installer, like downloading files from the web and opening up installation packages.
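One way to do that sanity checking (a sketch, not the actual SkeletonKey code) is to resolve each member's path and keep only those that stay inside the destination directory:

```python
import io
import os
import tarfile

def safe_members(tar, dest):
    """Yield only members whose resolved path stays inside dest."""
    dest = os.path.realpath(dest)
    for member in tar.getmembers():
        target = os.path.realpath(os.path.join(dest, member.name))
        if target == dest or target.startswith(dest + os.sep):
            yield member
        # absolute paths and ../ escapes fail the check and are skipped

# build a small in-memory tarball with one good and one malicious member
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name in ("pkg/ok.txt", "../../../../etc/passwd"):
        info = tarfile.TarInfo(name=name)
        data = b"contents"
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)

with tarfile.open(fileobj=buf) as tar:
    kept = [m.name for m in safe_members(tar, "/tmp/install")]
print(kept)  # ['pkg/ok.txt']
```

The filtered list can then be passed to extractall via its members argument, so only the vetted entries are ever written to disk.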

Monday, March 18, 2013

Playing around with remote software and data access

Playing around with SkeletonKey, Parrot, and Chirp

One of the projects I've been working on is called SkeletonKey.  SkeletonKey is a tool that lets people create scripts that will run their applications in an environment that allows for remote software and data access.  As an example, suppose you're interested in analyzing the temperature data from the last 100 years and generating some graphics based on it.  You could download all the data to your computer and then run a program to go through all of it.  After waiting a while, you'd get your results back.

Or if you had access to a cluster, you could split the task up and submit it to a few hundred cores and get the results back much more quickly than running the application on your personal computer.  The only problem is that you may have terabytes of data and your application may be a few gigabytes in size and you'd rather not have to transfer all of this over to the cluster and then convince the administrators that they should install your application so that you can use it.

That's where Parrot and Chirp come in.  Chirp allows you to export your data from your system to remote computers over the internet.  Parrot lets you run your application in an environment that intercepts local file access and transparently turns it into remote network access.  This is all done in user space, so you can even use Parrot to run a shell script that then runs an application that's actually located elsewhere.  So if you can run a shell script and access the web, you can run an application and get read/write access to data from remote sources without having to install a bunch of libraries and binaries or transfer large amounts of data just to do your work.

SkeletonKey works with Parrot and Chirp to generate a shell script that will do all the legwork for you.  That is, you give SkeletonKey a simple ini file with your configuration, and it'll generate a shell script that downloads your application as well as the appropriate Parrot binaries and then runs your application in a Parrot environment that has all of the software and data access you may want.

I have more information on how well this compares to accessing your files or data locally but that'll need to wait until the next entry.