For the past two months or so, I’ve been slowly migrating my scientific workflow (that’s a fancy way of saying “my chaotic data hacking”) from Matlab ((R) (TM) (C)) to Python. The results are overwhelmingly positive, so I’d like to rant about it a bit. First, some background.
My work typically involves the analysis of tons of remote sensing observations contained in files of various formats (netCDF if I’m very lucky, HDF if I’m lucky, some weird non-standard binary thing if I’m not); all these files span terabytes and terabytes of hard drive space stored in racks in a big temperature-controlled room somewhere high in the sky. I ssh to a central server on which all these drives are mounted; I then run code there in whatever language is most convenient to analyze the data.
After a few years of this, Matlab emerged as the best solution for several reasons:
- interactive sessions let you play with the data and make the analysis algorithms “evolve” (the analysis procedure is often not cast in stone and writes itself as I go along and understand the data better);
- the syntax is well-suited to working with numerical arrays (i.e. vectorized code, something also present in Fortran 90 but which sometimes gives buggy results there);
- powerful input/output facilities: reading netCDF and HDF is as easy as ncload file.nc or hdfread('file.hdf', 'some_variable'), without all the administrative overhead of compiled languages (memory management, static typing, etc). This is an important point: in Fortran it often takes me as long to get the I/O right as to write the actual algorithm;
- powerful plotting capabilities give you immediate visual feedback following your choices.
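To give an idea of what “well-suited to numerical arrays” means on the Python side of things, here is a minimal NumPy sketch (the data and the variable names are made up for illustration):

```python
import numpy as np

# Hypothetical example: a small 2-D array of brightness temperatures (K)
tb = np.array([[180.0, 210.0, 250.0],
               [190.0, 220.0, 300.0]])

# Vectorized operations work on whole arrays at once, Matlab-style:
tb_celsius = tb - 273.15        # elementwise arithmetic
cold = tb[tb < 200.0]           # boolean masking selects matching elements
column_means = tb.mean(axis=0)  # reduce along an axis
```

And unlike Matlab, an assignment in a script never floods your terminal with numbers, semicolon or not.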
For me, this means I usually get results quicker with an interpreted language like Matlab’s, even taking into account the higher speed of compiled code like Fortran. A nice side-effect is that working suddenly becomes a lot more enjoyable when I don’t have to spend so much time remembering all the Fortran idiosyncrasies, the differences between compilers (will this code work with ifort/gfortran/g95/pghpf/etc?) and which libraries to link to, fixing messy mixes of f77 and f90 syntax to give a predictable output, etc. Once I run the actual program, I know it would have been faster using a compiled language, but it would have taken me longer to get right and the coding would have been a lot less fun.
Now, the problems. Matlab is not free as in speech, meaning you often can’t see the code. Matlab is not free as in beer, meaning our institution owns a limited number of licenses, meaning that during student rush hours you often can’t even launch Matlab at all. The initial goal of Matlab was the analysis of matrices (hence the “mat”), not general arrays, which makes the code look weird in places and explains the FUCKING SEMICOLON you have to append to every instruction to prevent millions of numbers no human will ever read from flashing in front of your eyes. And you can’t launch standalone Matlab scripts without fishy syntax like “matlab -nodisplay < script.m”. Because of all this (mostly the non-free part), I had been looking for a replacement for some time. I tried R, Scilab, Octave, and tons of other stuff, but every time I found the language and the plotting capabilities to be worse, when I was hoping for something at least similar (I guess I was also somewhat reluctant to learn another closed-system language).
But somewhere I always hid a secret wish… to use Python. I love its syntax and its focus on simplicity and readability, but out of the box it lacked any capability for serious number crunching, so I had been patiently waiting on the sidelines for Python’s scientific packages to mature. Well, the stars are now aligning.
A year ago or so I took another look at Python’s scientific stack. I liked what I saw: everything good in Matlab (see above), without the annoyances, and free. But trying to get things running I got lost in the mess of version numbers and a never-ending chain of interdependent packages, which is even more fun when you have no root access and the machine you’re using comes with Python 1.5.2 and that’s it, no chocolate for you. Basically, you have to recompile everything by hand, and make sure you don’t forget that crucial compilation flag somewhere! Unfortunately I had other things to do (like actual work), so I reluctantly let go of the idea and stuck with Matlab.
Then came SAGE
Fast-forward to 2 or 3 months ago, when I stumbled upon SAGE. SAGE (apart from being an RSS aggregator for Firefox and a satellite instrument) is basically a wrapper around Python with tons of scientific packages added, all nicely pre-compiled into tasty binaries just for you by very nice people (which involves tons of work, not as simple as it sounds). These goodies come in gzipped tarballs that you dump into your $HOME. You can then launch the sage program, which handles regular Python just fine and includes all the modules I was longing for: NumPy (easy, efficient handling of huge numerical arrays with slicing and dicing), SciPy (input/output and scientific functions), Matplotlib (lots of plotting tools with lickable, anti-aliased output and a syntax almost identical to Matlab)! Even IPython is there, meaning you get a comfortable interactive experience with tab completion on files, objects, dictionaries and tons of other niceties! Since SAGE lets you install additional packages with a single command, it’s a piece of cake to add wxPython to get direct-to-screen plotting within your interactive session. Apotheose! Great success! Matlab without Matlab. AND it’s Python, meaning you’re using an actual, REAL language with object-oriented programming, introspection, dictionaries, etc. And since Python fits your brain, the first code you come up with is most likely the right one. As a bonus, SAGE is available for Linux (32/64), Mac OS X and even Windows (I think) so your code will work everywhere! Bliss.
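To illustrate the “syntax almost identical to Matlab” claim, here is a minimal Matplotlib sketch (the file name and labels are my own illustration; the Agg backend is used so it runs without a display, e.g. over ssh):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render to files, no display needed
import matplotlib.pyplot as plt

# plot(), xlabel(), legend()... read almost exactly like the Matlab calls
x = np.linspace(0.0, 2.0 * np.pi, 200)
plt.plot(x, np.sin(x), label="sin(x)")
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.legend()
plt.savefig("sine.png")  # anti-aliased output, nice fonts for free
```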
The best part was when I realized, a few hours later, that I actually didn’t need to use the SAGE program itself… inside the SAGE directory lies a local/ folder containing all the binaries, libraries and Python packages it uses. It even contains its own Python 2.5! Set the PATH, LD_LIBRARY_PATH and PYTHONPATH environment variables right and suddenly you have a perfectly consistent installation of everything that’s needed to do scientific work in Python! Other users on the same machine just need to change the same variables, and they can play too! Apotheose²! So in addition to its primary goal of providing a replacement for Mathematica/Maple/etc, SAGE, as a side-effect, provides the whole Python scientific shebang compiled and wrapped up in a nice package, for your pleasure.
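For the record, the environment setup looks something like this (assuming SAGE was unpacked in $HOME/sage; the exact subdirectories may differ between SAGE versions, so treat the paths as illustrative):

```shell
# Point everything at SAGE's bundled local/ tree
export SAGE_ROOT="$HOME/sage"
export PATH="$SAGE_ROOT/local/bin:$PATH"
export LD_LIBRARY_PATH="$SAGE_ROOT/local/lib:$LD_LIBRARY_PATH"
export PYTHONPATH="$SAGE_ROOT/local/lib/python2.5/site-packages:$PYTHONPATH"
```

Put those lines in your shell startup file and typing python at the prompt gives you SAGE’s own Python 2.5, with all the scientific modules on the path.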
Since Python is pretty smart, new Python modules will then install themselves in the right place with python setup.py install. So go ahead: install Basemap, netcdf4-python, PyHDF, scipy-cluster, PyNGL, whatever you need.
(Sidenote: other “integrated” Python distributions with a similar focus on scientific analysis are starting to pop up, like Python(x,y) or the Enthought Python Distribution. Travis Oliphant, one of the major architects of the recent NumPy restructuring, is now president of Enthought. They also host the SciPy website; you can’t get more central than that. Interesting stuff should happen there soon. They are a little too Windows-centric, though.)
Migrating my latest work project from Matlab to Python has been a success: all the figures in my latest paper were generated in Python, they look almost exactly the same as the ones generated in Matlab (just as good or better — fonts are noticeably nicer thanks to anti-aliasing), and the code is just as small and feels better. It seems like the only thing you could miss from Matlab is its numerous toolboxes, a gap that is slowly being filled within SciPy (I don’t actively use them, so I don’t care). Adieu, point-virgule!
Of course, now that I’ve been bitten by the Python bug, I’ve started following the NumPy, SciPy and Matplotlib mailing lists. Some great things are afoot, like the imminent NumPy 1.1 (previously 1.0.5, including shiny masked arrays, histograms and I/O), the release of Travis Oliphant’s Guide to NumPy in August 2008, lots of integration and standardization efforts between the various components, etc.
I guess the best thing is that it made me excited again about the idea of hacking stuff…