Documented and repeatable "mini-experiments"

In this post I outline the workflow I use for my "mini-experiments": code and notes that might otherwise be "throw-away". You can see an example of what this looks like here.

Org Mode
  - Org Mode for computational experimentation
Nix
Setting up an experimental environment
  - Citations
Hosting with Docker
Adding an IPython Jupyter Notebook
Conclusion

When learning, using and creating algorithms and computational tools, an important part of the process is playing around: getting to know the tools and specific commands, trying out ideas, and so on. Typically, for me, this might involve creating a bunch of Python files with half-meaningful names in a folder with some input data and result files. Unfortunately, this does not leave my findings in a very usable format. If, months later, I decide I want to look over a similar idea or pass my findings to someone else, I have to trawl back through these files or redo the experiments.

What is needed is a way to organise these ideas together, embedding the test code and results within a document. An easy way to do this is to simply copy-paste any code and results into a text or Word document with some comments. Presented in this post is an alternative workflow, which I find far more satisfactory. In this workflow, thanks to Org Mode, the code is embedded within the document, along with the results.

At our workplace we have a DNS server, which means it is very easy to connect remotely to a computer without remembering an IP address. We also use a VPN. This means that if I use my computer to host my notes over HTTP, I can access them from anywhere in the world by simply connecting over the VPN and typing my computer's name into a browser.

Org Mode

Org Mode is an Emacs major mode. Emacs operates in "modes", which are context-dependent (i.e., they change depending on what type of file is open). Org Mode works with .org files, which are basically just documents.

Org Mode is designed for keeping notes, TODO lists and plans [1]. Hence the name: it is meant to be an environment for organisation. As such, it has a strong focus on being able to create and manipulate lists quickly, and to track dates using a calendar.

Whilst Org Mode is a very effective organisational tool, this is not the primary reason I use it. To me, Org Mode is a simple markup language. Some would argue that Org Mode's markup syntax is not as "nice" as Markdown's, and I would agree with them. However, the two syntaxes are very similar in concept. What Org Mode has over competing markup languages like Markdown and reStructuredText is the host of features that come with it. The most important to me are:

- the ability to export (accurately and consistently) to a wide range of formats, including PDF (for papers), HTML (read on), and Markdown (for this blog); and
- the ability to embed and run code within the document using Babel. In fact, you can even embed code in one language, and pass the evaluated results to another embedded code snippet in an entirely different language.

Org Mode for computational experimentation

Babel is specifically designed with literate programming and reproducible research in mind [2]. It has also been recommended in a number of other workflows for reproducible research [3][4]. This article outlines my workflow for my own personal notes, but uses ideas from work presented in these papers.

Nix

Nix is a package manager, much like apt-get/dpkg or Homebrew. The main features of Nix I find useful are its ability to:

- support multiple versions of an installed package; and
- provide a shell with specific packages installed and nothing extra, using nix-shell --pure.

Nix is able to achieve this functionality through its unique implementation: Nix is a purely functional package manager [5]. This means it is able to consistently depend on specific versions of software, and ensure reproducible environments. In actuality, Nix builds often download source from the internet, which may, of course, become unavailable. So rather than guaranteeing reproducible builds, it actually guarantees that if an environment builds, it is identical, and otherwise will not build. Nonetheless, it is a useful tool as a package manager providing multiple environments.

Setting up an experimental environment

I keep all my experiments in my home directory, ~/experiments. This directory looks like:

$ tree ~/experiments

|-- 2016-03-13-random-exploration
|   `-- index.org
`-- template.org

The first important file here is template.org. Using this file, I can quickly start a new experiment using:

mkdir 2016-03-13-random-exploration
cp template.org 2016-03-13-random-exploration/index.org
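
The two commands above can also be wrapped in a small helper. The following is only a minimal sketch, not part of my actual setup: the function name new_experiment and its slug argument are hypothetical, and it assumes the directory layout shown earlier (a template.org sitting in the experiments folder).

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path


def new_experiment(root, slug, day=None):
    """Create <root>/<YYYY-MM-DD>-<slug>/index.org from <root>/template.org.

    Hypothetical helper that mirrors the mkdir + cp commands above.
    """
    root = Path(root).expanduser()
    day = day or date.today()
    exp_dir = root / '{}-{}'.format(day.isoformat(), slug)
    # Fail loudly rather than overwrite an existing experiment.
    exp_dir.mkdir(parents=False, exist_ok=False)
    index = exp_dir / 'index.org'
    shutil.copy(str(root / 'template.org'), str(index))
    return index


# Demonstration in a scratch directory standing in for ~/experiments.
with tempfile.TemporaryDirectory() as scratch:
    (Path(scratch) / 'template.org').write_text('#+TITLE:\n')
    created = new_experiment(scratch, 'random-exploration', date(2016, 3, 13))
    created_relative = created.relative_to(scratch).as_posix()
    created_text = created.read_text()

print(created_relative)
```

In real use the call would be something like new_experiment('~/experiments', 'random-exploration'), which prefixes today's date.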

Let's have a look at its contents.

#+TITLE:
#+AUTHOR: Ashley Gillman
#+EMAIL: ashley.gillman@csiro.au
#+OPTIONS: ^:{}
#+HTML_LINK_HOME: /
#+HTML_LINK_UP: ..
#+HTML_HEAD: <link rel="stylesheet" type="text/css" href="/style.css">
#+HTML_HEAD: <link rel="stylesheet" type="text/css" href="https://cdn.rawgit.com/dreampulse/computer-modern-web-font/master/fonts.css">

* Setup                                                            :noexport:
#+BEGIN_SRC nix :tangle default.nix
  let
    pkgs = import /home/ash/repo/nixpkgs {};
  in
  { stdenv ? pkgs.stdenv, pythonPackages ? pkgs.python34Packages }:

  stdenv.mkDerivation {
    name = "python-nix";
    buildInputs = [ pythonPackages.python
                    pythonPackages.scipy
                    pythonPackages.numpy
                    pythonPackages.matplotlib ];
  }
#+END_SRC

* Directory listing
#+BEGIN_SRC python :results output raw replace :exports results
  from pathlib import Path
  link_format = '- [[file:{0}][={0}=]]'.format
  print(*(link_format(p.name + ('/' if p.is_dir() else ''))
          for p in sorted(Path('.').iterdir())
          if not p.name.startswith(('.', '#'))),
        sep='\n')
#+END_SRC

* Aim
* Methodology

* Local Variables                                                 :noexport:
Local Variables:
org-export-babel-evaluate       : nil
org-confirm-babel-evaluate      : nil
org-html-link-org-files-as-html : nil
org-html-postamble-format       : '( \
  ("en" " <p class=\"author\"  >Author: %a (%e)</p>\n \
          <p class=\"date\"    >Date: %T</p>\n \
          <p class=\"creator\" >%c</p>\n \
          <p                   ><a href=\"/\">Home</a></p>"))
org-babel-python-command        : "\
  /home/ash/.nix-profile/bin/nix-shell \
    --pure \
    --command python3"
eval: (require 'ox-bibtex)
End:

The first block is some standard templating, setting myself as the author along with my email address. Options keywords can be found here. The next section, under the Setup heading, is more interesting. The :noexport: tag means that this section will not appear in the exported document. However, it does contain a source block with a Nix expression, tangled to default.nix, which by default sets up a basic Python 3 environment. Doing so means we know exactly which libraries our experiments are using, and should let us repeat the experiments even years later.

The Directory listing section simply contains a Python script that prints a link to each file and folder in the directory. This is just for convenience when later exploring the results. The Aim and Methodology headings are empty, just providing placeholders for later. Finally, the Local Variables section sets up Emacs file-local variables. Here I set org-export-babel-evaluate, which controls whether code blocks are evaluated when the file is exported (t re-runs everything on export; nil, as in the listing above, skips evaluation, which helps if the code takes a long time to run), disable confirmation messages (be careful if you didn't write the code), keep links to .org files pointing at the .org files rather than at their HTML exports, and set the HTML footer. Lastly, and importantly, I change the Python command to run via nix-shell --pure, which uses the environment defined in the Setup section.
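
To make the behaviour of the directory listing block concrete, here is the same logic written as a standalone function and run against a scratch directory. This is a sketch for illustration only: the function name org_directory_listing and the file names in the demonstration are hypothetical.

```python
import tempfile
from pathlib import Path


def org_directory_listing(directory='.'):
    """Return one Org list item per visible entry in `directory`.

    Hypothetical function name; the logic follows the template's block.
    """
    link_format = '- [[file:{0}][={0}=]]'.format
    lines = []
    for p in sorted(Path(directory).iterdir()):
        # Skip dotfiles and Emacs auto-save files such as #index.org#.
        if p.name.startswith(('.', '#')):
            continue
        # A trailing slash marks directories, as in the template.
        lines.append(link_format(p.name + ('/' if p.is_dir() else '')))
    return '\n'.join(lines)


# Demonstration with made-up files in a scratch directory.
with tempfile.TemporaryDirectory() as scratch:
    root = Path(scratch)
    (root / 'index.org').write_text('')
    (root / '.hidden').write_text('')
    (root / '#index.org#').write_text('')
    (root / 'data').mkdir()
    listing = org_directory_listing(root)

print(listing)
```

Because the template's block uses :results output raw, Org Mode inserts the printed lines directly into the document, where they are treated as an ordinary list of file links.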

I have hosted an example with some toy experiments at http://ashgillman.github.io/experiments/. The source code for one such experiment can be seen here, and its rendered output, here. Great!

The index file at http://ashgillman.github.io/experiments/ is generated using gen_index.py. Let's have a look at its source:

#!/usr/bin/env python3

from pathlib import Path
from datetime import datetime

html_format = """<body>
<h1>Private Repository of Ashley Gillman</h1>
{}
<p><i>Generated {}</i></p>
</body>
""".format
site = '.'
doc_links = ['*.pdf']
link_format = '<p><a href="./{0}">{0}</a></p>'.format

hard_links = '<p><a href="/" onclick="javascript:event.target.port=8888;event.target.protocol=\'https:\'">iPython Notebook</a></p>'
subdir_links = '\n'.join(sorted([link_format(d.name)
                                 for d in Path(site).iterdir()
                                 if d.is_dir()]))
file_links = '\n'.join(sorted([link_format(f.name)
                               for pattern in doc_links
                               for f in Path(site).glob(pattern)]))

html = html_format( '\n'.join([hard_links, subdir_links, file_links]),
                   datetime.now().strftime('%d %b, %Y'))

with open(str(Path(site, 'index.html')), 'w+') as f:
    f.write(html)

This is just a very simple script to make a very simple index. You mightn't even want to use it, opting instead for something like Apache's default indexing.
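
One thing the script above does not do is skip hidden directories (such as .git) or escape unusual names. The following is a sketch of how the subdirectory links could handle both. The function name subdir_links is hypothetical and this is not what the hosted example uses.

```python
import html
import tempfile
from pathlib import Path
from urllib.parse import quote


def subdir_links(site='.'):
    """Return HTML links to the visible subdirectories of `site`, sorted by name."""
    link_format = '<p><a href="./{0}">{1}</a></p>'.format
    names = sorted(d.name for d in Path(site).iterdir()
                   if d.is_dir() and not d.name.startswith('.'))
    # Percent-encode the name for the URL and HTML-escape it for the link text.
    return [link_format(quote(name), html.escape(name)) for name in names]


# Demonstration with made-up directories in a scratch folder.
with tempfile.TemporaryDirectory() as scratch:
    root = Path(scratch)
    (root / '2016-03-13-random-exploration').mkdir()
    (root / '.git').mkdir()
    (root / 'a & b').mkdir()
    (root / 'notes.pdf').write_text('')
    links = subdir_links(root)

print('\n'.join(links))
```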

Citations

Using ox-bibtex.el, it is also possible to include citations when exporting to HTML, just as you would when exporting to PDF, using TeX markup. ox-bibtex is already loaded for us by the Local Variables section of template.org. The bibliography is included by simply using:

#+BIBLIOGRAPHY: bibfilename stylename

and citations are inserted using \cite{}. See the source code for this blog for examples.

Hosting with Docker

Docker is a virtualisation tool, allowing you to run a service as if it were running on a virtual machine, without the overhead of an actual virtual machine. Just as importantly, Docker has access to the Docker Hub, which lets you very quickly fire up containers to run common services. I have found Docker to be the simplest way to launch the server. Once Docker has been installed, the Apache HTTP daemon can be launched (and configured to relaunch on restart) using one command:

docker run --name private-server \
  -v /home/ash/experiments:/usr/local/apache2/htdocs -p 80:80 \
  --restart=always -d httpd

This starts up a container named private-server, running an Apache HTTP server that serves the experiments folder on port 80, the default HTTP port. The container will also try to restart itself if it errors, or if you restart your computer.
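
If you would rather script the launch, for example to serve on a different host port, the same command can be assembled in Python. This is only a sketch: the function name httpd_run_command and its parameters are hypothetical, and it uses only the docker flags already shown above.

```python
import shlex


def httpd_run_command(site_dir, name='private-server', port=80):
    """Build the argument list for the `docker run` command shown above."""
    return [
        'docker', 'run', '--name', name,
        # Mount the experiments folder as Apache's document root.
        '-v', '{}:/usr/local/apache2/htdocs'.format(site_dir),
        # Map the chosen host port to port 80 inside the container.
        '-p', '{}:80'.format(port),
        '--restart=always', '-d', 'httpd',
    ]


command = httpd_run_command('/home/ash/experiments')
command_line = ' '.join(shlex.quote(arg) for arg in command)
print(command_line)
```

Passing the list to subprocess.run(command, check=True) would then start the container, assuming Docker is installed and on the PATH.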

Adding an IPython Jupyter Notebook

I sometimes find it more convenient to work from an IPython Notebook than from within Org Mode; for example, I find it a bit easier to debug and tune Matplotlib plots. You can also very easily host one of these using Docker.

docker run --name ipython-server -d -p 8888:8888 \
  -v /home/ash:/home/ash -v /home/ash/notebooks:/notebooks \
  --restart=always ipython/scipyserver

This image includes the SciPy stack (SciPy, NumPy, etc.). I actually use a slightly different image, with a few extra packages installed.

docker run --name ipython-server -d -p 8888:8888 \
  -v /home/ash:/home/ash -v /home/ash/notebooks:/notebooks \
  --restart=always gil2a4/mipython

You may also have noticed that gen_index.py includes a hard-coded link to port 8888. This makes it a little easier to access the notebook server. The Jupyter notebook will only be accessible through HTTPS, and you will have to click through a warning that the certificate is invalid. Otherwise, it works perfectly.
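
The onclick handler in that link rewrites the port and protocol of the link target when it is clicked, keeping the same host. The effect can be sketched in Python as follows. The function name notebook_url and the host name my-computer are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def notebook_url(page_url, port=8888):
    """Return the URL the "iPython Notebook" link ends up pointing at.

    Mirrors the onclick handler: same host, path "/", port 8888, HTTPS.
    """
    host = urlsplit(page_url).hostname
    return urlunsplit(('https', '{}:{}'.format(host, port), '/', '', ''))


print(notebook_url('http://my-computer/'))
```

This is why the link keeps working whichever name or address is used to reach the index page.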

Conclusion

Included here is a rough outline of how I have my environment set up to be able to document and record my experiments, and provide some formality in their structure. Although still not perfect, I find this approach to have a nice balance between structure and flexibility, providing scaffolding to test things quickly.

If you require more information, you may be able to find it by checking through some of the org source code I have available. Useful links include:

The example version of this approach: https://github.com/ashgillman/experiments
This blog's source: https://github.com/ashgillman/ashgillman.github.io/tree/master/_posts
My ~/.emacs.d folder: https://github.com/ashgillman/dotfiles/tree/master/emacs.d

References

[1] C. Dominik, The Org Manual. Network Theory Ltd., 8.3.4 ed., 2016.
[2] E. Schulte and D. Davison, “Active documents with org-mode,” Computing in Science & Engineering, vol. 13, no. 3, pp. 66-73, 2011.
[3] M. Delescluse, R. Franconville, S. Joucla, T. Lieury, and C. Pouzat, “Making neurophysiological data analysis reproducible: Why and how?” Journal of Physiology-Paris, vol. 106, no. 3, pp. 159-170, 2012.
[4] L. Stanisic, A. Legrand, and V. Danjean, “An effective git and org-mode based workflow for reproducible research,” ACM SIGOPS Operating Systems Review, vol. 49, no. 1, pp. 61-70, 2015.
[5] E. Dolstra and A. Hemel, “Purely functional system configuration management,” in HotOS, 2007.
