Using Literate Programming in Research

Literate Programming by Donald Knuth

Over my career in IT there have been a lot of changes in documentation practices, from heavily detailed design up front, to lean [1], and now to the adoption of literate programming, particularly in research (and somewhat confined to it because of the reliance on \LaTeX as a markup language [2]). While there are plenty of getting-started guides out there, this post is primarily about why I’m adopting it for my new Science and Innovations department and the benefits that literate programming can give.

Firstly, what is literate programming? Simply put, the program is described in natural language, with the executable code embedded within the document. The idea was first proposed by Donald Knuth [3]. Either the code itself or just its results can appear in the finished document, depending on what is needed. Without further typesetting, the output is a human-readable document [4] that can be distributed around the lab or business.

When I first started coding [5] I worked out my application in comments. This let me see the obvious logical blocks to turn into functions, and where I might need extra inputs that were not obvious from the initial problem I was trying to solve. I’d then fill in the code between my comments. This made my code readable and the logical decisions obvious, and applying the same technique to all new features meant I had consistent documentation whenever I needed to revisit any code. That is all absolutely fine when you’re working by yourself.

As I started my professional career and was exposed to waterfall process design documentation and change control, I saw the pain that came from detailed design by committee [6]: endless meetings and relatively little coding. Design and creation were so separated that the documentation fell out of sync with the code, and changing the requirements led to an equally protracted change control process.

Then Agile development and all the related offshoots became popular, and these made sense: get lots of customer feedback and adapt to the changing requirements without the overhead of lengthy process. I remember those transition times; some companies adapted easily, others less so. For some it was a case of internal agile, while still needing to create large amounts of project documentation in a PRINCE2 format and seemingly endless Gantt charts to give enterprise customers that warm fuzzy feeling that the project was running to time. In those cases the sprints were simply an excuse for the business to force an inappropriate workload onto the team. In the wake of agile, lean development became the choice, eliminating “wasteful” tasks. For many businesses, lean equated to no documentation whatsoever, and large amounts of code were written without any record of the logic or rationale, just to get product out quickly.

Right now, I think most companies have found a version of agile that suits them: rather than following a rigid protocol, they’ve taken what works and improved it with each iteration [7]. However, the documentation side is still a bit of a mire.

I was introduced to literate programming by one of my new science researchers. One of the great benefits of running a business department like an academic lab is the wonderful collaboration and sharing of ideas that comes naturally out of an energy-fuelled environment. While the initial example was for R and \LaTeX [8], we quickly found a Python equivalent: Pweave. The principles of literate programming are simple: combine your code and notes into a single beautiful document. I can see issues for production code, and obviously there are limitations as this is directed firmly toward data science problems, but this way of doing things seems very natural to me. Write up the problem in the way you want to present your logic and results, then embed your code within the document. Your source will contain all of the logic and comments required to understand what you have done (and, most importantly, why), and the output from running your code will be embedded in your document at the appropriate place.
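
To make the weaving idea concrete, here is a minimal sketch in plain Python. It is a toy illustration only, not how Pweave or Sweave are actually implemented: the `weave` function, the `echo` flag and the sample `DOCUMENT` are all hypothetical names of my own, and I am assuming a noweb-style layout where a chunk starts with a `<<>>=` line and ends with an `@` line. It runs each chunk in order, shares variables between chunks, and splices whatever the code prints back into the prose.

```python
import contextlib
import io
import re

# Assumed noweb-style chunk: a line "<<>>=" opens it and a line "@" closes it.
CHUNK = re.compile(r"^<<>>=\n(.*?)^@\n", re.DOTALL | re.MULTILINE)


def weave(source: str, echo: bool = True) -> str:
    """Run each code chunk and splice its printed output into the prose."""
    namespace = {}  # shared across chunks, so later chunks see earlier names

    def run_chunk(match):
        code = match.group(1)
        buffer = io.StringIO()
        # Capture anything the chunk prints instead of sending it to the console.
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
        shown = code if echo else ""  # echo=False keeps only the results
        return shown + buffer.getvalue()

    return CHUNK.sub(run_chunk, source)


# Hypothetical literate source: prose with two embedded chunks.
DOCUMENT = """We measured three reaction times (ms).
<<>>=
times = [412, 388, 430]
@
Their mean is:
<<>>=
print(sum(times) / len(times))
@
"""

woven = weave(DOCUMENT, echo=False)
print(woven)
```

With `echo=False` only the results appear in the woven text; with `echo=True` the code is shown alongside them, which is the "code or just the results" choice described earlier. A real tool adds typesetting, figures and chunk options on top of this, so treat the sketch as a picture of the principle and nothing more.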

One of the biggest reasons that I am asking my department to adopt this is that it enables reproducible research, not just when individuals in the team come to publish their results, but mainly for our day-to-day development. Everything is checked into source control, so if someone is absent, anyone else in the team can pick up their work without a handover, which can sometimes be critical. I’m very much excited by the rigour that using this process will bring to my new department, and the benefits that will follow from it.

Besides, it’s a great excuse to unearth my 20-year-old \LaTeX books 🙂

  1. That is, no documentation in most cases 🙂
  2. Which is predominantly used by the academic community and those who have come from it.
  3. If you don’t know who he is and work in computer science, you really should find out.
  4. And also as pretty as you want to make it.
  5. Well, for my PhD anyway. I’ll ignore all the stuff that came before, as my coding style was not appropriate!
  6. I’m sure that there are better ways to do this, but this is my experience.
  7. Which fundamentally is what the retrospective part of agile proposes 🙂
  8. Sweave: http://users.stat.umn.edu/~geyer/Sweave/

Published by

janet

Dr Janet is a Molecular Biochemistry graduate from Oxford University with a doctorate in Computational Neuroscience from Sussex. I’m currently studying for a third degree in Mathematics with the Open University. During the day, and sometimes out of hours, I work as a Chief Science Officer. You can read all about that on my LinkedIn page.