Monday 24 February 2014

Ten simple rules for effective computational research

With the increasing use of computational techniques in the life sciences, a wide range of scientists are finding that software development plays an increasing role in their day-to-day research. In this post Dr Alexander Fletcher discusses a forthcoming article on best practice when developing and using software for scientific research.


Introduction


To try to understand the dazzling complexity that is inherent in nature, mathematical and computational techniques are being used more and more widely in the life sciences. Computer programming is becoming unavoidable for scientists who want to stay up-to-date with the latest quantitative developments in their field. Yet the degree of practical training on this topic remains low in many degree courses. While a number of guides to software development exist, they are often aimed at computer scientists or concentrate on large open source projects. There is thus a need for guidance in this area aimed specifically at the vast majority of scientific researchers: those without formal training in computer science.

To address this issue, I and other postdoctoral Research Fellows associated with the ‘2020 Science’ project, funded jointly by the EPSRC and Microsoft Research Ltd, have come up with a jargon-free guide to best practice when developing and using software for scientific research. This guide [1], consisting of "ten simple rules", is due to appear in a forthcoming issue of PLoS Computational Biology. It is based on our collective experience of the challenges associated with carrying out multidisciplinary computation-based science. These "rules" are shown in more or less the order in which they should be considered, starting from before a computational scientist should even think about writing any code.

A brief summary of our "rules" is provided below. The manuscript has not been uploaded to a preprint server (not because we disagree with Jacob Scott's opinion on the issue but, I am afraid, because no-one has got around to it), though a version has been self-archived by Kit Yates and can be found here. There are many other useful guides in the "Ten simple rules" series worth checking out [2, 3, 4, 5], as well as a recent PLoS Biol. community page on best practices for scientific computing [6].

Rule 1: Look before you leap


Before developing new code, it is worth performing a survey of what existing software toolboxes and libraries are out there that might be able to tackle your chosen problem. Software repositories such as GitHub and SourceForge are a good place to begin (as are your colleagues in the field).

Rule 2: Develop a prototype first


Before writing any code it is essential to be clear about what you are trying to implement: what functionality do you require, and what interfaces do you need? Whether extending existing code or starting from scratch, it is helpful to begin by considering a prototype (a simplified version of the full system or algorithm) to guide the next steps in code development. From a practical point of view it is often easier to prototype methods in a ‘higher level’ language such as MATLAB or R; the straightforward nature and built-in functionality in these languages can mean that you spend less time expressing your ideas in code and searching for bugs.


Rule 3: Make your code understandable by others (and yourself)


The absence of documentation and comments can cause a lot of grief when revising or adapting existing code! Such documentation makes your code more understandable not only to others but also to your future self (put simply, the code tells you ‘how’, the comments tell you ‘why’). The program code itself can be made more understandable by using meaningful variable names and formatting the code consistently. While commenting and documentation are often neglected when deadlines loom, developing and maintaining a standardised way of commenting your code will be of great benefit.
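As a small illustration of ‘how’ versus ‘why’, here is a minimal sketch in Python. The function, its name and its parameters are all hypothetical, invented for this example rather than taken from our article; the point is only that the names say what each quantity is and the comment explains a decision the code alone cannot.

```python
def logistic_growth_step(population, growth_rate, carrying_capacity, time_step):
    """Advance a logistic growth model by one forward Euler step.

    All names here are illustrative assumptions, not from the article.
    Returns the population after one step of length time_step.
    """
    growth = growth_rate * population * (1.0 - population / carrying_capacity)
    # Why the clamp: a large time step can overshoot below zero, and a
    # negative population is meaningless, so we floor the result at zero.
    return max(population + time_step * growth, 0.0)


if __name__ == "__main__":
    print(logistic_growth_step(10.0, 0.5, 100.0, 0.1))
```

Compare this with a version that calls the arguments a, b, c and d and says nothing about the clamp: the arithmetic is identical, but six months later you would have to rediscover what it means.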


Rule 4: Don’t underestimate the complexity of your task 


When developing your code you should keep a record of your work. This could be in the form of a ‘log-book’ file or a paper-based note-book where you store commonly used commands and other notes; another good option is an online tool such as Evernote. You will often find that you have to choose between spending a long time doing a task by hand and possibly spending longer learning how to automate it. A good rule to follow is “the rule of three”: once you have had to do the same thing by hand twice already, automate it the third time.

Rule 5: Understand the mathematical, numerical and computational methods underpinning your work


When solving any computational model, you should always ensure that you are using the appropriate numerical method for your problem, and that any constraints and conditions are satisfied. A basic understanding of numerical analysis and in particular the concepts of rate of convergence, order, and stability of numerical methods will pay dividends. By fully understanding the mathematical and numerical methods being used you can be confident that your results reflect the true behaviour of the underlying model and are not numerical or computational artefacts.
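One practical way to check that a numerical method is behaving as the theory says is to measure its order of convergence. The sketch below is my own illustration, not something from the article: it applies the forward Euler method to the test problem dy/dt = -y with y(0) = 1, whose exact solution at t = 1 is exp(-1), and compares the error at two step sizes. Forward Euler is first order, so halving the step should roughly halve the error, giving an observed order close to 1. The function names are assumptions chosen for the example.

```python
import math


def euler_decay(decay_rate, initial_value, end_time, num_steps):
    """Solve dy/dt = -decay_rate * y with forward Euler; return y(end_time)."""
    time_step = end_time / num_steps
    value = initial_value
    for _ in range(num_steps):
        value += time_step * (-decay_rate * value)
    return value


def observed_order(num_steps):
    """Estimate the convergence order by halving the time step once."""
    exact = math.exp(-1.0)  # exact solution of dy/dt = -y, y(0) = 1, at t = 1
    error_coarse = abs(euler_decay(1.0, 1.0, 1.0, num_steps) - exact)
    error_fine = abs(euler_decay(1.0, 1.0, 1.0, 2 * num_steps) - exact)
    # For a method of order p, error ~ C * time_step**p, so the ratio of
    # errors at steps h and h/2 is about 2**p.
    return math.log2(error_coarse / error_fine)


if __name__ == "__main__":
    print(f"observed order: {observed_order(100):.3f}")
```

If the observed order came out far from the expected value, that would suggest a bug in the implementation or a step size outside the range where the method is well behaved, which is exactly the kind of artefact this rule is meant to catch.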

Rule 6: Use pictures: they really are worth a thousand words


Visualisation and graphics are fundamental to developing understanding and testing hypotheses, and are indispensable for verifying and validating computational methods. It is worthwhile spending time developing the visual components of your work. Learn, develop and use visualisation software and tools to ensure that you understand your research outputs and can effectively communicate your findings. You may well need to develop novel visualisations for your work, but keep the basic figures. You needed them to understand your results, model, and implementation and so will anyone else.

Rule 7: Version control everything


Version control systems such as Subversion and Git offer an easy way to store and back up not only the current version of your code but also every previous version (in what’s known as a repository). This not only saves you from having to keep multiple copies of the same file but also allows you to ‘roll back’ to an older ‘working’ version of the code if things go wrong. Version control systems also allow you to share material between multiple machines, operating systems and, more importantly, users in a simple and robust manner. Cloud storage services such as Dropbox and SkyDrive offer basic file sharing and backup facilities, but don’t offer the code management features of true version control systems.

Rule 8: Test everything


Any non-trivial computer program will have bugs when first written, often subtle ones that are hard to detect, which may lead to incorrect results. Simple tests that check the software's behaviour against expectations are essential for ensuring robust results, minimising the presence of bugs and gaining confidence in your code (for you and others). As a result of the time pressures inherent in academia, software testing is often performed manually in an ad hoc manner, to determine whether results ‘look roughly right’. However, a systematic approach to testing can pay dividends.
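To give a flavour of what a systematic approach might look like, here is a minimal sketch in Python. The gc_content function and its tests are hypothetical examples of my own, not code from the article: each test checks one expectation (known answers, insensitivity to letter case, rejection of bad input) using plain assert statements, so the checks can be rerun every time the code changes.

```python
def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence (hypothetical example)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    sequence = sequence.upper()
    gc_count = sequence.count("G") + sequence.count("C")
    return gc_count / len(sequence)


def test_known_values():
    # Cases small enough to verify by hand.
    assert gc_content("GGCC") == 1.0
    assert gc_content("ATAT") == 0.0
    assert gc_content("ATGC") == 0.5


def test_case_insensitive():
    assert gc_content("atgc") == gc_content("ATGC")


def test_empty_sequence_rejected():
    # An empty sequence has no meaningful GC fraction, so we expect an error.
    try:
        gc_content("")
    except ValueError:
        return
    raise AssertionError("expected ValueError for an empty sequence")


if __name__ == "__main__":
    test_known_values()
    test_case_insensitive()
    test_empty_sequence_rejected()
    print("all tests passed")
```

Functions named test_... in this style can also be collected automatically by a test runner such as pytest, which saves you from maintaining the list of calls at the bottom by hand.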

Rule 9: Share everything


If an important part of your research involves developing new software tools and/or collecting new data, then you should consider sharing these. Based on our collective experience, we advocate an open approach of sharing source code, data and results as freely as possible. You should ask yourself, ‘why not share?’ If the answer is that, ‘I am worried that people would find mistakes in it’ then, as a scientist, this should be the strongest argument in favour of sharing it!

Rule 10: Keep going!


Our advice arises from our collective experience, and we continue to strive to obey these rules in our work. Scientists have a wide variety of demands (researching, writing papers, teaching, applying for grants, admin) and have to make the most of limited resources. Becoming more technically effective can seem daunting without strategies for making progress and keeping motivated. So, prioritise in a way that suits you and your projects and career aspirations. One strategy is to implement another of these rules each time you start a new project, to build a growing repertoire, rather than trying to do everything at once. Take every opportunity to teach and help others to do what you have learnt. 



References


  1. J.M. Osborne, M.O. Bernabeu, M. Bruna, B. Calderhead, J. Cooper, N. Dalchau, S.-J. Dunn, A.G. Fletcher, R. Freeman, D. Groen, B. Knapp, G.J. McInerny, G.R. Mirams, J. Pitt-Francis, B. Sengupta, D.W. Wright, C.A. Yates, D.J. Gavaghan, S. Emmott, C. Deane (2013). Ten simple rules for effective computational research. PLoS Comput. Biol. In press.
  2. W. Zhang (2014). Ten simple rules for writing research papers. PLoS Comput. Biol. 10(1): e1003453. doi:10.1371/journal.pcbi.1003453
  3. G.K. Sandve, A. Nekrutenko, J. Taylor, E. Hovig (2013). Ten simple rules for reproducible computational research. PLoS Comput. Biol. 9(10): e1003285. doi:10.1371/journal.pcbi.1003285
  4. A. Prlić, J.B. Procter (2012). Ten simple rules for the open development of scientific software. PLoS Comput. Biol. 8(12): e1002802. doi:10.1371/journal.pcbi.1002802
  5. P.E. Bourne (2011). Ten simple rules for getting ahead as a computational biologist in academia. PLoS Comput. Biol. 7(1): e1002001. doi:10.1371/journal.pcbi.100200
  6. G. Wilson, D.A. Aruliah, C.T. Brown, N.P. Chue Hong, M. Davis, R.T. Guy, S.H.D. Haddock, K.D. Huff, I.M. Mitchell, M.D. Plumbley, B. Waugh, E.P. White, P. Wilson (2014). Best practices for scientific computing. PLoS Biol. 12(1): e1001745. doi:10.1371/journal.pbio.1001745

4 comments:

  1. Many thanks for sharing! I agree with all of the points you raise, especially with understanding the methods, version control and testing all the components. I would add an eleventh point, namely modularise your code. Anticipating what questions one might want to ask of the model in the future and making everything configurable and scriptable accordingly saves time later. It also avoids unmaintainable or unreadable code.

    Replies
    1. Thanks Uli! I completely agree and think working in a modular fashion can pay dividends both in terms of working efficiency and testing.

  2. Nice post Alex, you guys have been talking with the authors of this?
    http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001745

    Replies
    1. Thanks David! I cite that paper in the blog post (reference 6) and think it's a useful guide.
