The source of this document is available on gitlab.
Last version: 2020-03-23

Additional references

"Thoughts" on language/software stability

As we explained, the programming language used in an analysis has a clear influence on its reproducibility. This is not a characteristic of the language itself but rather a consequence of the development philosophy of the underlying community. For example, C is a very stable language with a very clear specification designed by a committee (even though some compilers may not fully respect this standard).

On the other end of the spectrum, Python had a much more organic development, based on a readability philosophy and valuing continuous improvement over backwards compatibility. Furthermore, Python is commonly used as a wrapping language (e.g., to easily use C or FORTRAN libraries) and has its own packaging system. All these design choices tend to make reproducibility a bit painful with Python, even though the community is slowly taking this into account. The transition from Python 2 to the not fully backwards-compatible Python 3 has been a particularly painful process, not least because the two languages are so similar that it is not always easy to figure out whether a given script or module is written in Python 2 or Python 3. It isn't even rare to see Python scripts that work under both Python 2 and Python 3 but produce different results due to the change in the behavior of integer division.
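The integer division issue can be illustrated with a minimal sketch. The function names below are hypothetical and chosen only for this illustration; the behavior described in the comments is that of the `/` and `//` operators on integers.

```python
def mean_of_two(a, b):
    # Python 3: "/" is true division, so two integers can yield a float.
    # Python 2: "/" on two integers is floor division, so the same line
    # silently returns a truncated integer.
    return (a + b) / 2


def floor_mean_of_two(a, b):
    # "//" is floor division in both Python 2 and Python 3, so this
    # version states the intent explicitly and behaves identically.
    return (a + b) // 2


print(mean_of_two(1, 2))        # 1.5 under Python 3 (1 under Python 2)
print(floor_mean_of_two(1, 2))  # 1 under both versions
```

A script that has to run under both versions can start with `from __future__ import division`, which gives `/` the Python 3 meaning under Python 2 as well.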

R, in comparison, is much closer (in terms of developer community) to languages like SAS, which is heavily used in the pharmaceutical industry, where statistical procedures need to be standardized and rock solid. R is obviously not immune to evolutions that break old code and hinder reproducibility and backward compatibility. Here is a relatively recent true story about this, and some colleagues who worked on the introductory statistics course with R on FUN reported to us several issues with a few functions (plotmeans from gplots, survfit from survival, or hclust) whose default parameters had changed over the years. It is thus probably good practice to give explicit values for all parameters (which can be cumbersome) instead of relying on default values, and to restrict your dependencies as much as possible.
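The advice about explicit parameter values applies beyond R. As a hedged illustration in Python (not the R functions mentioned above), the sketch below uses `statistics.quantiles` from the standard library (Python 3.8 or later), whose `method` argument changes the result. The data values are made up for the example.

```python
import statistics

data = [1, 2, 3, 4, 5]  # illustrative values only

# Relies on the library defaults (currently n=4, method="exclusive").
implicit = statistics.quantiles(data)

# States every choice explicitly, so the call keeps its meaning even if
# a future release were to change a default.
explicit = statistics.quantiles(data, n=4, method="exclusive")

# A different method gives different quartiles for the same data.
inclusive = statistics.quantiles(data, n=4, method="inclusive")

print(implicit == explicit)   # True as long as the defaults are unchanged
print(explicit != inclusive)  # True: the method matters
```

The explicit call is more verbose, but a reader (or a future version of the library) no longer has to guess which variant of the computation was intended.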

This being said, the R development community is generally quite careful about stability. We (the authors of this MOOC) believe that open source (which allows one to inspect how a computation is done and to identify both mistakes and sources of non-reproducibility) is more important than the rock-solid stability of SAS, which is proprietary software.

Yet, if you really need to stay with SAS, you should know that SAS can be used within Jupyter using the Python SASPy and the Python SASKernel packages (step by step explanations about this are given here). Using such a literate programming approach, combined with systematic version and environment control, will always help. Similar solutions exist for many languages (list of Jupyter kernels).

Controlling your software environment

As we mentioned in the video sequences, there are several solutions to control your environment:

It may be hard to understand the difference between these different approaches and decide which one is better in your context.
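Whichever approach you choose, a modest first step is to record what your environment actually contains. The following is a minimal sketch, assuming a Python-based analysis; the function name `describe_environment` is hypothetical, and the result is only a description of the environment, not a way to rebuild it.

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment():
    """Return the interpreter version, the platform, and installed packages."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip installations whose metadata lacks a name
            packages[name] = dist.version
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Print the description as JSON so it can be stored next to the results.
    print(json.dumps(describe_environment(), indent=2))
```

Storing this output alongside your results does not replace the tools discussed here, but it makes later differences between environments much easier to diagnose.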

Here is a webinar where some of these tools are demoed in a reproducible research context: Controling your environment (by Michael Mercier and Cristian Ruiz)

You may also want to have a look at the Popper conventions (webinar by Ivo Jimenez through Google Hangout) or at the presentation of Konrad Hinsen on Active Papers (http://www.activepapers.org/).

Preservation/Archiving

Ensuring software is properly archived, i.e., safely stored so that it remains accessible in the long term, can be quite tricky. If you have never seen Roberto Di Cosmo present the Software Heritage project, it is a must-see: https://www.softwareheritage.org/

If you want to archive your own code via Software Heritage, there are two ways to proceed:

  1. Put your code into a public repository at http://github.com, http://gitlab.com, or http://bitbucket.org, then simply wait for Software Heritage to pick it up. The only downside is the delay (up to several months) before you can be sure that your code is archived and before you can cite it via a SWH identifier.

  2. Put your code into a public repository managed using Git, Mercurial, or Subversion. Enter its URL at https://archive.softwareheritage.org/save/. The delay is typically a few hours, and you can watch the status of your request at any time.
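The SWH identifier of a single file can be computed locally, because it is derived from the file content alone. Here is a minimal sketch, assuming the identifier scheme for content objects (`swh:1:cnt:` followed by the same SHA-1 that Git computes for a blob); the function name is hypothetical, and identifiers for directories, revisions, or releases are computed differently and are not covered here.

```python
import hashlib


def swhid_for_content(data: bytes) -> str:
    """Return the SWH identifier of a file content given as bytes."""
    # Git hashes a blob as: "blob <size in bytes>\0" followed by the content.
    header = b"blob %d\0" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return "swh:1:cnt:" + digest


# The empty content has a well-known Git blob hash.
print(swhid_for_content(b""))
# swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Since the identifier depends only on the bytes of the file, anyone holding a copy can check that it matches the archived version.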

For regular data, we highly recommend using https://www.zenodo.org/ whenever the data is not sensitive.

Workflows

In the video sequences, we mentioned workflow managers (original application domain in parentheses):

You may want to have a look at this webinar: Reproducible Science in Bio-informatics: Current Status, Solutions and Research Opportunities (by Sarah Cohen Boulakia, Yvan Le Bras and Jérôme Chopard), https://github.com/alegrand/RR_webinars/blob/master/6_reproducibility_bioinformatics/index.org

Numerical and statistical issues

We mentioned these topics in our MOOC, but we could in no way cover them properly. Here we only suggest a few interesting talks on the subject.
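As one small, hedged illustration of why numerical issues matter for reproducibility (it is in no way a substitute for the talks), floating-point addition is not associative, so the order in which a program sums its values can change the result. The values below are chosen only for the example.

```python
import math

values = [0.1, 0.2, 0.3]

# The same three numbers, added in two different orders.
left_first = (values[0] + values[1]) + values[2]
right_first = values[0] + (values[1] + values[2])

print(left_first == right_first)  # False: the grouping changes the result

# math.fsum returns the correctly rounded sum, so it does not depend
# on the order of the inputs.
print(math.fsum(values) == math.fsum(reversed(values)))  # True
```

A parallel or vectorized implementation may reorder such sums, which is one way two runs of the "same" computation can end up with slightly different numbers.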

Publication practices

You may want to have a look at the following two webinars:

Experimentation

Experimentation was not covered in this MOOC, although it is an essential part of science. The main reason is that practices and constraints vary so widely from one domain to another that the topic could not be properly covered in a first edition. We would be happy to gather references you consider interesting in your domain, so do not hesitate to share them on the forum, and we will update this page.