3. User guide

This user guide is meant for researchers who would like to structure their data analyses and run them on the REANA cloud.

3.1. Reusable analyses

Making a research data analysis reproducible basically means providing structured “runnable recipes” addressing (1) where the input data is, (2) what software was used to analyse the data, (3) which computing environments were used to run the software, and (4) which computational steps were taken to run the analysis. This permits instantiating the analysis on a computational cloud and running it to obtain its (5) output results.

3.2. Four questions

REANA helps to make a research analysis reproducible by providing a structure that helps answer the “Four Questions”:

  1. What is your input data?
    • input data files
    • input parameters
    • live database calls
  2. Which code analyses it?
    • custom analysis code
    • analysis frameworks
    • Jupyter notebooks
  3. What is your environment?
    • operating system
    • software packages and libraries
    • CPU and memory resources
  4. Which steps did you take?
    • simple shell commands
    • complex computational workflows
    • local and/or remote task execution

Let us see, step by step, how we could go about making an analysis reproducible and running it on the REANA platform.

3.3. Structure your analysis

It is advised to structure your research data analysis sources to clearly declare and separate your analysis inputs, code, and outputs. A simple hypothetical example:

$ find .
data/mydata.csv
code/mycode.py
docs/mynotes.txt
results/myplot.png

Note how we put the input data file in the data directory, the runtime code that analyses it in the code directory, the documentation in the docs directory, and the produced output plots in the results directory.

Note that this structure is fully optional and you can use any you prefer, or simply store everything in the same working directory. You can also take inspiration by looking at several real-life examples in the Examples section of the documentation.

3.4. Capture your workflows

Now that we have structured our analysis data and code, we have to provide a recipe for how to produce the final plots.

Simple analyses

Let us assume that our analysis is run in two stages, firstly a data filtering stage and secondly a data plotting stage. A hypothetical example:

$ python ./code/mycode.py \
    < ./data/mydata.csv > ./workspace/mydata.tmp
$ python ./code/mycode.py --plot myparameter=myvalue \
    < ./workspace/mydata.tmp > ./results/myplot.png

Note how we call a given sequence of commands to produce our desired output plots. In order to capture this sequence of commands in a “runnable” or “actionable” manner, we can write a short shell script run.sh and make it parametrisable:

$ ./run.sh --myparameter myvalue
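
Such a run.sh script might look, for instance, as follows (a minimal sketch; the exact argument handling and the default parameter value are hypothetical):

#!/bin/sh
# run.sh -- hypothetical wrapper around the two-stage analysis above.
# Default parameter value; can be overridden via --myparameter:
myparameter=myvalue
if [ "$1" = "--myparameter" ]; then
    myparameter="$2"
fi
# Make sure the working directories exist:
mkdir -p workspace results
# Stage 1: filter the input data:
python ./code/mycode.py \
    < ./data/mydata.csv > ./workspace/mydata.tmp
# Stage 2: produce the final plot:
python ./code/mycode.py --plot myparameter="$myparameter" \
    < ./workspace/mydata.tmp > ./results/myplot.png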

In this case you will want to use the Serial workflow engine of REANA. The engine permits expressing the workflow as a sequence of commands:

    START
     |
     |
     V
+--------+
| filter |  <-- mydata.csv
+--------+
     |
     | mydata.tmp
     |
     V
+--------+
|  plot  |  <-- myparameter=myvalue
+--------+
     |
     | myplot.png
     V
    STOP

Note that you can run different commands in different computing environments, but they must be run in a linear sequential manner.
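
For instance, with the Serial engine the above two-stage pipeline could be declared directly in the workflow section of the reana.yaml file discussed later in this guide. A sketch, assuming the steps run in a stock python:2.7 container (environments are covered in the containerisation section below):

workflow:
  type: serial
  specification:
    steps:
      # Step 1: filter the input data:
      - environment: 'python:2.7'
        commands:
          - python ./code/mycode.py < ./data/mydata.csv > mydata.tmp
      # Step 2: produce the final plot:
      - environment: 'python:2.7'
        commands:
          - mkdir -p results && python ./code/mycode.py --plot myparameter=myvalue < mydata.tmp > results/myplot.png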

The sequential workflow pattern will usually cover only simple computational workflow needs.

Complex analyses

For advanced workflow needs we may want to run certain commands in parallel, in a map-reduce fashion. There are many workflow systems dedicated to expressing complex computational schemata in a structured manner. REANA supports several of them, such as CWL and Yadage.

These workflow systems express the computational steps in the form of a Directed Acyclic Graph (DAG), permitting advanced computational scenarios.

              START
               |
               |
        +------+----------+
       /       |           \
      /        V            \
+--------+  +--------+  +--------+
| filter |  | filter |  | filter |   <-- mydata
+--------+  +--------+  +--------+
        \       |       /
         \      |      /
          \     |     /
           \    |    /
            \   |   /
             \  |  /
              \ | /
            +-------+
            | merge |
            +-------+
                |
                | mydata.tmp
                |
                V
            +--------+
            |  plot  |  <-- myparameter=myvalue
            +--------+
                |
                | myplot.png
                V
               STOP

Let us pick, for example, the CWL standard to express our computational steps. We store the workflow specification in the workflow directory:

$ find workflow
workflow/myinput.yaml
workflow/myworkflow.cwl
workflow/step-filter.cwl
workflow/step-plot.cwl
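
As an illustration, the filtering step might be expressed as a CWL CommandLineTool along the following lines (a hypothetical sketch, not a complete specification; the input and output names are invented here):

# workflow/step-filter.cwl -- hypothetical sketch of the filtering step
cwlVersion: v1.0
class: CommandLineTool
baseCommand: python
inputs:
  mycode:
    type: File
    inputBinding:
      position: 1
  mydata:
    type: File
stdin: $(inputs.mydata.path)
stdout: mydata.tmp
outputs:
  filtered:
    type: stdout

The top-level myworkflow.cwl then wires the output of the filtering step into the input of the plotting step.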

You will again be able to take inspiration from some real-life examples later in the Examples section of the documentation.

Picking a workflow engine

For simple needs, the Serial workflow engine is the quickest to start with. For regular needs, CWL or Yadage would be more appropriate.

Note that the level of REANA platform support for a particular workflow engine can differ:

Engine   Parametrised?   Parallel execution?   Caching?
------   -------------   -------------------   --------
CWL      yes             yes                   no (1)
Serial   yes             no                    yes
Yadage   yes             yes                   no (1)

  1. The vanilla workflow system may support the feature, but not when run via the REANA environment.

Develop workflow locally

Now that we have declared our analysis input data and code, and have captured the computational steps in a structured manner, we can check whether our analysis runs in the original computing environment. We can use the helper wrapper script:

$ ./run.sh

or use workflow-specific commands, such as cwltool in the case of CWL workflows:

$ cwltool --quiet --outdir="./results" \
     ./workflow/myworkflow.cwl ./workflow/myinput.yaml

This completes the first step in the parametrisation of our analysis in a reproducible manner.

3.5. Containerise your environment

Now that we have fully described our inputs, our code, and the steps to run the analysis and produce our results, we need to make sure that we run the commands in the same environment. Capturing the environment specifics is essential to ensure reproducibility: for example, the same version of Python and the same set of pre-installed libraries that are needed for our analysis.

The environment is encapsulated by means of “containers” such as Docker or Singularity.

Using an existing environment

Sometimes you can use an already-existing container environment prepared by others; for example, the python:2.7 image packages Python 2.7. In this case you simply specify the container name and version number in your workflow specification and you are good to go.

This is usually the case when your code can be considered “runtime”, for example Python scripts or ROOT macros.
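
With the CWL workflow chosen earlier, for example, a step can name its container image via a DockerRequirement hint (a sketch):

# excerpt from a step definition such as workflow/step-filter.cwl
hints:
  DockerRequirement:
    dockerPull: python:2.7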

Note that the REANA platform offers a set of containers for certain popular environments, such as ROOT with RooFit.

Building your own environment

Other times you may need to build your own container, for example to add a certain library on top of Python 2.7. This is the most typical use case that we’ll address below.

This is usually the case when your code needs to be compiled, for example C++ analysis.

If you need to create your own environment, you can do so by providing a Dockerfile:

$ find environment
environment/myenv/Dockerfile

$ less environment/myenv/Dockerfile
# Start from the Python 2.7 base image:
FROM python:2.7

# Install HFtools:
RUN apt-get -y update && \
    apt-get -y install \
       python-pip \
       zip && \
    apt-get autoremove -y && \
    apt-get clean -y
RUN pip install hftools

# Copy our code into the image:
ADD code /code
WORKDIR /code

You can build this customised analysis environment image and give it some name, for example johndoe/myenv:

$ docker build -f environment/myenv/Dockerfile -t johndoe/myenv .

and push the created image to the DockerHub image registry:

$ docker push johndoe/myenv

Testing the environment

We now have a containerised image representing our computational environment, which we can use to run our analysis in a replicated environment elsewhere.

We should test the containerised environment to ensure it works properly, for example that all the necessary libraries are present:

$ docker run -i -t --rm johndoe/myenv /bin/bash
container> python -V
Python 2.7.15
container> python mycode.py < mydata.csv > /tmp/mydata.tmp

Multiple environments

Note that different steps of the analysis can run in different environments: the data filtering step can run on a big cloud with data selection libraries installed, while the data plotting step can run in a local environment containing only your preferred graphing system. You can prepare several different environments for your analysis if needed, as sketched below.
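
In CWL, for instance, each step definition can carry its own DockerRequirement hint, so the filtering and plotting steps may name different images (a sketch; the image names are hypothetical):

# in workflow/step-filter.cwl
hints:
  DockerRequirement:
    dockerPull: johndoe/myfilterenv

# in workflow/step-plot.cwl
hints:
  DockerRequirement:
    dockerPull: johndoe/myplotenv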

3.6. Write your reana.yaml

We are now ready to tie all the above reproducible elements together. Our analysis example becomes:

$ find .
code/mycode.py
data/mydata.csv
docs/mynotes.txt
environment/myenv/Dockerfile
workflow/myinput.yaml
workflow/myworkflow.cwl
workflow/step-filter.cwl
workflow/step-plot.cwl
results/myplot.png

There is only one thing that remains in order to make it runnable on the REANA cloud: we need to capture the above structure by means of a reana.yaml file:

version: 0.4.0
inputs:
  files:
    - code/mycode.py
    - data/mydata.csv
  parameters:
    myparameter: myvalue
workflow:
  type: cwl
  file: workflow/myworkflow.cwl
outputs:
  files:
    - results/myplot.png

This file is used by REANA to instantiate and run the analysis on the cloud.
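
Once the reana-client command-line utility is installed (see the next section), you can also check the file for syntax errors before running anything:

$ reana-client validate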

3.7. Run your analysis on REANA cloud

We can now install the reana-client command-line utility, configure access to the remote REANA cloud where we shall run the analysis, and launch it as follows:

$ # create new virtual environment
$ virtualenv ~/.virtualenvs/myreana
$ source ~/.virtualenvs/myreana/bin/activate
$ # install REANA client
$ pip install reana-client
$ # connect to some REANA cloud instance
$ export REANA_SERVER_URL=https://reana.cern.ch/
$ export REANA_ACCESS_TOKEN=XXXXXXX
$ # create new workflow
$ reana-client create -n my-analysis
$ export REANA_WORKON=my-analysis
$ # upload input code and data to the workspace
$ reana-client upload ./code ./data
$ # start computational workflow
$ reana-client start
$ # ... should be finished in about a minute
$ reana-client status
$ # list workspace files
$ reana-client list
$ # download output results
$ reana-client download results/myplot.png

We are done! Our output plot should now be located in the results directory.

For more information on how to use reana-client, please see REANA-Client’s Getting started guide.

3.8. Examples

This section lists several research data analysis examples that illustrate how a typical research data analysis can be packaged in a REANA-compatible manner to facilitate its future reuse.

3.9. Next steps

For more information on how to use reana-client, you can explore REANA-Client documentation.