3. User guide

This user guide is meant for researchers who would like to structure their data analyses and run them on the REANA cloud.

3.1. Reusable analyses

Making a research data analysis reproducible means providing structured “runnable recipes” that address (1) where the input data is, (2) what software was used to analyse the data, (3) which computing environments were used to run the software, and (4) which computational steps were taken to run the analysis. This permits instantiating the analysis on the computational cloud and running it to obtain its (5) output results.

3.2. Four questions

REANA helps to make research analyses reproducible by providing a structure for answering the “Four Questions”:

  1. What is your input data?
    • input data files
    • input parameters
    • live database calls
  2. Which code analyses it?
    • custom analysis code
    • analysis frameworks
    • Jupyter notebooks
  3. What is your environment?
    • operating system
    • software packages and libraries
    • CPU and memory resources
  4. Which steps did you take?
    • simple shell commands
    • complex computational workflows
    • local and/or remote task execution

Let us see, step by step, how we could go about making an analysis reproducible and running it on the REANA platform.

3.3. Structure your analysis

It is advisable to structure your research data analysis sources so that they clearly declare and separate your analysis inputs, code, and outputs. A simple hypothetical example:

$ find .
data/mydata.csv
code/mycode.py
docs/mynotes.txt
results/myplot.png

Note how we put the input data file in the data directory, the runtime code that analyses it in the code directory, the documentation in the docs directory, and the produced output plots in the results directory.

Note that this structure is fully optional; you can use any structure you prefer, or simply store everything in the same working directory. You can also take inspiration from the real-life examples in the Examples section of the documentation.

3.4. Capture your workflows

Now that we have structured our analysis data and code, we have to provide a recipe for producing the final plots.

Simple analyses

Let us assume that our analysis is run in two stages, firstly a data filtering stage and secondly a data plotting stage. A hypothetical example:

$ python ./code/mycode.py \
    < ./data/mydata.csv > ./workspace/mydata.tmp
$ python ./code/mycode.py --plot myparameter=myvalue \
    < ./workspace/mydata.tmp > ./results/myplot.png

Note how we call a given sequence of commands to produce our desired output plots. In order to capture this sequence of commands in a “runnable” or “actionable” manner, we can write a short shell script run.sh and make it parametrisable:

$ ./run.sh --myparameter myvalue
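
For illustration, a minimal run.sh could look like the following sketch (the option handling is deliberately simplistic and hypothetical; adapt it to your analysis):

#!/bin/sh
# run.sh -- hypothetical wrapper chaining the two analysis stages.
# Usage: ./run.sh --myparameter myvalue
set -e

myparameter="myvalue"                 # default parameter value
if [ "$1" = "--myparameter" ]; then
    myparameter="$2"
fi

mkdir -p workspace results
python ./code/mycode.py \
    < ./data/mydata.csv > ./workspace/mydata.tmp
python ./code/mycode.py --plot "myparameter=${myparameter}" \
    < ./workspace/mydata.tmp > ./results/myplot.png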

In this case you will want to use the Serial workflow engine of REANA. The engine permits expressing the workflow as a sequence of commands:

    START
     |
     |
     V
+--------+
| filter |  <-- mydata.csv
+--------+
     |
     | mydata.tmp
     |
     V
+--------+
|  plot  |  <-- myparameter=myvalue
+--------+
     |
     | myplot.png
     V
    STOP

Note that you can run different commands in different computing environments, but they must be run in a linear sequential manner.
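
For illustration, such a sequence could be sketched in the Serial workflow specification along the following lines (the container image and the ${myparameter} expansion are illustrative assumptions; the reana.yaml file in which this clause lives is introduced below):

workflow:
  type: serial
  specification:
    steps:
      # Filtering stage: read the CSV input, write filtered data
      - environment: 'python:2.7'
        commands:
        - python ./code/mycode.py < ./data/mydata.csv > mydata.tmp
      # Plotting stage: read the filtered data, produce the plot
      - environment: 'python:2.7'
        commands:
        - python ./code/mycode.py --plot myparameter=${myparameter} < mydata.tmp > results/myplot.png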

The sequential workflow pattern will usually cover only simple computational workflow needs.

Complex analyses

For advanced workflow needs we may want to run certain commands in parallel, in a map-reduce fashion. There are many workflow systems dedicated to expressing complex computational schemata in a structured manner; REANA supports several of them, such as CWL and Yadage.

These workflow systems make it possible to express the computational steps in the form of a Directed Acyclic Graph (DAG), permitting advanced computational scenarios.

              START
               |
               |
        +------+----------+
       /       |           \
      /        V            \
+--------+  +--------+  +--------+
| filter |  | filter |  | filter |   <-- mydata
+--------+  +--------+  +--------+
        \       |       /
         \      |      /
          \     |     /
           \    |    /
            \   |   /
             \  |  /
              \ | /
            +-------+
            | merge |
            +-------+
                |
                | mydata.tmp
                |
                V
            +--------+
            |  plot  |  <-- myparameter=myvalue
            +--------+
                |
                | myplot.png
                V
               STOP

We pick, for example, the CWL standard to express our computational steps. We store the workflow specification in the workflow directory:

$ find workflow
workflow/myinput.yaml
workflow/myworkflow.cwl
workflow/step-filter.cwl
workflow/step-plot.cwl
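
For illustration, the top-level myworkflow.cwl might be sketched along these lines (the input and output identifiers are hypothetical; the real specification depends on how the individual steps are written):

# workflow/myworkflow.cwl -- an illustrative sketch
cwlVersion: v1.0
class: Workflow
inputs:
  mydata: File
  myparameter: string
outputs:
  myplot:
    type: File
    outputSource: plot/myplot
steps:
  filter:
    run: step-filter.cwl
    in:
      data: mydata
    out: [filtered]
  plot:
    run: step-plot.cwl
    in:
      data: filter/filtered
      myparameter: myparameter
    out: [myplot]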

You will again be able to take inspiration from some real-life examples later in the Examples section of the documentation.

Picking a workflow engine

For simple needs, the Serial workflow engine is the quickest to start with. For regular needs, CWL or Yadage would be more appropriate.

Note that the level of REANA platform support for a particular workflow engine can differ:

Engine   Parametrised?   Parallel execution?   Caching?
CWL      yes             yes                   no (1)
Serial   yes             no                    yes
Yadage   yes             yes                   no (1)

  1. The vanilla workflow system may support the feature, but not when run via the REANA environment.

Develop workflow locally

Now that we have declared our analysis input data and code, as well as captured the computational steps in a structured manner, we can check whether our analysis runs in its original computing environment. We can use the helper wrapper script:

$ ./run.sh

or use workflow-specific commands, such as cwltool in the case of CWL workflows:

$ cwltool --quiet --outdir="./results" \
     ./workflow/myworkflow.cwl ./workflow/myinput.yaml

This completes the first step towards parametrising our analysis in a reproducible manner.

3.5. Containerise your environment

Now that we have fully described our inputs, our code, and the steps needed to run the analysis and produce our results, we need to make sure that the commands will run in the same environment. Capturing the environment specifics, for example the exact version of Python we use and the set of pre-installed libraries our analysis needs, is essential to ensure reproducibility.

The environment is encapsulated by means of “containers” such as Docker or Singularity.

Using an existing environment

Sometimes you can use an already-existing container environment prepared by others, for example python:2.7 for Python programs or clelange/cmssw:5_3_32 for the CMS Offline Software framework. In this case you simply specify the container name and version number in your workflow specification and you are good to go. This is usually possible when your code does not have to be compiled, for example Python scripts or ROOT macros.

Note also that REANA offers a set of containers that can serve as examples of how to containerise popular analysis environments such as ROOT (see reana-env-root6), Jupyter (see reana-env-jupyter) or an analysis framework such as AliPhysics (see reana-env-aliphysics).

Building your own environment

Other times you may need to build your own container, for example to add a certain library on top of Python 2.7. This is usually the case when your code needs to be compiled, for example a C++ analysis. It is the most typical use case, and the one we address below.

If you need to create your own environment, you can do so by providing a Dockerfile:

$ find environment
environment/myenv/Dockerfile

$ less environment/myenv/Dockerfile
# Start from the Python 2.7 base image:
FROM python:2.7

# Install HFtools:
RUN apt-get -y update && \
    apt-get -y install \
       python-pip \
       zip && \
    apt-get autoremove -y && \
    apt-get clean -y
RUN pip install hftools

# Copy our code into the image:
ADD code /code
WORKDIR /code

You can build this customised analysis environment image and give it some name, for example johndoe/myenv:

$ docker build -f environment/myenv/Dockerfile -t johndoe/myenv .

and push the created image to the Docker Hub image registry:

$ docker push johndoe/myenv

Supporting arbitrary user IDs

In the Docker container ecosystem, the processes running in containers use the root user identity by default. However, this may not be secure. If you want to improve the security of your environment, you can set up a dedicated user under whose identity the processes will run.

In order for processes to run under any user identity and still be able to write to shared workspaces, we use the GID=0 technique as used by OpenShift:

  • UID: you can use any user ID you want;
  • GID: you should add your user to the group with GID=0 (the root group).

This ensures writable access to the workspace directories managed by the REANA platform.

For example, you can create the user johndoe with UID=501 and add the user to GID=0 by adding the following commands at the end of the previous Dockerfile:

# Setup user and permissions
RUN adduser johndoe --uid 501 --disabled-password --gecos ""
RUN usermod -a -G 0 johndoe
USER johndoe
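
You can quickly check that the image behaves well under an arbitrary user identity; for example (the UID 12345 is arbitrary):

$ docker run --rm --user 12345:0 johndoe/myenv id
uid=12345 gid=0(root) groups=0(root)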

Testing the environment

We now have a container image encapsulating our computational environment, which we can use to run our analysis in a replicated environment elsewhere.

We should test the containerised environment to ensure it works properly, for example that all the necessary libraries are present:

$ docker run -i -t --rm johndoe/myenv /bin/bash
container> python -V
Python 2.7.15
container> python mycode.py < mydata.csv > /tmp/mydata.tmp

Multiple environments

Note that different steps of the analysis can run in different environments: the data filtering step, say, on a big cloud with data selection libraries installed, and the data plotting step in a small environment containing only your preferred graphing system. You can prepare several different environments for your analysis if needed.
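
In a Serial workflow, for example, this is simply a matter of specifying a different environment image for each step (the image names below are hypothetical):

steps:
  # Filtering stage runs in a big image with data selection libraries
  - environment: 'johndoe/myenv-filter'
    commands:
    - python ./code/mycode.py < ./data/mydata.csv > mydata.tmp
  # Plotting stage runs in a small image with the graphing system only
  - environment: 'johndoe/myenv-plot'
    commands:
    - python ./code/mycode.py --plot myparameter=myvalue < mydata.tmp > results/myplot.png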

3.6. Write your reana.yaml

We are now ready to tie all the above reproducible elements together. Our analysis example becomes:

$ find .
code/mycode.py
data/mydata.csv
docs/mynotes.txt
environment/myenv/Dockerfile
workflow/myinput.yaml
workflow/myworkflow.cwl
workflow/step-filter.cwl
workflow/step-plot.cwl
results/myplot.png

Only one thing remains in order to make it runnable on the REANA cloud: we need to capture the above structure by means of a reana.yaml file:

version: 0.4.0
inputs:
  files:
    - code/mycode.py
    - data/mydata.csv
  parameters:
    myparameter: myvalue
workflow:
  type: cwl
  file: workflow/myworkflow.cwl
outputs:
  files:
    - results/myplot.png

This file is used by REANA to instantiate and run the analysis on the cloud.
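
Before running the analysis, you can check that your reana.yaml is well formed by means of the reana-client validate command (the reana-client utility itself is installed in Section 3.8 below):

$ reana-client validate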

3.7. Declare necessary resources

You can declare additional runtime dependencies that your workflow needs for successful operation, such as access to CVMFS. This is achieved by providing a resources clause in reana.yaml. For example:

workflow:
  type: serial
  resources:
    - cvmfs:
      - fcc.cern.ch
  specification:
    steps:
      - environment: 'cern/slc6-base'
        commands:
        - ls -l /cvmfs/fcc.cern.ch/sw/views/releases/

3.8. Run your analysis on REANA cloud

We can now install the reana-client command-line utility, configure access to the remote REANA cloud where we shall run the analysis, and launch the analysis as follows:

$ # create new virtual environment
$ virtualenv ~/.virtualenvs/myreana
$ source ~/.virtualenvs/myreana/bin/activate
$ # install REANA client
$ pip install reana-client
$ # connect to some REANA cloud instance
$ export REANA_SERVER_URL=https://reana.cern.ch/
$ export REANA_ACCESS_TOKEN=XXXXXXX
$ # create new workflow
$ reana-client create -n my-analysis
$ export REANA_WORKON=my-analysis
$ # upload input code and data to the workspace
$ reana-client upload ./code ./data
$ # start computational workflow
$ reana-client start
$ # ... should be finished in about a minute
$ reana-client status
$ # list workspace files
$ reana-client ls
$ # download output results
$ reana-client download results/myplot.png

We are done! Our output plot should now be located in the results directory.

Note that you can inspect your analysis workspace by opening Jupyter notebook interactive sessions.
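
Depending on your reana-client version, such an interactive session can be opened directly from the command line, for example:

$ reana-client open -w my-analysis jupyter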

For more information on how to use reana-client, please see REANA-Client’s Getting started guide.

3.9. Examples

This section lists several REANA-compatible research data analysis examples that illustrate how a typical research data analysis can be packaged in a REANA-compatible manner to facilitate its future reuse.

3.10. Next steps

For more information on how to use reana-client, you can explore REANA-Client documentation.