Data Flow

16 Nov 2020

How to cite this blog post:

Joukhadar, Z., & Spreadborough, K. (2020, November 16). Data Flow. Kristal Spreadborough’s Blog. https://kristalspreadborough.github.io/blog/2020/11/16/data-flow

This is an overview of one way to structure a project’s pipeline and data folders. My colleague Zaher Joukhadar at the Melbourne Data Analytics Platform (MDAP) introduced me to this structure and described its various aspects to me over a series of conversations. Over the last few months, I have watched Zaher implement this and similar pipelines and data folders in MDAP collaborations, but I could not find a resource which put the process in writing. And such a resource was exactly what I needed, not being a computer science native myself! Thanks to Zaher’s generosity with his time and willingness to share his experience and knowledge, I have committed the process to writing in this blog post as a resource for myself and others who might be interested. Full credit goes to Zaher - check out his website and follow him on the socials!

1 Pipeline

The pipeline outlines the flow of code and data.

[Figure: pipeline]

1.1 Code

The code is contained within a module called ‘code’ (note: this module can be called anything; ‘code’ is just the name used in this example). The code is stored on the collaboration repository (repo).

‘code’ itself is a package, and each of its submodules is also a package. To get Python to treat each module as a package, [include an __init__.py file within each submodule](https://docs.python.org/3/tutorial/modules.html#packages). See below. Note: __init__.py files are no longer strictly required from Python 3.3 onwards. However, having them won’t affect how the package runs, so they may be useful to include in case the package is run on earlier versions of Python.

All submodules containing code for the package start with the letter ‘p’. The numbers that follow indicate the order in which the submodules should be run.

[Figure: repo package folder structure]
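
To make this concrete, here is a rough sketch of what the repo might look like. Apart from p04_modeling and p06_reporting, which are named in this post, the submodule names below are hypothetical placeholders:

```
collaboration-repo/
└── code/
    ├── __init__.py
    ├── p01_data_collection/      # hypothetical name
    │   └── __init__.py
    ├── p02_cleaning/             # hypothetical name
    │   └── __init__.py
    ├── p03_processing/           # hypothetical name
    │   └── __init__.py
    ├── p04_modeling/
    │   └── __init__.py
    ├── p05_model_evaluation/     # hypothetical name
    │   └── __init__.py
    └── p06_reporting/
        └── __init__.py
```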

Note: I tend not to produce models as part of my work, but rather focus on analyses and visualizations. For this reason, I tend to replace p04_modeling with p04_analysis and/or p04_visualizations, and I tend not to use p05 at all.

In addition to the code stored in these submodules, you may also want to have ‘base’ and ‘local’ yaml files. The local yaml file is stored only on your local machine and holds passwords, keys, tokens, etc. It should not be pushed to the repo. The base yaml contains switches which allow particular scripts/submodules to be turned on and off, dependencies, etc. Adding these to the yaml files rather than hard coding them makes updating easier.
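
As a minimal sketch of how this might work at run time (the file names base.yml and local.yml, and the keys shown, are assumptions for illustration rather than part of the structure described above):

```python
import yaml  # PyYAML

def load_config(base_path="base.yml", local_path="local.yml"):
    """Merge the shared base config with the untracked local config."""
    with open(base_path) as f:
        config = yaml.safe_load(f)        # switches, dependencies, paths, etc.
    with open(local_path) as f:
        config.update(yaml.safe_load(f))  # passwords, keys, tokens - never pushed
    return config

config = load_config()
if config.get("run_p04_modeling", True):  # example of a 'switch' for one submodule
    print("p04_modeling is switched on")
```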

You can also include a Snakemake file, which can be used for optimising performance in HPC environments: https://snakemake.readthedocs.io/en/stable/
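
For a sense of what that looks like, here is a minimal Snakefile sketch (Snakemake uses a Python-based syntax; the rule names, file paths, and module names below are hypothetical):

```python
# Snakefile - a hypothetical two-step slice of the pipeline
rule all:
    input:
        "data/d03_processed/features.csv"

rule clean:
    input:
        "data/d01_original/raw.csv"
    output:
        "data/d02_derived/clean.csv"
    shell:
        "python -m code.p02_cleaning.clean {input} {output}"

rule process:
    input:
        "data/d02_derived/clean.csv"
    output:
        "data/d03_processed/features.csv"
    shell:
        "python -m code.p03_processing.build_features {input} {output}"
```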

1.2 Data Folders

All data folders start with ‘d’. The two numbers following indicate the submodule of the package to which the data folder corresponds. In this example, data folders are stored on Volume Storage and MediaFlux (these are specific storage solutions at the University of Melbourne; you may have different storage solutions), and in the repository (repo).

[Figure: data folders]

In this example, data is written out after every submodule in the package ‘code’. This is not necessary, but it is a good way to be able to reverse engineer your steps if something goes wrong. It also means that you have the data ready to go if you want to pick up from a particular submodule of the package (e.g. you don’t have to clean the data every time).
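
As a loose sketch of what writing data out after a step can look like (the paths, folder name, and function below are made up for illustration):

```python
from pathlib import Path
import pandas as pd

# Hypothetical Volume Storage location for the folder matching submodule p02
DERIVED = Path("/data/volume_storage/d02_derived")

def run_cleaning(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical p02 step: clean the raw data and persist the result."""
    clean = raw.dropna().drop_duplicates()
    # Writing the intermediate result out means a later run can pick up
    # from here instead of repeating the cleaning step.
    DERIVED.mkdir(parents=True, exist_ok=True)
    clean.to_csv(DERIVED / "clean.csv", index=False)
    return clean
```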

1.2.1 Volume Storage and MediaFlux

Original, derived, and analytics data folders are shown on either side of the repo. This is symbolic only; the data are all stored in the same place (in this example, both Volume Storage - for working with the data - and MediaFlux - for backing up the data at every step). Never store this data on your repo. Not only are these files too large, but there could also be issues with ethics (some data can’t be shared, but the code and outputs can). Also, there are better platforms for sharing data, such as FigShare.

Note: As mentioned above, because I tend to work with analyses and visualizations, I use the Analytics folder to store these outputs.

1.2.2 Repository

The data folder d06_reporting on the repo contains the outputs of the p06_reporting submodule described above. It is hosted on the repo since these files are not very large. Hosting on the repo also makes these outputs easily shareable with the end users, since they can be accessed by a link and downloaded. In this example, there are two subfolders, but you can use whatever subfolders suit your workflow.

Note: you may prefer to keep d06_reporting in Volume Storage and MediaFlux - it doesn’t matter too much.

2 Data

This section is concerned with the original data:

[Figure: original data folder]

This is just one way to structure your original data; you can adapt it as you need. How you group and order your folders and subfolders will depend on the nature of your data and the intended analysis. In this example, original data is grouped by type: Twitter and Hansard. In addition to the actual data, the following files can also be contained within the folder structure:

[Figure: original data file structure]

| filename | timestamp | description | file type |
|---|---|---|---|
| Name of the file or folder. Should include the path to the file or folder. | Time or date stamp for the file, or n/a if not needed. | A short description of the file. | The type of file (either a description or the actual MIME type). |

Note: think of these files as containing metadata about the data. The metadata should be enough for someone else to be able to understand and reproduce the data.
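
One lightweight way to capture those four fields (purely illustrative - the structure above doesn’t prescribe a format or a script, and the path below is hypothetical) is to generate a metadata file alongside the original data:

```python
import csv
import datetime
from pathlib import Path

# Hypothetical location of the original data folder
ORIGINAL = Path("/data/volume_storage/original")

with open(ORIGINAL / "metadata.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "timestamp", "description", "file type"])
    for path in sorted(ORIGINAL.rglob("*")):
        if path.is_file() and path.name != "metadata.csv":
            modified = datetime.datetime.fromtimestamp(path.stat().st_mtime)
            writer.writerow([
                str(path),                 # name including the path
                modified.isoformat(),      # or "n/a" if not needed
                "",                        # add a short description by hand
                path.suffix or "unknown",  # crude stand-in for the MIME type
            ])
```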