Learn Making Analyses Repeatable by Default | Reproducibility as a Workflow Principle
Productivity Tools for Data Scientists

Making Analyses Repeatable by Default

To make your analyses reliably repeatable, you need habits that prevent the common pitfalls of ad-hoc, one-off work. The first is to use versioned data whenever possible: store copies of input datasets with clear version numbers or timestamps, or use external sources that publish versioned releases. Tying your analysis to a specific data version removes a major source of drift, because rerunning your code later reads the same inputs even if the original source has changed.
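One way to tie an analysis to a specific data version is to record a checksum next to the versioned file and refuse to load anything that does not match. The following is a minimal sketch using only the standard library; the function names `sha256_of_file` and `load_versioned_bytes` are hypothetical, chosen for illustration rather than taken from the course.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_versioned_bytes(path: Path, expected_sha256: str) -> bytes:
    """Load a dataset only if it matches the checksum pinned for this analysis.

    Hypothetical helper: raises ValueError when the file on disk differs
    from the version the analysis was written against.
    """
    actual = sha256_of_file(path)
    if actual != expected_sha256:
        raise ValueError(
            f"{path} does not match the pinned data version: "
            f"expected {expected_sha256}, got {actual}"
        )
    return path.read_bytes()
```

In practice the pinned checksum would sit next to the data-loading call or in a small manifest file, so a silently changed input fails loudly instead of producing different results.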

Next, always document your environment requirements, including the exact versions of Python and every library your code depends on. A simple way to do this is to maintain a requirements.txt file or to record the output of pip freeze in your project documentation. Because pip freeze lists installed packages but not the Python version, record the interpreter version separately. This practice helps others (and your future self) set up a matching environment, minimizing surprises from library updates or incompatibilities.
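As an illustration, the environment can also be captured from inside Python itself. This sketch uses the standard-library `importlib.metadata` and `platform` modules to produce lines in the same `name==version` style as pip freeze, with the Python version added as a comment line. The function name `snapshot_environment` is a hypothetical choice for this example.

```python
import platform
from importlib import metadata


def snapshot_environment() -> list[str]:
    """Return the Python version plus one 'name==version' pin per installed package."""
    # The interpreter version goes in a comment line so the output can still
    # be saved as a requirements-style file.
    lines = [f"# python=={platform.python_version()}"]
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken or missing metadata
            pins.add(f"{name}=={dist.version}")
    lines.extend(sorted(pins, key=str.lower))
    return lines


if __name__ == "__main__":
    print("\n".join(snapshot_environment()))
```

Saving this output alongside the analysis gives a record of what was actually installed when the results were produced, which may differ from what a hand-maintained requirements.txt claims.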

Finally, avoid hidden state in notebooks. Hidden state occurs when the output of a cell depends on variables or imports that exist in the kernel's memory but are not produced by the notebook as written, for example because the defining cell was run out of order, edited, or deleted. This can lead to confusing errors or inconsistent results. To prevent this, structure your notebook so that it runs from top to bottom without manual intervention, with every variable and import defined before it is used. Restarting the kernel and running all cells is a good way to check for hidden dependencies.
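The restart-and-run-all check can be approximated in plain Python: execute each code cell in order inside a fresh, empty namespace and report the first cell that fails. This is a rough sketch, not a replacement for the notebook's own restart command. It assumes the notebook has been loaded as a dictionary in the standard .ipynb JSON layout, and that cells contain plain Python (IPython magics such as `%matplotlib` would fail here). The names `code_cells` and `first_failing_cell` are hypothetical.

```python
def code_cells(notebook: dict) -> list[str]:
    """Extract the source of each code cell from a parsed .ipynb document."""
    cells = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            source = cell.get("source", "")
            # The notebook format stores source as a string or a list of lines.
            cells.append("".join(source) if isinstance(source, list) else source)
    return cells


def first_failing_cell(cells: list[str]):
    """Run cells top to bottom in a fresh namespace.

    Returns None if every cell runs, otherwise (cell_index, error_description)
    for the first cell that raises.
    """
    namespace = {"__name__": "__main__"}  # nothing carried over from earlier runs
    for index, source in enumerate(cells):
        try:
            exec(compile(source, f"<cell {index}>", "exec"), namespace)
        except Exception as error:
            return index, f"{type(error).__name__}: {error}"
    return None


if __name__ == "__main__":
    # A cell that uses `x` before the cell that defines it: works only with hidden state.
    out_of_order = ["y = x + 1", "x = 1"]
    print(first_failing_cell(out_of_order))
```

A notebook file could be fed in with `code_cells(json.loads(Path("analysis.ipynb").read_text()))`, where `analysis.ipynb` is a placeholder filename. A `NameError` in the result usually points to a variable that only existed because of out-of-order execution.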

Notebook can be rerun from top to bottom
  • All data loading steps reference versioned datasets;
  • Environment requirements are documented and reproducible;
  • No variables depend on hidden, previously executed cells;
  • The notebook produces the same results every time it is run from a fresh start.
Notebook requires manual intervention
  • Data is loaded from sources that may change or be unavailable;
  • Environment requirements are missing or outdated;
  • Some variables or imports are only defined in earlier cells that aren't obviously required;
  • Rerunning the notebook from top to bottom fails or produces different results unless you execute cells in a specific order or manually adjust state.

What obstacles have you faced when trying to rerun your own or others’ analyses?


Section 4. Chapter 2
