Learn: Making Analyses Repeatable by Default | Reproducibility as a Workflow Principle
Productivity Tools for Data Scientists

Making Analyses Repeatable by Default

To make your analyses reliably repeatable, build habits that guard against the common pitfalls of ad-hoc, one-off work. The first is to use versioned data whenever possible: store copies of input datasets under explicit version numbers or timestamps, or rely on external sources that publish versioned releases. Tying an analysis to a specific data version means that rerunning the code later reads the same inputs, and so can produce the same results, even if the original source has changed in the meantime.
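One way to make "a specific data version" checkable is to record a checksum of the input file next to the analysis and refuse to load anything that does not match. The sketch below assumes the data lives in a local file; the function names `sha256_of_file` and `load_pinned_bytes` are hypothetical helpers written for this illustration, not part of any library.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large datasets do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_pinned_bytes(path: Path, expected_sha256: str) -> bytes:
    """Read a data file only if its contents match the recorded checksum."""
    actual = sha256_of_file(path)
    if actual != expected_sha256:
        raise ValueError(
            f"{path} does not match the pinned version: "
            f"expected {expected_sha256}, got {actual}"
        )
    return path.read_bytes()
```

In practice you would compute the digest once, when a file such as a hypothetical `data/sales_2024-01-15.csv` is added to the project, and store it in the notebook or a small manifest. A later run then fails loudly if the file was replaced or edited, instead of silently producing different numbers.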

Next, always document your environment requirements: the exact version of Python and of every library your code depends on. A simple way to do this is to maintain a requirements.txt file, for example by saving the output of pip freeze alongside the project. This lets others (and your future self) rebuild a matching environment and reduces surprises from library updates or incompatibilities.
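If you want to capture the same information from inside a notebook, the standard library can list what is installed. The following is a minimal sketch using `importlib.metadata`; the helpers `pinned_requirements` and `environment_report` are hypothetical names for this example, and the output only approximates what pip freeze prints (it does not special-case editable or URL-based installs).

```python
from __future__ import annotations

import platform
from importlib import metadata


def pinned_requirements() -> list[str]:
    """List installed distributions as sorted 'name==version' lines."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip installs whose metadata lacks a name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


def environment_report() -> str:
    """Build a requirements-style report headed by the Python version."""
    lines = [f"# Python {platform.python_version()}"]
    lines.extend(pinned_requirements())
    return "\n".join(lines)
```

Printing `environment_report()` in the last cell of a notebook, or writing it to a file kept with the project, leaves a record of the environment that actually produced the results.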

Finally, avoid hidden state in notebooks. Hidden state occurs when a cell's output depends on something in the kernel's memory that the notebook, read from top to bottom, would not recreate: for example, a variable defined in a cell that has since been edited or deleted, or cells that were executed out of order. This can lead to confusing errors or inconsistent results. To prevent it, structure your notebook so that it runs from top to bottom without manual intervention, with every variable and import defined before it is used. Restarting the kernel and running all cells is a good way to check for hidden dependencies.
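A saved notebook also carries a rough trace of how it was last run: each code cell in the .ipynb JSON stores an `execution_count`. After a restart followed by running all cells, those counts normally read 1, 2, 3, ... in document order. The sketch below uses that as a heuristic; `execution_counts` and `looks_freshly_run` are hypothetical helpers, and skipping empty code cells is an assumption made for this example.

```python
import json
from pathlib import Path


def execution_counts(notebook: dict) -> list:
    """Collect execution counts of non-empty code cells, in document order."""
    counts = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format stores source as a string or a list of lines.
        text = "".join(source) if isinstance(source, list) else source
        if text.strip():
            counts.append(cell.get("execution_count"))
    return counts


def looks_freshly_run(notebook: dict) -> bool:
    """True if code cells were executed exactly once, in order 1, 2, 3, ..."""
    counts = execution_counts(notebook)
    return counts == list(range(1, len(counts) + 1))


def check_notebook_file(path: Path) -> bool:
    """Load an .ipynb file and apply the heuristic above."""
    notebook = json.loads(path.read_text(encoding="utf-8"))
    return looks_freshly_run(notebook)
```

A `False` result does not prove the results are wrong, and `True` does not prove they are right; it only flags notebooks whose saved state did not come from a single clean run, which is when a restart and full rerun is most worth doing.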

Notebook can be rerun from top to bottom
  • All data loading steps reference versioned datasets;
  • Environment requirements are documented and reproducible;
  • No variables depend on hidden, previously executed cells;
  • The notebook produces the same results every time it is run from a fresh start.
Notebook requires manual intervention
  • Data is loaded from sources that may change or be unavailable;
  • Environment requirements are missing or outdated;
  • Some variables or imports are only defined in earlier cells that aren't obviously required;
  • Rerunning the notebook from top to bottom fails or produces different results unless you execute cells in a specific order or manually adjust state.

What obstacles have you faced when trying to rerun your own or others’ analyses?

