Making Analyses Repeatable by Default | Reproducibility as a Workflow Principle
Productivity Tools for Data Scientists

Making Analyses Repeatable by Default

To make your analyses reliably repeatable, build habits that guard against the common pitfalls of ad-hoc, one-off work. The first habit is to use versioned data whenever possible: store copies of input datasets with clear version numbers or timestamps, or rely on external sources that publish versioned releases. Tying your analysis to a specific data version means that rerunning your code later starts from exactly the same inputs, even if the original source has changed since.
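One way to tie an analysis to a specific data version is to record a checksum of the input file and refuse to load it if the contents have changed. The following is a minimal sketch using only the standard library; the function names `sha256_of` and `load_versioned`, and the file name in the usage comment, are illustrative assumptions rather than part of any particular tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_versioned(path: Path, expected_sha256: str) -> bytes:
    """Read a dataset only if its contents match the recorded checksum."""
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise ValueError(
            f"{path} does not match the recorded version: "
            f"expected {expected_sha256[:12]}..., got {actual[:12]}..."
        )
    return path.read_bytes()


# Hypothetical usage: the file name and checksum below are placeholders.
# raw = load_versioned(Path("data/sales_2024-01-15.csv"), "<recorded sha256>")
```

Storing the expected checksum next to the code (for example in the notebook itself or in a small manifest file) makes a silent change in the input data fail loudly instead of quietly changing the results.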

Next, always document your environment requirements, including the exact versions of Python and of every library your code depends on. A simple way to do this is to maintain a requirements.txt file or to save the output of pip freeze alongside your project documentation. Note that pip freeze lists installed packages but not the Python version itself, so record that separately. This practice helps others (and your future self) recreate a matching environment, minimizing surprises from library updates or incompatibilities.
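The environment can also be captured from inside Python, which is convenient at the end of a notebook. The sketch below uses the standard-library `importlib.metadata` and `platform` modules; the helper names `pinned_requirements` and `environment_report` are hypothetical. Its output resembles pip freeze but is not guaranteed to be identical to it (for instance, editable or URL-based installs are listed by name and version only).

```python
import platform
from importlib import metadata


def pinned_requirements() -> list[str]:
    """Return 'name==version' lines for every installed distribution."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken or missing metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


def environment_report() -> str:
    """Build a requirements-style report headed by the Python version."""
    header = f"# Python {platform.python_version()}"
    return "\n".join([header, *pinned_requirements()])


if __name__ == "__main__":
    print(environment_report())
```

Because the first line is a comment, the report can be saved as a requirements.txt file and still be read by pip, while keeping the Python version visible to anyone recreating the environment.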

Finally, avoid hidden state in notebooks. Hidden state occurs when a cell's output depends on variables or imports that live in the kernel's memory but are not produced by the notebook as currently written, for example because cells were run out of order, or because the defining cell was later edited or deleted. This can lead to confusing errors or inconsistent results. To prevent this, structure your notebook so that it can be run from top to bottom without manual intervention, and make sure all necessary variables and imports are defined before they are used. Restarting the kernel and running all cells is a good way to check for hidden dependencies.
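A saved notebook also carries a hint about how it was last run: each code cell in the .ipynb JSON stores an `execution_count`. The sketch below, with the hypothetical helpers `execution_counts` and `ran_top_to_bottom`, flags notebooks whose counts are not 1, 2, 3, ... in document order. It is only a heuristic, assuming the standard notebook JSON layout, and it complements rather than replaces restarting the kernel and running all cells.

```python
import json
from pathlib import Path


def execution_counts(notebook: dict) -> list:
    """Collect execution_count for each non-empty code cell, in document order."""
    counts = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be a string or a list of lines
            source = "".join(source)
        if not source.strip():  # empty cells do not receive an execution count
            continue
        counts.append(cell.get("execution_count"))
    return counts


def ran_top_to_bottom(notebook: dict) -> bool:
    """True if the saved counts are exactly 1..n in document order."""
    counts = execution_counts(notebook)
    return counts == list(range(1, len(counts) + 1))


def check_notebook_file(path: Path) -> bool:
    """Load an .ipynb file and apply the execution-order check."""
    return ran_top_to_bottom(json.loads(path.read_text(encoding="utf-8")))
```

A result of False means at least one cell was skipped, rerun, or executed out of order since the kernel started, which is a good prompt to restart and run everything again before sharing the notebook.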

Notebook can be rerun from top to bottom
  • All data loading steps reference versioned datasets;
  • Environment requirements are documented and reproducible;
  • No variables depend on hidden, previously executed cells;
  • The notebook produces the same results every time it is run from a fresh start.
Notebook requires manual intervention
  • Data is loaded from sources that may change or be unavailable;
  • Environment requirements are missing or outdated;
  • Some variables or imports are only defined in earlier cells that aren't obviously required;
  • Rerunning the notebook from top to bottom fails or produces different results unless you execute cells in a specific order or manually adjust state.

What obstacles have you faced when trying to rerun your own or others’ analyses?



Section 4. Chapter 2
