
Jupyter Best Practices

Structure notebooks like a document — top to bottom, readable without running:

1. Title & Purpose (Markdown)
2. Imports & Configuration
3. Data Loading
4. Exploratory Analysis
5. Feature Engineering / Processing
6. Modeling / Analysis
7. Results & Conclusions
# Good — one logical unit per cell
df = pd.read_csv('data.csv')
df.head()
# Bad — too much in one cell
df = pd.read_csv('data.csv')
df = df.dropna()
df['new_col'] = df['a'] * df['b']
model = train_model(df)
results = evaluate(model)
plot_results(results)
# At the top of the notebook
import sys
print(f"Python: {sys.version}")
import pandas as pd
import numpy as np
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")
# Save exact environment
pip freeze > requirements.txt
conda env export > environment.yml
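The version printout above can also be produced for any list of packages without importing them. The following is a minimal sketch using the standard library's importlib.metadata; the helper name package_versions and the package list are illustrative assumptions, not part of the original guide.

```python
from importlib import metadata


def package_versions(names):
    """Return {distribution name: installed version, or 'not installed'}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


# Hypothetical package list; replace with the notebook's real dependencies.
for name, version in package_versions(["pandas", "numpy"]).items():
    print(f"{name}: {version}")
```

Because the lookup reads installed metadata rather than importing each package, a missing dependency is reported instead of raising ImportError at the top of the notebook.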
import random
import numpy as np
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
# For PyTorch (only if it is installed and used)
import torch
torch.manual_seed(SEED)
# For TensorFlow (only if it is installed and used)
import tensorflow as tf
tf.random.set_seed(SEED)
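A quick way to confirm that seeding has the intended effect is to reseed and compare the draws. This sketch uses only the standard library generator; the helper name sample is an assumption made for illustration.

```python
import random

SEED = 42


def sample(seed, n=3):
    """Seed the stdlib generator, then draw n pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first = sample(SEED)
second = sample(SEED)
print(first == second)  # True: the same seed reproduces the same sequence
```

The same check can be repeated for NumPy, PyTorch, or TensorFlow generators if the notebook relies on them.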

Hidden state, such as variables left behind by deleted or out-of-order cells, is one of the most common sources of notebook bugs. Before sharing:

  1. Kernel → Restart & Clear Output
  2. Kernel → Restart & Run All
  3. Verify all outputs are correct
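The steps above can be partly checked automatically: after Restart & Run All, the non-empty code cells of a saved notebook normally carry execution counts 1, 2, 3, and so on from top to bottom. The sketch below inspects the notebook's JSON structure as a plain dictionary; the function name executed_in_order and the sample notebooks are hypothetical.

```python
def executed_in_order(notebook):
    """True if non-empty code cells carry execution counts 1, 2, 3, ... top to bottom."""
    counts = [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        # "source" may be a string or a list of strings; join handles both.
        if cell.get("cell_type") == "code" and "".join(cell.get("source", "")).strip()
    ]
    return counts == list(range(1, len(counts) + 1))


# Two tiny in-memory notebooks standing in for real .ipynb files.
fresh = {"cells": [
    {"cell_type": "markdown", "source": ["# Title"]},
    {"cell_type": "code", "execution_count": 1, "source": ["x = 1"]},
    {"cell_type": "code", "execution_count": 2, "source": ["x + 1"]},
]}
stale = {"cells": [
    {"cell_type": "code", "execution_count": 7, "source": ["x = 1"]},
    {"cell_type": "code", "execution_count": 3, "source": ["x + 1"]},
]}
print(executed_in_order(fresh), executed_in_order(stale))
```

For a real file, the dictionary can be obtained with json.load on the opened .ipynb file. A passing check does not prove the outputs are correct, so step 3 still applies.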

.ipynb files are JSON with embedded outputs and execution counts, so line-based diffs are noisy. Two tools help:

# nbstripout — strip outputs before committing
pip install nbstripout
nbstripout --install # installs git filter
# nbdime — diff and merge notebooks
pip install nbdime
nbdime config-git --enable --global
nbdiff notebook_v1.ipynb notebook_v2.ipynb
# .gitignore
# Outputs are better handled by nbstripout (above); also ignore checkpoint directories
.ipynb_checkpoints/
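To make concrete what stripping outputs means, the sketch below clears outputs and execution counts from a notebook's JSON structure. It is a simplified illustration, not nbstripout's implementation, which also handles metadata and integrates with git as a filter; the function name strip_outputs is an assumption.

```python
import copy


def strip_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs and execution counts cleared."""
    stripped = copy.deepcopy(notebook)
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


# A tiny in-memory notebook standing in for a real .ipynb file.
executed = {"cells": [
    {"cell_type": "code", "execution_count": 4, "source": ["1 + 1"],
     "outputs": [{"output_type": "execute_result", "data": {"text/plain": ["2"]}}]},
]}
clean = strip_outputs(executed)
print(clean["cells"][0]["outputs"], clean["cells"][0]["execution_count"])
```

Only the source of each cell remains after stripping, which is why diffs of stripped notebooks show code changes rather than output churn.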

Run notebooks programmatically with different parameters:

pip install papermill
# In notebook — tag a cell as 'parameters'
# (Add tag via: View → Cell Toolbar → Tags)
alpha = 0.01
n_estimators = 100
data_path = "data/train.csv"
# Execute from Python or CLI
import papermill as pm
pm.execute_notebook(
    'train_model.ipynb',
    'output/train_model_run1.ipynb',
    parameters={
        'alpha': 0.05,
        'n_estimators': 200,
        'data_path': 'data/train_v2.csv',
    },
)
# CLI
papermill train_model.ipynb output.ipynb -p alpha 0.05 -p n_estimators 200
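Conceptually, papermill leaves the tagged cell untouched and adds a new cell, tagged injected-parameters, right after it, so the injected values override the defaults when the notebook runs. The sketch below mimics that idea on a plain notebook dictionary; it is a simplified illustration under that assumption, not papermill's code, and the function name inject_parameters is hypothetical.

```python
import copy


def inject_parameters(notebook, parameters):
    """Insert a code cell assigning `parameters` right after the cell tagged 'parameters'."""
    result = copy.deepcopy(notebook)
    cells = result.setdefault("cells", [])
    # Position after the first cell tagged 'parameters'; fall back to the top.
    index = next(
        (i + 1 for i, cell in enumerate(cells)
         if "parameters" in cell.get("metadata", {}).get("tags", [])),
        0,
    )
    source = "\n".join(f"{name} = {value!r}" for name, value in parameters.items())
    cells.insert(index, {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {"tags": ["injected-parameters"]},
        "outputs": [],
        "source": source,
    })
    return result


# A tiny in-memory notebook standing in for train_model.ipynb.
demo = {"cells": [
    {"cell_type": "code", "metadata": {"tags": ["parameters"]},
     "source": "alpha = 0.01", "outputs": [], "execution_count": None},
    {"cell_type": "code", "metadata": {},
     "source": "print(alpha)", "outputs": [], "execution_count": None},
]}
injected = inject_parameters(demo, {"alpha": 0.05, "data_path": "data/train_v2.csv"})
print(injected["cells"][1]["source"])
```

Because the defaults stay in place, the notebook still runs interactively without papermill, and each executed copy records the values it was run with.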
# To HTML
jupyter nbconvert --to html notebook.ipynb
# To PDF (requires LaTeX)
jupyter nbconvert --to pdf notebook.ipynb
# To Python script
jupyter nbconvert --to script notebook.ipynb
# To slides (reveal.js)
jupyter nbconvert --to slides notebook.ipynb --post serve
# Execute and convert
jupyter nbconvert --to html --execute notebook.ipynb
# nbval — pytest plugin for notebooks
pip install nbval
pytest --nbval notebook.ipynb
# testbook — unit test notebook functions
pip install testbook
# testbook example
from testbook import testbook
@testbook('my_notebook.ipynb', execute=True)
def test_add(tb):
    add = tb.ref('add')
    assert add(2, 3) == 5
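The test above assumes that my_notebook.ipynb contains a cell defining add. The notebook's contents are not shown in this guide, so the following cell is a hypothetical example of what it might look like.

```python
# Hypothetical cell inside my_notebook.ipynb
def add(a, b):
    """Return the sum of two numbers."""
    return a + b
```

Keeping functions like this in their own cells, free of side effects, makes them straightforward to reference from a test.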
# Use vectorized operations, not loops
# Bad
result = []
for val in df['price']:
    result.append(val * 1.1)
# Good
result = df['price'] * 1.1
# Profile before optimizing
%prun expensive_function()
# Use chunking for large files
for chunk in pd.read_csv('large.csv', chunksize=10_000):
    process(chunk)
# Dask for out-of-memory data
import dask.dataframe as dd
df = dd.read_csv('huge_file.csv')
result = df.groupby('category').amount.sum().compute()
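The chunking loop above leaves process undefined. One common pattern is to accumulate partial aggregates so that only one chunk is in memory at a time. The sketch below assumes a numeric column; the helper name chunked_sum and the small in-memory CSV (standing in for large.csv) are illustrative assumptions.

```python
import io

import pandas as pd


def chunked_sum(csv_source, column, chunksize=10_000):
    """Sum one column of a CSV by reading it chunk by chunk."""
    total = 0.0
    rows = 0
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        total += chunk[column].sum()
        rows += len(chunk)
    return total, rows


# Small in-memory CSV with prices 1..10, read three rows at a time.
csv_text = "price\n" + "\n".join(str(i) for i in range(1, 11))
total, rows = chunked_sum(io.StringIO(csv_text), "price", chunksize=3)
print(total, rows)
```

Aggregates such as sums and counts combine cleanly across chunks; statistics like medians do not, which is where a tool such as Dask becomes more convenient.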
  • Never commit notebooks with secrets (API keys, passwords)
  • Use environment variables or .env files, and keep .env out of version control
import os
from dotenv import load_dotenv
load_dotenv()
api_key = os.getenv('OPENAI_API_KEY')
pip install python-dotenv
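os.getenv returns None when a variable is missing, which can surface much later as a confusing authentication error. A small guard that fails early is one option; the sketch below uses only the standard library, and the helper name require_env and the variable EXAMPLE_API_KEY are hypothetical.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Placeholder value set here only so the example runs; real secrets belong
# in the environment or a .env file, never in the notebook itself.
os.environ["EXAMPLE_API_KEY"] = "dummy-value"
api_key = require_env("EXAMPLE_API_KEY")
```

Calling the guard in the configuration cell near the top of the notebook means a missing key stops the run before any expensive work starts.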