Jupyter Best Practices

Notebook Organization

Structure a notebook like a document: it should read top to bottom and make sense without being run.

1. Title & Purpose (Markdown)
2. Imports & Configuration
3. Data Loading
4. Exploratory Analysis
5. Feature Engineering / Processing
6. Modeling / Analysis
7. Results & Conclusions

Cell Discipline

# Good — one logical unit per cell
df = pd.read_csv('data.csv')
df.head()

# Bad — too much in one cell
df = pd.read_csv('data.csv')
df = df.dropna()
df['new_col'] = df['a'] * df['b']
model = train_model(df)
results = evaluate(model)
plot_results(results)

Reproducibility

Pin Dependencies

# At the top of the notebook
import sys
print(f"Python: {sys.version}")

import pandas as pd
import numpy as np
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")

# Save the exact environment (shell commands; prefix with ! to run them from a notebook cell)
pip freeze > requirements.txt
conda env export > environment.yml

Set Random Seeds

import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# For PyTorch
import torch
torch.manual_seed(SEED)

# For TensorFlow
import tensorflow as tf
tf.random.set_seed(SEED)
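As a quick check of what seeding buys you, the sketch below uses only the standard-library `random` module. The helper name `draw` is hypothetical and exists only for this illustration.

```python
import random

SEED = 42


def draw(seed, n=5):
    """Return n pseudo-random floats from a generator seeded with `seed`."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    return [rng.random() for _ in range(n)]


first = draw(SEED)
second = draw(SEED)
other = draw(SEED + 1)

print(first == second)  # same seed, so the sequences match
print(first == other)   # different seed, so the sequences differ
```

A local generator (here `random.Random`, or `np.random.default_rng(SEED)` for NumPy) may be preferable in notebooks, since its state does not depend on which other cells have already drawn from the global generator.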

Hidden state, such as variables left over from deleted or re-ordered cells, is a common source of notebook bugs. Before sharing a notebook:

1. Kernel → Restart & Clear Output
2. Kernel → Restart & Run All
3. Verify that all outputs are correct
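One way to spot a notebook that was not run cleanly is to inspect the execution counts stored in its JSON. The sketch below assumes the nbformat 4 layout (a top-level `cells` list whose code cells carry `execution_count`) and assumes that empty code cells are skipped by Restart & Run All. The function names are hypothetical.

```python
import json


def load_notebook(path):
    """Read a .ipynb file into a plain dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def executed_in_order(notebook):
    """True if every non-empty code cell ran exactly once, top to bottom, from a fresh kernel."""
    counts = [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code" and "".join(cell.get("source", "")).strip()
    ]
    # After a clean Restart & Run All the counts are 1, 2, 3, ... with no gaps
    return counts == list(range(1, len(counts) + 1))


clean = {"cells": [
    {"cell_type": "markdown", "source": ["# Title"]},
    {"cell_type": "code", "execution_count": 1, "source": ["x = 1"]},
    {"cell_type": "code", "execution_count": 2, "source": ["x + 1"]},
]}
stale = {"cells": [
    {"cell_type": "code", "execution_count": 3, "source": ["x = 1"]},
    {"cell_type": "code", "execution_count": 2, "source": ["x + 1"]},
]}

print(executed_in_order(clean))
print(executed_in_order(stale))
```

For a file on disk, `executed_in_order(load_notebook("notebook.ipynb"))` applies the same check.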

Version Control

.ipynb files are JSON with outputs and execution counts embedded, so diffs are noisy. Two tools help:

# nbstripout — strip outputs before committing
pip install nbstripout
nbstripout --install  # installs git filter

# nbdime — diff and merge notebooks
pip install nbdime
nbdime config-git --enable --global
nbdiff notebook_v1.ipynb notebook_v2.ipynb
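To make the idea of output stripping concrete, here is a minimal sketch of the transformation on a notebook dict. It is not nbstripout's actual implementation, only an illustration under the assumption of the nbformat 4 layout; `strip_outputs` is a hypothetical name.

```python
import copy
import json


def strip_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs and execution counts cleared."""
    stripped = copy.deepcopy(notebook)  # leave the caller's dict untouched
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Demo"]},
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["hello\n"]}],
            "source": ["print('hello')"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

clean = strip_outputs(notebook)
print(json.dumps(clean["cells"][1], indent=1))
```

With outputs and counts removed, only changes to cell sources and metadata remain, which is what keeps the diff small.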

`.gitignore` for Notebooks

# .gitignore cannot strip outputs; use nbstripout for that
# Ignore the autosave checkpoints Jupyter creates
.ipynb_checkpoints/

Parameterization with Papermill

Run notebooks programmatically with different parameters:

pip install papermill

# In the notebook: tag one cell as 'parameters'
# (classic Notebook: View → Cell Toolbar → Tags; JupyterLab: the cell's property inspector)
alpha = 0.01
n_estimators = 100
data_path = "data/train.csv"

# Execute from Python
import papermill as pm

pm.execute_notebook(
    'train_model.ipynb',
    'output/train_model_run1.ipynb',
    parameters={
        'alpha': 0.05,
        'n_estimators': 200,
        'data_path': 'data/train_v2.csv'
    }
)

# CLI
papermill train_model.ipynb output.ipynb -p alpha 0.05 -p n_estimators 200
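A common next step is a parameter sweep: one executed notebook per combination. The sketch below reuses the `alpha` and `n_estimators` parameters from above. The helpers `build_runs` and `execute_runs` and the output naming scheme are assumptions made for illustration.

```python
from itertools import product
from pathlib import Path


def build_runs(grid, output_dir="output", stem="train_model"):
    """Expand a parameter grid into (output_path, parameters) pairs, one per combination."""
    names = sorted(grid)
    runs = []
    for i, values in enumerate(product(*(grid[name] for name in names)), start=1):
        params = dict(zip(names, values))
        output_path = (Path(output_dir) / f"{stem}_run{i}.ipynb").as_posix()
        runs.append((output_path, params))
    return runs


def execute_runs(input_path, runs):
    """Execute the notebook once per run; requires papermill to be installed."""
    import papermill as pm  # imported lazily so build_runs works without papermill

    for output_path, params in runs:
        # Create the output folder up front rather than relying on papermill to do it
        Path(output_path).parent.mkdir(parents=True, exist_ok=True)
        pm.execute_notebook(input_path, output_path, parameters=params)


grid = {"alpha": [0.01, 0.05], "n_estimators": [100, 200]}
runs = build_runs(grid)
for output_path, params in runs:
    print(output_path, params)

# execute_runs("train_model.ipynb", runs)  # uncomment to launch the sweep
```

Keeping the grid expansion separate from execution makes it easy to review the planned runs before starting anything expensive.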

Converting Notebooks

# To HTML
jupyter nbconvert --to html notebook.ipynb

# To PDF (requires LaTeX)
jupyter nbconvert --to pdf notebook.ipynb

# To Python script
jupyter nbconvert --to script notebook.ipynb

# To slides (reveal.js)
jupyter nbconvert --to slides notebook.ipynb --post serve

# Execute and convert
jupyter nbconvert --to html --execute notebook.ipynb

Testing Notebooks

# nbval — pytest plugin for notebooks
pip install nbval
pytest --nbval notebook.ipynb

# testbook — unit test notebook functions
pip install testbook

# testbook example (assumes my_notebook.ipynb defines an add function)
from testbook import testbook

@testbook('my_notebook.ipynb', execute=True)
def test_add(tb):
    add = tb.ref('add')
    assert add(2, 3) == 5

Performance Tips

# Use vectorized operations, not loops
# Bad
result = []
for val in df['price']:
    result.append(val * 1.1)

# Good
result = df['price'] * 1.1

# Profile before optimizing
%prun expensive_function()

# Use chunking for large files
for chunk in pd.read_csv('large.csv', chunksize=10_000):
    process(chunk)

# Dask for data that does not fit in memory
import dask.dataframe as dd
df = dd.read_csv('huge_file.csv')
result = df.groupby('category').amount.sum().compute()
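The chunking loop above leaves `process` undefined. A typical pattern is to fold each chunk into a running aggregate so that only one chunk is held in memory at a time. The sketch below shows that pattern with the standard-library `csv` module standing in for pandas. The column names `category` and `amount` are borrowed from the Dask line, and the in-memory sample data is made up.

```python
import csv
import io
from collections import defaultdict
from itertools import islice


def iter_chunks(rows, chunksize):
    """Yield lists of at most `chunksize` rows from any row iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunksize))
        if not chunk:
            return
        yield chunk


def sum_by_category(csv_file, chunksize=2):
    """Sum `amount` per `category`, folding each chunk into running totals."""
    totals = defaultdict(float)
    for chunk in iter_chunks(csv.DictReader(csv_file), chunksize):
        for row in chunk:
            totals[row["category"]] += float(row["amount"])
    return dict(totals)


sample = io.StringIO("category,amount\na,1.5\nb,2.0\na,3.5\nb,1.0\nc,4.0\n")
print(sum_by_category(sample))
```

With pandas, each chunk would be a DataFrame and the inner loop would become a per-chunk `groupby(...).sum()` that is added to the running totals.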

Security

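Never commit notebooks that contain secrets (API keys, passwords).
Keep secrets in environment variables or a .env file, and add .env to .gitignore.

pip install python-dotenv

import os
from dotenv import load_dotenv

load_dotenv()  # loads variables from a .env file, if one is found
api_key = os.getenv('OPENAI_API_KEY')  # None if the variable is not set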
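Because `os.getenv` returns `None` for a missing variable, a misconfigured environment can go unnoticed until an API call fails. The sketch below shows one possible guard, plus a masking helper so that a key never appears in full in saved cell output. The names `require_env`, `mask`, and `DEMO_API_KEY` are hypothetical.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


def mask(secret, visible=4):
    """Show only the last few characters, for safe printing in notebook output."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Demo value only, set here so the example runs; never hard-code real keys
os.environ["DEMO_API_KEY"] = "sk-demo-1234"

api_key = require_env("DEMO_API_KEY")
print(mask(api_key))
```

Failing fast at the top of the notebook keeps the error close to its cause, and masking reduces the chance of a key leaking through committed outputs.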