Jupyter Best Practices
Notebook Organization

Structure notebooks like a document (top to bottom, readable without running):

1. Title & Purpose (Markdown)
2. Imports & Configuration
3. Data Loading
4. Exploratory Analysis
5. Feature Engineering / Processing
6. Modeling / Analysis
7. Results & Conclusions

Cell Discipline

    # Good: one logical unit per cell
    df = pd.read_csv('data.csv')
    df.head()

    # Bad: too much in one cell
    df = pd.read_csv('data.csv')
    df = df.dropna()
    df['new_col'] = df['a'] * df['b']
    model = train_model(df)
    results = evaluate(model)
    plot_results(results)

Reproducibility

Pin Dependencies

    # At the top of the notebook
    import sys
    print(f"Python: {sys.version}")

    import pandas as pd
    import numpy as np
    print(f"pandas: {pd.__version__}")
    print(f"numpy: {np.__version__}")

    # Save exact environment (shell commands, run in a terminal)
    pip freeze > requirements.txt
    conda env export > environment.yml

Set Random Seeds

    import random
    import numpy as np

    SEED = 42
    random.seed(SEED)
    np.random.seed(SEED)

    # For PyTorch
    import torch
    torch.manual_seed(SEED)

    # For TensorFlow
    import tensorflow as tf
    tf.random.set_seed(SEED)

Always Restart & Run All Before Sharing

Hidden state (variables left over from cells that were edited, deleted, or run out of order) is a leading source of notebook bugs. Before sharing:

1. Kernel → Restart & Clear Output
2. Kernel → Restart & Run All
3. Verify all outputs are correct

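One way to spot leftover hidden state before sharing is to inspect the execution counts stored in the .ipynb file. The sketch below assumes the nbformat 4 JSON layout (a top-level `cells` list whose entries carry `cell_type`, `source`, and `execution_count`). The helper names are illustrative and not part of Jupyter.

```python
import json
from pathlib import Path


def execution_counts(notebook: dict) -> list:
    """Return execution_count for each non-empty code cell, in document order."""
    counts = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # nbformat stores source as a single string or as a list of lines
        text = "".join(source) if isinstance(source, list) else source
        if text.strip():
            counts.append(cell.get("execution_count"))
    return counts


def looks_like_clean_run(notebook: dict) -> bool:
    """True if the code cells were executed once each, top to bottom: 1, 2, 3, ..."""
    counts = execution_counts(notebook)
    return counts == list(range(1, len(counts) + 1))


def check_file(path: str) -> bool:
    """Load a .ipynb file from disk and apply the same check."""
    notebook = json.loads(Path(path).read_text(encoding="utf-8"))
    return looks_like_clean_run(notebook)


# A tiny in-memory notebook whose cells were run out of order (2 before 1)
example = {"cells": [
    {"cell_type": "code", "source": "x = 1", "execution_count": 2},
    {"cell_type": "code", "source": "print(x)", "execution_count": 1},
]}
print(looks_like_clean_run(example))  # False
```

A notebook whose outputs and execution counts have been stripped will also fail this check, so it only makes sense on a notebook that still carries its outputs.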
Version Control

.ipynb files are JSON with outputs and execution counts embedded, so diffs are noisy. Solutions:

    # nbstripout: strip outputs before committing
    pip install nbstripout
    nbstripout --install   # installs a git filter in the current repository

    # nbdime: diff and merge notebooks
    pip install nbdime
    nbdime config-git --enable --global
    nbdiff notebook_v1.ipynb notebook_v2.ipynb

.gitignore for Notebooks

    # .gitignore cannot strip outputs (use nbstripout for that),
    # but it can ignore Jupyter's checkpoint directories
    .ipynb_checkpoints/

Parameterization with Papermill

Run notebooks programmatically with different parameters:

    pip install papermill

    # In the notebook: tag a cell as 'parameters'
    # (Add the tag via View → Cell Toolbar → Tags in the classic Notebook interface)
    alpha = 0.01
    n_estimators = 100
    data_path = "data/train.csv"

    # Execute from Python or CLI
    import papermill as pm

    pm.execute_notebook(
        'train_model.ipynb',
        'output/train_model_run1.ipynb',
        parameters={
            'alpha': 0.05,
            'n_estimators': 200,
            'data_path': 'data/train_v2.csv',
        },
    )

    # CLI
    papermill train_model.ipynb output.ipynb -p alpha 0.05 -p n_estimators 200

Converting Notebooks

    # To HTML
    jupyter nbconvert --to html notebook.ipynb

    # To PDF (requires LaTeX)
    jupyter nbconvert --to pdf notebook.ipynb

    # To Python script
    jupyter nbconvert --to script notebook.ipynb

    # To slides (reveal.js)
    jupyter nbconvert --to slides notebook.ipynb --post serve

    # Execute and convert
    jupyter nbconvert --to html --execute notebook.ipynb

Testing Notebooks

    # nbval: pytest plugin for notebooks
    pip install nbval
    pytest --nbval notebook.ipynb

    # testbook: unit test notebook functions
    pip install testbook

    # testbook example
    from testbook import testbook

    @testbook('my_notebook.ipynb', execute=True)
    def test_add(tb):
        add = tb.ref('add')
        assert add(2, 3) == 5

Performance Tips

    # Use vectorized operations, not loops
    # Bad
    result = []
    for val in df['price']:
        result.append(val * 1.1)

    # Good
    result = df['price'] * 1.1

    # Profile before optimizing
    %prun expensive_function()

    # Use chunking for large files
    for chunk in pd.read_csv('large.csv', chunksize=10_000):
        process(chunk)

    # Dask for larger-than-memory data
    import dask.dataframe as dd
    df = dd.read_csv('huge_file.csv')
    result = df.groupby('category').amount.sum().compute()

Security

- Never commit notebooks with secrets (API keys, passwords)
- Use environment variables or .env files

    pip install python-dotenv

    import os
    from dotenv import load_dotenv

    load_dotenv()  # loads variables from a .env file into the environment
    api_key = os.getenv('OPENAI_API_KEY')
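
Since `os.getenv` returns `None` when a variable is missing, a notebook can fail far from the cause. A minimal sketch of failing early and of displaying a secret safely follows. `require_env`, `mask`, and the `EXAMPLE_API_KEY` variable are hypothetical names used only for illustration, and the key value is made up.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set. "
            "Define it in your shell or in a .env file that is not committed."
        )
    return value


def mask(secret: str, visible: int = 4) -> str:
    """Hide all but the last few characters so the value is safe to display."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Demo with a made-up variable name and value (not a real key)
os.environ["EXAMPLE_API_KEY"] = "abcd1234efgh5678"
api_key = require_env("EXAMPLE_API_KEY")
print(mask(api_key))  # ************5678
```

Printing only the masked form keeps the real value out of saved cell outputs, which matters because outputs are stored inside the .ipynb file.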