Organize Your Project Repo Like an Actual Human
A well-organized project repo allows visitors to see all of what they need to see and none of what they don’t.
When we tackle a new project, the temptation arises to jump headlong into a fresh Jupyter notebook with little thought for its future readability. We code and code until we’ve worn out our initial train of thought, at which point we open a new notebook and wear out the next train of thought, and so on until we’re done, but by now a deadline is looming. What follows is a mad dash through a daisy chain of .ipynb files to mark down and comment what we have, with the rather dismal hope that our readers will go easy on us.
It’s a habit worth breaking.
A project repo in an employer-facing portfolio should be more than a hodgepodge of papered-over scratchwork. So what can be done? Think in terms of three main audiences:
1. Non-technical readers with a general interest in your topic and findings.
2. Technically-minded readers with a general interest in your topic and findings.
3. Technically-minded readers with a more thorough interest in every aspect of your project.
Organize your repo so that the first two audiences can reach the materials that will engage them without getting lost in the finer details. But include the finer details too, because the third audience is out there. Observe:
This is just an example, but everything is here.
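For anyone who would rather start from a script than a screenshot, here is a minimal sketch that scaffolds a layout along these lines. The folder names come from the pieces discussed in this post; README.md and presentation.pdf are assumed names for illustration, so swap in whatever your project actually uses.

```python
from pathlib import Path

# Assumed, illustrative names: adjust to match your own project.
FOLDERS = ["appendix", "writeouts"]
FILES = ["README.md", "main.ipynb", "presentation.pdf"]


def scaffold(root):
    """Create the top-level folders and empty placeholder files under root."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    for folder in FOLDERS:
        (root / folder).mkdir(exist_ok=True)
    for name in FILES:
        # touch() only makes an empty placeholder; an empty main.ipynb is
        # not a valid notebook until you create or save one over it.
        (root / name).touch(exist_ok=True)
    return sorted(p.name for p in root.iterdir())
```

Calling scaffold("my_project") returns the names it finds at the top level, which is a quick way to confirm the landing page will stay short.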
The README can serve as a summary and guide for nontechnical readers, as can some version of the final presentation if one was included in your project.
The main.ipynb file and the appendix folder go hand in hand, with main.ipynb serving as the primary project notebook. This notebook summarizes (and only summarizes) all the major code-focused phases of your project in a unified and comprehensive way. The purpose is not to combine an entire project’s worth of cleaning, modeling, and app creation into a single file, but to give more technically oriented readers a guided tour of the highlights. Instead of crowding your repo’s landing page with an endless sea of code files, tuck these more involved notebooks neatly into the appendix folder, and reference them abundantly in the markdown of main.ipynb.
If the code notebooks in your appendix folder are organized in a clear, intuitive way, your primary project notebook writes itself. Include a section in main.ipynb for each major phase of your project, distilling its contents for the general interest reader. A proper introduction is a must, as is a conclusion. Code cells might include demonstrations of specific functions you wrote or imported for cleaning and exploratory analysis, then a fitting and scoring of your final model(s) on their corresponding datasets.
Craft your markdown with care, giving special attention to key takeaways you’d like your reader to absorb. Provide section headings wherever necessary. (Bonus points for adding links to the appendix files when you reference them here in main.ipynb.)
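If typing those links by hand sounds tedious, a small helper can draft them for you. This is a sketch under a few assumptions: the function name appendix_links is made up for this example, the notebooks sit directly inside a folder called appendix, and relative links are enough for wherever your notebook will be read (rendering can vary by host, so check one link before trusting the rest).

```python
from pathlib import Path
from urllib.parse import quote


def appendix_links(repo_root=".", appendix="appendix"):
    """Return a markdown bullet list linking to each notebook in the appendix."""
    root = Path(repo_root)
    notebooks = sorted((root / appendix).glob("*.ipynb"))
    lines = []
    for nb in notebooks:
        # Link relative to the repo root, where main.ipynb lives.
        rel = nb.relative_to(root).as_posix()
        # quote() escapes spaces and similar characters but leaves "/" alone.
        lines.append(f"- [{nb.stem}]({quote(rel)})")
    return "\n".join(lines)
```

Print the result and paste it into a markdown cell of main.ipynb, then trim it down to the links each section actually needs.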
Doing all this allows casual readers to get the most out of your work, while true seekers can easily access items they’d like to explore further.
Finally, the writeouts folder or something similar might be a wise addition for any project that involves long displays of information that would otherwise result in excessive scrolling for the visitor. Most readers will take you at your word about this or that datatype without a df.info() thrown in, but if you’d like to show your work, an elegant use of with open can be the perfect solution, and the writeouts folder provides a destination.
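Here is one way that might look, as a minimal sketch. The helper name write_out is invented for this example, and it assumes whatever report you want to save can write to an open text file; pandas fits that pattern because DataFrame.info accepts a buf argument.

```python
from pathlib import Path


def write_out(report, filename, folder="writeouts"):
    """Run report(f) against an open text file inside the writeouts folder."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    path = out_dir / filename
    with open(path, "w", encoding="utf-8") as f:
        report(f)  # the callable writes its long output to f, not the screen
    return path


# Usage with pandas, assuming a DataFrame named df already exists:
#     write_out(lambda f: df.info(buf=f), "df_info.txt")
```

In main.ipynb you can then point to writeouts/df_info.txt in a sentence of markdown rather than making visitors scroll past the full printout.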
A repo such as this one will serve you well for most simple projects. Depending on your team’s needs and the scope of the project, it may call for some light variations. Though GitHub separates files from folders before alphabetizing items by name, if anyone on your team is using a GUI to view the main project directory, files and folders may get jumbled:
Not the end of the world, but if it’s creating confusion, an easy fix is to begin each folder name with an underscore instead:
This neatens up the display universally. Don’t use this trick on the file names, because your repo hosting site may require that your README be named in a specific way to load automatically for visitors.
For larger projects it may be unavoidable that the notebooks pile on:
There’s nothing inherently “bad” about this. One appendix folder is fine in most cases — but if you or other team members are viewing these items in a cramped visual field, multiple files with similar names can create confusion.
If this happens, it might make sense to add subdirectories to your appendix folder (or now, your appendices folder) so that everything is as self-evident as possible:
Then within each subdirectory, number and name your notebooks accordingly:
And the rest is silence.
We’re often introduced to GitHub and Jupyter notebooks when we have a million other things to learn, all with deadlines. When the ins and outs of Python and the ReLU function and gamma distributions are swarming our minds, it’s easy for basic principles of communication to get swept aside as something extra. Which would be fine, except that it’s completely backwards.
Remember that a project isn’t a project until it’s for someone. By keeping your audience in mind, you’ll always have a project from the very beginning.