Chapter 2 Organization

This section is all about organizing files and folders, which include key points from the following resources:

  1. Cookiecutter Data Science
  2. Jenny Bryan’s slides about organization
  3. Jenny Bryan’s slides about naming things

2.1 Project structure

Before we begin data entry, it is important to have an easy-to-manage directory structure to store files at appropriate location.

2.1.1 Why does it matter to have a good project structure?

A clear self-documenting project structure helps newcomer to understand an analyses without having to read extensive documentation or all of the code to look for specific things. README’s are great, but if it can be made self-documenting, it does not need to be documented.

An example of good project structure adapted from Python’s version from Cookiecutter Data Science is shown below.

├── README.md          <- The top-level README.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── src                <- Source code for use in this project.
│   ├── data           <- Scripts to download or generate data.
│   │   └── make_dataset.R
│   │
│   ├── clean          <- Scripts to clean data.
│   │   └── clean_dataset.R
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.R
└── reports            <- Generated QC reports.
    └── figures        <- Generated graphics and figures to be used in reporting.

2.1.2 Why is this a good project structure?

This is a good project structure because:

  • Self explanatory: The file name and location are very informative about what it is, why it exists, how it relates to other files/directories.
  • It reflects the inputs, outputs and the flow of information.

2.2 Naming files

Having good file names would give you an idea about the information you need about the file. Jenny Bryan has good tips in naming files summarized below.

Good file names are:

Good names

2021-05-20_antarctic-penguins.txt
001-248231_myctophidae-gill.jpg
southern-ocean-jellyfish.docx

File names to avoid

1.txt 
thesis final_Final_FINAL.docx 
nJ7UyiE*.txt 
ça va.txt

2.2.1 Machine readable

Machine readable file names will enable you to easily search for a file or a group of files (globbing) easily using regular expression.

A regular expression is a sequence of characters that specifies a search pattern. (Wikipedia)

To be regular expression and globbing friendly, file names should:

Avoid:

  • special characters
  • spaces
  • punctuation
  • accented characters
  • case sensitivity

Use:

  • delimiters

Examples:

library(here)
list.files(path = here("examples/organization_example-file-names/"))
## [1] "2020-12-11_admiralty-bay_amphipod-trap.txt" "2020-12-11_admiralty-bay_van-veen.txt"      "2020-12-14_maxwell-bay_amphipod-trap.txt"  
## [4] "2020-12-14_maxwell-bay_ikmt.txt"            "2020-12-16_bransfield-strait_van-veen.txt"

ls is a command to list all files within the current (~/Desktop/projects/01_cruise-reports/) directory

Using globbing/regular expression to narrow file listing that which contains the word “van-veen”:

list.files(path = here("examples/organization_example-file-names/"), pattern = "van-veen")
## [1] "2020-12-11_admiralty-bay_van-veen.txt"     "2020-12-16_bransfield-strait_van-veen.txt"

Delimiting the file names also helps to delimit the units of metadata in the file names. For example, the file names above follow the pattern:

<date>_<sampling-station>_<sampling-protocol>.<file-extension>

  • _ underscore delimits units of metadata
  • - hyphen delimits words for readability
file_list <- list.files(path = here("examples/organization_example-file-names/"))
tbl <- stringr::str_split_fixed(file_list, "[_\\.]", 4)
colnames(tbl) <- c("date", "sampling-station", "sampling-protocol", "file-type")
tbl
##      date         sampling-station    sampling-protocol file-type
## [1,] "2020-12-11" "admiralty-bay"     "amphipod-trap"   "txt"    
## [2,] "2020-12-11" "admiralty-bay"     "van-veen"        "txt"    
## [3,] "2020-12-14" "maxwell-bay"       "amphipod-trap"   "txt"    
## [4,] "2020-12-14" "maxwell-bay"       "ikmt"            "txt"    
## [5,] "2020-12-16" "bransfield-strait" "van-veen"        "txt"

2.2.2 Human readable

File name that tells you about the file content saves you time. Similarly, using delimiters as mentioned above helps to make the file names more readable.

These file names contain the same information but delimited differently:

Without delimiters, the name is hard to read.

20201211admiraltyBayVanVeen.txt

Underscores _ to delimit units of metadata. That’s better!

20201211_admiraltybay_vanveen.txt

Underscores _ to delimit units of metadata and hyphen - to separate words for readability. Even better!

2020-12-11_admiralty-bay_van-veen.txt

2.2.3 Orderable

  • File names that start with numbers.
  • ISO 8601 standard for dates.
  • left pad other numbers with zero(s).

Meaningful names start with numbers allow files to be sorted chronologically. Note that date is in ISO 8601 standard format (YYYY-MM-DD).

2020-01-14_notes.txt
2020-02-21_notes.txt
2020-02-22_notes.txt
2020-03-16_notes.txt

If date is in format such as (DD-MM-YYYY), sorting the files does not provide chronological order of events.

14-01-2020_notes.txt
16-03-2020_notes.txt  # notes from March comes before February's
21-02-2020_notes.txt
22-02-2020_notes.txt

If files are not meaningful when ordered with date, they can be named with numeric characters first to be able to order them sequentially. For instance, a folder of images to be added into another document following a certain sequence.

001_myctophidae_diaphus-adenomus.jpg
002_myctophidae_diaphus-agassizii.jpg
...
010_myctophidae_diaphus-danae.jpg
011_myctophidae_diaphus-fragilis.jpg

If the file names are not left pad with zeros, the order will not be chronological as depicted in the example below.

10_myctophidae_diaphus-danae.jpg
1_myctophidae_diaphus-adenomus.jpg
11_myctophidae_diaphus-fragilis.jpg
2_myctophidae_diaphus-agassizii.jpg

2.3 Organizational tips

Here are a couple of quick tips from Jenny Bryan’s organization slides that help to keep your files and folders organised aside from the tips mentioned above:

2.3.1 A quarantine directory

If your collaborator send you data with space-containing file names, data in spreadsheet etc that do not fits your standard naming system and practice, you can place those files in a quarantine directory.

The renamed or exported plain text files can be move to your data directory. Record what you did in a README or comments in your code to remind yourself about the file’s source, if it is from the outside world in a state that is not ready for your programmatic analysis.

2.3.2 Revoke write permission to raw data files

Revoking write permission to raw data files avoid the files to be be accidentally edited by you or someone else.

2.3.3 A prose directory

Sometimes you need a folder to keep key emails, internal documentation, explanations or random documents received. Similar to the quarantine directory, the prose directory can be used to park these things without having to keep the same standard for file names and open formats.