Chapter 2 Organization
This section covers how to organize files and folders. It includes key points from the following resources:
- Cookiecutter Data Science
- Jenny Bryan’s slides about organization
- Jenny Bryan’s slides about naming things
2.1 Project structure
Before we begin data entry, it is important to set up an easy-to-manage directory structure so that files are stored in the appropriate location.
2.1.1 Why does it matter to have a good project structure?
A clear, self-documenting project structure helps a newcomer understand an analysis without having to read extensive documentation or all of the code to look for specific things. READMEs are great, but if the structure can be made self-documenting, it does not need to be documented separately.
An example of a good project structure, adapted from the Python-oriented template of Cookiecutter Data Science, is shown below.
├── README.md          <- The top-level README.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── src                <- Source code for use in this project.
│   ├── data           <- Scripts to download or generate data.
│   │   └── make_dataset.R
│   │
│   ├── clean          <- Scripts to clean data.
│   │   └── clean_dataset.R
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations.
│       └── visualize.R
└── reports            <- Generated QC reports.
    └── figures        <- Generated graphics and figures to be used in reporting.
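A layout like the one above can be created from R in one go. The following is a minimal sketch using only base R; the project root (a folder named my-project under tempdir()) is a hypothetical placeholder, not part of the original template.

```r
# Hypothetical project root; replace with the path to your own project.
project_root <- file.path(tempdir(), "my-project")

# Directories taken from the tree above.
dirs <- c(
  "data/external", "data/interim", "data/processed", "data/raw",
  "references",
  "src/data", "src/clean", "src/visualization",
  "reports/figures"
)

# recursive = TRUE also creates the parent directories (data, src, reports).
for (d in file.path(project_root, dirs)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# Start an empty top-level README.
file.create(file.path(project_root, "README.md"))

# List what was created, relative to the project root.
list.dirs(project_root, full.names = FALSE, recursive = TRUE)
```

Keeping the directory names in a single vector makes it easy to reuse the same skeleton for the next project.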
2.1.2 Why is this a good project structure?
This is a good project structure because:
- Self-explanatory: the name and location of each file are informative about what it is, why it exists, and how it relates to other files and directories.
- It reflects the inputs, outputs and the flow of information.
2.2 Naming files
Good file names tell you what you need to know about a file at a glance. Jenny Bryan’s tips for naming files are summarized below.
Good file names are machine readable, human readable, and orderable.
Good names
2021-05-20_antarctic-penguins.txt
001-248231_myctophidae-gill.jpg
southern-ocean-jellyfish.docx
File names to avoid
1.txt
thesis final_Final_FINAL.docx
nJ7UyiE*.txt
ça va.txt
2.2.1 Machine readable
Machine readable file names enable you to search for a file, or a group of files (globbing), using regular expressions.
A regular expression is a sequence of characters that specifies a search pattern. (Wikipedia)
To be regular expression and globbing friendly, file names should:
Avoid:
- special characters
- spaces
- punctuation
- accented characters
- case sensitivity
Use:
- delimiters
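The rules above can be turned into a quick check. This is a minimal sketch, not a complete validator: the helper name is_machine_readable is hypothetical, and it assumes a strict convention of lowercase ASCII letters, digits, underscores and hyphens, followed by a single extension.

```r
# Hypothetical helper: TRUE when a file name uses only lowercase ASCII letters,
# digits, underscores and hyphens, followed by one dot and an extension.
# perl = TRUE keeps the [a-z] range independent of the system locale.
is_machine_readable <- function(file_names) {
  grepl("^[a-z0-9_-]+\\.[a-z0-9]+$", file_names, perl = TRUE)
}

file_names <- c(
  "2021-05-20_antarctic-penguins.txt",
  "thesis final_Final_FINAL.docx",
  "nJ7UyiE*.txt",
  "\u00e7a va.txt"  # "ça va.txt", written with an escape for portability
)

is_machine_readable(file_names)
```

Only the first name passes. The others fail because of spaces, uppercase letters, a special character, or an accented character. A name such as 1.txt would also pass, which shows that machine readability alone does not make a name informative.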
Examples:
library(here)
list.files(path = here("examples/organization_example-file-names/"))
## [1] "2020-12-11_admiralty-bay_amphipod-trap.txt" "2020-12-11_admiralty-bay_van-veen.txt" "2020-12-14_maxwell-bay_amphipod-trap.txt"
## [4] "2020-12-14_maxwell-bay_ikmt.txt" "2020-12-16_bransfield-strait_van-veen.txt"
ls is the shell command to list all files within the current directory (here ~/Desktop/projects/01_cruise-reports/); list.files() is its R equivalent.
Using globbing/regular expressions to narrow the file listing to names that contain the word “van-veen”:
list.files(path = here("examples/organization_example-file-names/"), pattern = "van-veen")
## [1] "2020-12-11_admiralty-bay_van-veen.txt" "2020-12-16_bransfield-strait_van-veen.txt"
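Note that the pattern argument of list.files() is a regular expression, not a shell-style glob. The sketch below shows both styles on a plain character vector, so it runs without the example directory. The file names are copied from the listing above and the two patterns are illustrative choices.

```r
# File names copied from the listing above.
file_names <- c(
  "2020-12-11_admiralty-bay_amphipod-trap.txt",
  "2020-12-11_admiralty-bay_van-veen.txt",
  "2020-12-14_maxwell-bay_amphipod-trap.txt",
  "2020-12-14_maxwell-bay_ikmt.txt",
  "2020-12-16_bransfield-strait_van-veen.txt"
)

# Regular expression: a December 2020 date, any station, the van-veen protocol.
regex_hits <- grep("^2020-12-[0-9]{2}_.*_van-veen\\.txt$", file_names, value = TRUE)

# Glob: utils::glob2rx() translates a wildcard pattern into a regular expression.
glob_hits <- grep(glob2rx("*_maxwell-bay_*"), file_names, value = TRUE)

regex_hits
glob_hits
```

Because every unit of metadata sits between the same delimiters, each unit can be targeted on its own: the first pattern selects by protocol, the second by station.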
Delimiting the file names also helps to delimit the units of metadata in the file names. For example, the file names above follow the pattern:
<date>_<sampling-station>_<sampling-protocol>.<file-extension>
"_" (underscore) delimits units of metadata
"-" (hyphen) delimits words for readability
# List the example files, then split each name at the underscores and the dot.
file_list <- list.files(path = here("examples/organization_example-file-names/"))
tbl <- stringr::str_split_fixed(file_list, "[_\\.]", 4)
colnames(tbl) <- c("date", "sampling-station", "sampling-protocol", "file-type")
tbl
## date sampling-station sampling-protocol file-type
## [1,] "2020-12-11" "admiralty-bay" "amphipod-trap" "txt"
## [2,] "2020-12-11" "admiralty-bay" "van-veen" "txt"
## [3,] "2020-12-14" "maxwell-bay" "amphipod-trap" "txt"
## [4,] "2020-12-14" "maxwell-bay" "ikmt" "txt"
## [5,] "2020-12-16" "bransfield-strait" "van-veen" "txt"
2.2.2 Human readable
A file name that tells you about the file content saves you time. Using delimiters as mentioned above also makes file names more readable.
These file names contain the same information but delimited differently:
Without delimiters, the name is hard to read.
20201211admiraltyBayVanVeen.txt
Underscores (_) delimit units of metadata. That’s better!
20201211_admiraltybay_vanveen.txt
Underscores (_) delimit units of metadata and hyphens (-) separate words for readability. Even better!
2020-12-11_admiralty-bay_van-veen.txt
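Names in this style can be generated rather than typed by hand, which keeps the delimiters consistent. The following is a minimal sketch; the helper make_file_name and its arguments are hypothetical and simply follow the date, station, protocol pattern used in this section.

```r
# Hypothetical helper: build "<date>_<sampling-station>_<sampling-protocol>.<ext>".
make_file_name <- function(date, station, protocol, ext = "txt") {
  # Lowercase each unit and replace runs of non-alphanumerics with one hyphen.
  slug <- function(x) gsub("[^a-z0-9]+", "-", tolower(trimws(x)), perl = TRUE)
  # as.Date() plus format() guarantees an ISO 8601 date (YYYY-MM-DD).
  iso_date <- format(as.Date(date), "%Y-%m-%d")
  paste0(iso_date, "_", slug(station), "_", slug(protocol), ".", ext)
}

make_file_name("2020-12-11", "Admiralty Bay", "Van Veen")
```

The call returns 2020-12-11_admiralty-bay_van-veen.txt, the same name as the last example above.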
2.2.3 Orderable
- Start file names with numbers.
- Use the ISO 8601 standard for dates.
- Left-pad other numbers with zero(s).
Meaningful names that start with numbers allow files to be sorted chronologically. Note that the date is in the ISO 8601 standard format (YYYY-MM-DD).
2020-01-14_notes.txt
2020-02-21_notes.txt
2020-02-22_notes.txt
2020-03-16_notes.txt
If the date is in a format such as DD-MM-YYYY, sorting the files does not give the chronological order of events.
14-01-2020_notes.txt
16-03-2020_notes.txt # notes from March come before February's
21-02-2020_notes.txt
22-02-2020_notes.txt
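Existing DD-MM-YYYY names can be converted to ISO 8601 programmatically. The sketch below works on a character vector of the names above and only builds the new names; actually renaming files on disk (for example with file.rename()) is left out.

```r
dmy_names <- c("14-01-2020_notes.txt", "21-02-2020_notes.txt",
               "22-02-2020_notes.txt", "16-03-2020_notes.txt")

# Split each name into its date and the remainder after the date.
date_part <- sub("_.*$", "", dmy_names)
rest_part <- sub("^[^_]*", "", dmy_names)

# Parse the date as DD-MM-YYYY, then write it back out as YYYY-MM-DD.
iso_date <- format(as.Date(date_part, format = "%d-%m-%Y"), "%Y-%m-%d")
iso_names <- paste0(iso_date, rest_part)

# method = "radix" sorts in the C locale, independent of the system locale.
sort(dmy_names, method = "radix")  # March lands before February
sort(iso_names, method = "radix")  # chronological order
```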
If ordering files by date is not meaningful, start the names with a number so that they can be ordered sequentially. An example is a folder of images to be added to another document in a certain sequence.
001_myctophidae_diaphus-adenomus.jpg
002_myctophidae_diaphus-agassizii.jpg
...
010_myctophidae_diaphus-danae.jpg
011_myctophidae_diaphus-fragilis.jpg
If the numbers in the file names are not left-padded with zeros, the files will not sort in numerical order, as shown in the example below.
10_myctophidae_diaphus-danae.jpg
1_myctophidae_diaphus-adenomus.jpg
11_myctophidae_diaphus-fragilis.jpg
2_myctophidae_diaphus-agassizii.jpg
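Zero-padded prefixes are easy to generate with sprintf(). This is a minimal sketch using the four species names shown above; the width of three digits is an assumption that suits up to 999 files.

```r
species <- c("diaphus-adenomus", "diaphus-agassizii",
             "diaphus-danae", "diaphus-fragilis")
index <- c(1L, 2L, 10L, 11L)

# "%03d" pads the number with zeros to a width of three digits.
padded <- sprintf("%03d_myctophidae_%s.jpg", index, species)
unpadded <- sprintf("%d_myctophidae_%s.jpg", index, species)

# method = "radix" sorts in the C locale, independent of the system locale.
sort(padded, method = "radix")    # stays in numerical order
sort(unpadded, method = "radix")  # 2 lands after 10 and 11
```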
2.3 Organizational tips
Here are a few quick tips from Jenny Bryan’s organization slides that help keep your files and folders organized, in addition to the tips mentioned above:
2.3.1 A quarantine directory
If your collaborator sends you files that do not fit your naming system and practices, such as files with spaces in their names or data in spreadsheets, you can place those files in a quarantine directory.
The renamed or exported plain text files can then be moved to your data directory. If a file arrived from the outside world in a state that was not ready for programmatic analysis, record what you did in a README or in comments in your code to remind yourself of the file’s source.
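The quarantine workflow can be sketched in a few lines of base R. Everything here is hypothetical: the directories are created under tempdir() so the sketch is safe to run, and the received file is a stand-in. The sketch copies the file instead of moving it, a choice made here so that the original stays in quarantine exactly as received.

```r
# Hypothetical directories, created under tempdir() so the sketch is safe to run.
quarantine_dir <- file.path(tempdir(), "quarantine")
raw_dir <- file.path(tempdir(), "data", "raw")
dir.create(quarantine_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)

# Stand-in for a file received from a collaborator.
received <- file.path(quarantine_dir, "Admiralty Bay Van Veen.txt")
writeLines("station,depth", received)

# Lowercase the name and replace runs of spaces with a hyphen.
clean_name <- gsub(" +", "-", tolower(basename(received)))

# Copy under the clean name; the original remains in quarantine.
file.copy(received, file.path(raw_dir, clean_name))

# Record where the file came from in a README next to the data.
cat(sprintf("%s: renamed from '%s'\n", clean_name, basename(received)),
    file = file.path(raw_dir, "README.md"), append = TRUE)
```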
2.3.2 Revoke write permission to raw data files
Revoking write permission on raw data files prevents them from being accidentally edited by you or someone else.
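In R, this can be done with Sys.chmod(). The following is a minimal sketch on a throwaway file created with tempfile(), where the file itself is a hypothetical stand-in for a raw data file. On Windows, Sys.chmod() can only toggle the read-only attribute, so the exact mode bits shown apply to Unix-like systems.

```r
# Hypothetical raw data file, created in the session's temporary directory.
raw_file <- tempfile(pattern = "raw-data_", fileext = ".txt")
writeLines("station,depth", raw_file)

# "0444" = read-only for owner, group and others.
# use_umask = FALSE applies the mode exactly as given.
Sys.chmod(raw_file, mode = "0444", use_umask = FALSE)

# Inspect the resulting permission bits.
file.info(raw_file)$mode
```

To edit the file deliberately later, restore write permission for the owner first, for example with mode = "0644".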
2.3.3 A prose directory
Sometimes you need a folder to keep key emails, internal documentation, explanations or random documents you have received. Similar to the quarantine directory, the prose directory can be used to park these things without having to hold them to the same standard for file names and open formats.