How to back up and archive your files? There are almost as many answers as there are people managing their data. In this post I attempt to document the methods I use to deal with the ever-growing heaps of bits in my life. I hope it gives you ideas on how to manage yours!
General approach in short:
- Dropbox 2 TB for private projects and archival
- Google Drive 100 GB for work and collaborative projects
- Desktop computer with 1 TB SSD that automatically mirrors all the data in Dropbox and Google Drive
- Laptop computer that automatically mirrors only the most recent data in Dropbox and Google Drive
- Mobile external 2 TB HDD to which I manually back up all the data from Dropbox and Google Drive once in a while
- Mobile external 500 GB HDD for downloaded files and other non-essential data
- Prefix archived folders with their creation date in the format YYYY-MM-DD
- Name folders with searchable tag-like words
- Use the rsync command for the manual backups
Each of the listed storage devices has its purpose. Next, I will explain the motivation behind each of them. Then I will go into detail on how I organise the data so that it remains useful for years to come. Finally, I discuss how I would like to improve the procedures further.
Desktop computer. When I was a child, I had all my data stored on a single computer. Soon, I became aware of the risk of data loss due to a hardware failure. At first, I backed up the most important stuff, namely some sketchy MS Paint drawings and Duke Nukem 3D maps, on 1.44 MB floppy disks. After my family got a CD writer drive, CDs took their place. At some point, part of the burden was delegated to a couple of USB sticks and an external 200 GB hard drive. Although pretty scattered, this continued to be my main data storage and backup method for years.
Dropbox. Around 2010 I came to my senses. Instead of maintaining multiple scattered and more or less duplicated storage devices, I wanted something simple yet robust. Dropbox was the perfect match. I placed all my files on my desktop computer and they were automatically mirrored to the Dropbox cloud. If my machine failed, the data was safe in the cloud. If the cloud failed, the data was still safe on my local machine. As a bonus, I now had easy mobile access to all my files. I was free to discard all my old CDs and HDDs.
Google Drive for work. After a brief stardom of Etherpad, Google Docs stole the show and became the de facto standard for student collaboration at my university. After I started my own business, this way of collaborating expanded to my client projects. As most of these business documents were already in Google Drive, it was only natural to store all my business-related files in Google Drive instead of Dropbox. At first it felt a bit unnecessary to have two cloud storage services, and it still does, but it provides a nice separation between my business data and my private data.
Mobile external HDD for non-essential files. There are many types of data, and not all of them are equally unique or important. As the available storage space on Dropbox is limited and costs money, there is no reason for me to store every downloaded image and ripped music album there. Therefore I have made it a habit that my Dropbox contains almost exclusively the files that I have personally created or to which I or my friends hold the copyrights. Everything else I keep on an external hard drive. If this hard drive fails, I will lose easy access to some movies and music albums, but that is not the end of the world as I can hopefully download them again at any time.
Another mobile external HDD for manual snapshots. For as long as I have used Dropbox, I have carried a fear that one day Dropbox will mess up and my desktop will suddenly sync to an empty directory, discarding all the mirrored files. I trust Dropbox to know what they are doing, but still, Murphy’s law is always there. I have been in the IT business long enough to know that sooner or later things hit the fan. Therefore, to add an extra layer of security, I began taking occasional snapshots of my Dropbox and Google Drive contents around 2020. I store the snapshots on a mobile external HDD. The mobility also brings convenience when I need to work with some old files on my partially synced laptop. Instead of slowly downloading several gigabytes of old photos and videos, I can just copy them from the HDD.
These were my motivations for having all these storage devices. How to organise the files on them is quite a different beast. Let us tame it next.
Organising Files and Folders
It is somewhat counterproductive to maintain various devices to keep your data safe if the data is one giant jungle in which you cannot find anything. After fighting the issue over the years, I have settled on a few habits regarding directory structure, file naming, and documentation. The following descriptions of these habits are not complete but might give you ideas on how to arrange your own files.
Archive as archive/YYYY/YYYY-MM-DD-project-name/. When I first started taking digital photographs at the age of 15, I arranged them into folders by subject. Flower pics went under flowers/ and cat pics went under animals/cats/, et cetera. After a couple of years, I ended this madness. Why? For several reasons. First, such a taxonomy grows huge and keeps growing. Second, even the simplest pictures could fit several of the folders. For example, a picture of a bird taken in winter could fit under both birds/ and winter/. Third, the context of the image is lost. I would have a hard time reconstructing the sequence of photos that together capture what I experienced that day. As the file system allowed only a single hierarchy, I needed a better method.
Therefore, I began grouping photos by their shooting date and other files by their creation date. This was the simplest yet most consistent way to keep things in order. I chose to prefix each folder name with its date in the format YYYY-MM-DD. This ensures that the alphabetical and chronological sorting orders are identical, as the numbers run in order of significance, most significant first.
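As a quick sketch of this convention (the archive layout and project names below are hypothetical examples), creating and listing such date-prefixed folders looks like this:

```shell
# Work in a throwaway directory; names are hypothetical examples.
cd "$(mktemp -d)"
mkdir -p archive/2020/2020-01-15-winter-hike-photos
mkdir -p archive/2020/2020-09-05-norway-cave-trip
mkdir -p archive/2020/2020-11-30-home-office-setup

# Because the most significant digits come first, a plain
# alphabetical listing is also a chronological one.
ls archive/2020/
```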
For the folder name after the date prefix, I prefer a list of tag-like keywords over a grammatically correct title. For example, instead of 2020-09-05-presentation-on-our-exploration-in-norwegian-caves I would rather use 2020-09-05-norway-north-cave-mining-trip-presentation. This helps the search tools of Dropbox and filesystems find relevant content. I also prefer hyphens - and underscores _ over whitespace, as whitespace is sometimes problematic in URLs (encoded as %20) and also requires escaping on the command line. Yet whitespace is nicer for the eye to read, so I find myself a bit inconsistent in this matter.
For photos and other sequential media files, I tend to keep the original filename generated by the camera (e.g. DSCF2134.JPG) and prefix and postfix it appropriately. Occasionally I need to combine photos and videos shot by multiple devices. In those cases, I prefix the original filename with the date and time in the format YYYY-MM-DD-HHMMSS. This allows me to reconstruct the actual sequence of events regardless of how many people or devices captured the material. To enable quick visual and keyword search, I postfix at least a few key photos in the set with keywords that describe the subject. If a photo is a very good one, I also tend to postfix it with STAR. As a consequence, some of my files end up having quite long but practical names, for example: 2020-08-12-141308-DSCF2134-burnt-forest-STAR.JPG.
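A minimal sketch of the timestamp-prefix step, using each file's modification time as a stand-in for the real capture time (in practice the EXIF timestamp would be more accurate; the filename is a hypothetical example):

```shell
cd "$(mktemp -d)"

# Simulate a camera file with a known modification time
# (2020-08-12 14:13:08).
touch -t 202008121413.08 DSCF2134.JPG

# Prefix each file with its timestamp in YYYY-MM-DD-HHMMSS format.
# GNU date shown; on macOS use: stat -f '%Sm' -t '%Y-%m-%d-%H%M%S'
for f in *.JPG; do
  ts=$(date -r "$f" +%Y-%m-%d-%H%M%S)
  mv "$f" "$ts-$f"
done

ls
```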
Why do I not use photo organising software? There are multiple high-quality photo organisers that allow adding keywords and stars to the photo metadata. The reason I do not use them is their poor compatibility across operating systems and devices. Not all search tools search the photo metadata, but they all allow searching by filename. Also, if I commit to one single piece of software, I will experience vendor lock-in, meaning difficulty switching to other software because the original stores its data differently. By using only filenames, regardless of their length, there is no vendor lock-in and compatibility across devices is maximised. I might have a bit of a conservative view on this subject, but this is what I have found to be the best approach for me.
Compress all third-party source code and projects with hidden files. Sometimes my projects contain third-party source code or applications. These tend to mess up search results, as they may contain hundreds or even thousands of files full of keywords. Therefore it is best to compress them if they cannot be removed completely. Projects that depend on hidden files also require archival into a single compressed file. This is because, at least on macOS, copying a directory does not necessarily copy the hidden files. After compression, the package contains all the files, hidden or not. For my usual archival compression, I use tar with gzip:
$ tar -czvf third-party-app.tar.gz third-party-app
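Before removing the original directory, it can be worth checking that the hidden files actually made it into the archive. A self-contained sketch (directory and file names are hypothetical):

```shell
cd "$(mktemp -d)"

# A small stand-in for a third-party project with a hidden file.
mkdir third-party-app
echo 'token=123' > third-party-app/.hidden-config
echo 'hello'     > third-party-app/notes.txt

# Compress, list the archive contents to verify the hidden file
# is included, and only then remove the original directory.
tar -czf third-party-app.tar.gz third-party-app
tar -tzf third-party-app.tar.gz
rm -r third-party-app

# Later, restore everything (hidden files included) with:
tar -xzf third-party-app.tar.gz
```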
Document what the archived directory is about in a README file. Human memory is imperfect. What are these files? Can I remove them? Are they already archived somewhere else? Time after time I find myself opening an old project and wondering about these questions. Therefore, if there is even a slight chance that the project directory will be useful in the future, I tend to write a short README.md or README.txt file at the project root. In the README I briefly explain the purpose of the project and its directories. Although the purpose is probably as clear as day right now, it will not be after a few years.
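For example, a README for an archived project might look something like this (the project and its contents are hypothetical):

```
# 2020-09-05-norway-cave-trip

Photos, GPS tracks and the final presentation from the caving
trip to northern Norway in September 2020.

- photos/  originals, key shots postfixed with keywords
- tracks/  GPX files exported from the handheld GPS
- slides/  the presentation given afterwards

The raw video footage was too large for Dropbox and exists only
on the external 2 TB backup HDD.
```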
Can I modify the archived files? This is a divisive issue to which I have not yet found a single best answer. It is useful to refine the archived files by appending keywords to the names, removing unnecessary files and duplicates, and making the files and directories more usable in general. On the other hand, by keeping the archives intact, you ensure files stay where they were and you can refer to them by their paths. An additional benefit of such append-only archives is that they make manual backups simpler, which is part of our next topic.
Take manual backups of everything. As previously discussed, due to Murphy’s law I do not fully trust leaving all my data at the mercy of Dropbox’s sync algorithm. Therefore I take additional manual backups of my Dropbox and Google Drive contents. Taking a manual backup can be a tedious task: it takes a long time to copy all the files again and again. To back up only the changed files, I use rsync. Rsync is basically a tool that examines the contents of the source and target directories and copies only the difference. Optionally, rsync can also remove files in the target that are missing from the source, thus mirroring the source exactly. What could go wrong?
How to tell an intentional deletion apart from unintentional data loss? This is the problem that troubles me the most, so let me explain it a little further. Let us assume I run rsync to back up my files so that the target becomes an exact copy, a mirror, of the source. Let us then imagine a rare event where the Dropbox sync somehow fails and discards the year 2008 in my archives. I might not detect the missing year for many months. Unaware of the failure, I clean up some of my archived projects and delete a few video files. Then, I mirror the Dropbox contents to the external drive and let rsync delete anything that is no longer in my Dropbox. The problematic result is this: the discarded video files become deleted as intended, yet the whole year 2008 is removed too. How to solve the problem?
One approach would be to keep the archives intact. Once a file is archived, it is destined to stay that way forever. This has a drawback, as previously implied: under this approach the archives cannot be refined. Filenames cannot be changed, no keywords added, and no directories restructured. In other words, I cannot improve the quality of the data once it enters the one-way door of the archives. If I may make a prediction based on my current behaviour, the result would be that unarchived projects pile up at the Dropbox root, waiting for the day when I have enough time to organise the project files so well that I am happy for the rest of my life. Therefore this approach does not work for me.
A strict append-only policy on the external backup drive is another possible approach. This policy lets you modify the Dropbox archives as you wish but does not allow any further modification of the files once they reach the manual backup disk. As a consequence, if I rename an archived directory, its contents will be duplicated on the backup. The backup directory structure will grow messy but ensures that the data exists. This might not be the optimal method if you have work-in-progress projects that exceed several gigabytes or if you need your backup disk to be as nicely organised as your Dropbox.
Rsync dry run to the rescue! Yet another approach is to do a possibly destructive mirror backup, but only after you have ensured that all deletions are valid. You can do this by running rsync with the --dry-run flag before the actual run. This gives you something close to the best of both worlds. You can modify and refine your archives, and yet your manual backups are safe from rare mishaps. The price to pay is the additional burden of inspecting the dry-run output to ensure everything will go as you expect. To do this, use the following two commands:
$ rsync -avh --dry-run --delete ~/Dropbox /Volumes/BACKUP
$ rsync -avh --delete ~/Dropbox /Volumes/BACKUP
The full documentation of the flags is available in the rsync man page. Briefly put:
- -a ensures that modification times, permissions, groups and so on are copied as-is.
- -v is for verbose output.
- -h is for human-readable file sizes.
- --delete ensures that files that exist at the destination but not in the source will be deleted.
To keep your mind at ease: the source directory always stays unaltered, and any deletions and modifications happen only at the target.
Document your backup methods. Whatever way you decide to organise your files, once you have established a good one, write it down. Document especially your modification policies and shell commands. Although the main things are easy to remember, the details will become hard to recall over time. In the worst case, if you mess up, your data is lost. A clear and brief README.txt file at the root of your backup device is sufficient.
Be safe. Stay sharp.