Monday, March 8, 2021

How I Organize My Data

Astrophotography generates a lot of data -- what is one to do?  Between different cameras, telescopes, targets, and months, how do you keep track?  I've only been doing this hobby for 5-1/2 years, and I already have over 12 TB of data!

Everyone develops their own organization scheme, but mine has served me particularly well, and I'd like to share it.  Maybe some parts of it will help you!

First and foremost -- keep a logbook!


A logbook is an essential part of any scientist's toolkit -- take Adam Savage's advice on it.

Whether it's hand-written in a journal or notebook or typed up on a computer, it's important to keep track of the basics of every night you observe: what gear you used, weather conditions, things that went well, things that broke, etc.  To make life easier for myself, I take notes in Google Keep's sticky-note-esque app (available on smartphones and in any web browser), and later dump all those notes into a Word document with more details.  I number every night in sequential order, and I've been keeping notes since my very first night of observing!

Example note in Google Keep

I have semi-permanent setups in my backyard, so my equipment configurations don't change that often, but if yours do, then make sure to write down the gear you used.

Image Organization: In the Morning


Every morning, I bring my data acquisition laptops (DAQs) inside (I don't have a permanent computer housing built yet) and pull all of the images off of them onto a flash drive, and then over to my desktop computer.  I highly recommend copying the data off your laptop rather than cutting it; leave it on your laptop until you are sure it is safely transferred to your image processing computer.  Sometimes storage drives can have weird faults that wipe all your data, or data can be corrupted.  I check all my image files before deleting them from my DAQs.

On my "Stacks" hard drive (an 8 TB drive dedicated to deep sky images), I have a folder called "Backyard - To Process."  Within that folder are sub-folders for all of the targets on which I am currently taking data or haven't yet attempted to process.  One folder at the top is called "_to sort" (the underscore keeps it at the very top of the list).  When I copy the images off my DAQs, they go into a folder of the night's date.

The older folders have planetary & lunar data I haven't had time to deal with!

After scanning through all of last night's images using the Blink tool in PixInsight (or if you have DSLR images, you can just open them using Windows Photo Viewer or whatever other image viewer), I shuffle them out to their target folders in the "Backyard - To Process" directory.


The green tick marks are made by right-clicking the folder, clicking Properties, going to the Customize tab, and selecting "Change Icon."  It's an easy way to spot which datasets I have deemed ready to process.

Inside each of those target folders is another set of folders: lights, cal, finals, and PixInsight.  The light frames go into the "lights" folder (separated further by filter, if needed); corresponding master darks and flats go into the "cal" folder (copied over from my dark and flat libraries -- more on that in a minute); "PixInsight" is the folder in which I do my processing; and "finals" is where I keep final copies of the images.


Since I use this template for every dataset, I finally wrote myself a simple batch script to generate these folders and a copy of my metadata text file template (more on that in a bit).  They're very simple to make: create a new text file (right-click an empty spot in the folder window, New->Text Document), name it "something.bat" (no quotes), and open it with your preferred text editor (right-click, Open With-> choose text editing program).  Mine looks like this:
:: create the standard folder template
mkdir lights
mkdir cal
mkdir finals
mkdir PixInsight
mkdir PixInsight\processes

:: drop a fresh copy of the metadata template into "finals" and rename it
copy "Q:\_Stacks\stats format.txt" finals
ren "finals\stats format.txt" stats.txt

"mkdir" means "create directory;" "copy" means, well, copy (first argument is "copy from" location, second argument is "copy to" location); and "ren" means "rename" (first argument is the file location and name that you want to rename, the second is what you want to rename it to).  

To execute the batch file, copy it into the folder where you want the new folders created and double-click it.  It will run quickly, and then you can delete the copy of the batch file.  If you want to get even fancier and move all existing images into the "lights" folder, you can add:

move *.fit lights

where * is a wildcard meaning "any filename," and .fit is the extension my image files are saved with.

Don't forget: if a directory or file has spaces in its name, you need to put the whole filepath in quotes (like I did in the "copy" line above).

Linux and Mac have different commands, but a similar idea.  (If you use Linux, I hope you already know how to do this!)

Image Organization: Each Dataset


First, I have a different hard drive for each type of data: deep sky, planetary, timelapse, and miscellaneous (nightscapes, images collected for competitions, solar/lunar eclipses, other people's data that I've helped them process, pictures of my telescope setups, and whatever else doesn't have a home).  Having different hard drives is just a result of having too much data to fit on a single drive, so I broke it up by logical categories.

In general, I organize my data in this hierarchy: target, attempt.  Inside each attempt is the same setup as in "Backyard - To Process," with the cal, lights, finals, and PixInsight folders.


An "attempt" on a target can be one night, or many nights, but it's all the data I am going to combine into a single, final image.  Occasionally, I go back and combine multiple datasets; those combinations would go into the most recent attempt folder that is included in that combination.  For example, if I combine data from Lagoon #4 and Lagoon #5, the processing steps and final images would go into the Lagoon #5 folder.

Metadata File


Even if you have a keen memory, once you get enough datasets rolling, it becomes easy to forget which gear you used, where you took the images, etc.  The best way to combat that is to write it down and keep it with the dataset.  In the "finals" folder, I make a simple text file called "stats.txt" that holds all that info in a standardized template I developed.  Text files are nice because they are readable on every platform, for free, and will be for a very long time.  My preferred app is Notepad++, but you can even just use the simple Notepad app that's built into Windows, or vim on Linux if you really hate yourself, or whatever text editor you prefer.
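As a sketch, the template holds fields along these lines (the values here are made up for illustration -- adapt the fields to whatever you'll actually want to look up later):

Target: NGC 7662 (Blue Snowball Nebula)
Dates: 2021-03-07
Location: Backyard
Telescope: C8 + focal reducer
Camera: ZWO ASI1600MM Pro
Filter: CLS
Subframes: 120 x 30s at -20C
Total integration: 1h 0m
Notes: light haze after midnight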


In addition to having a text file with each dataset, I also keep a summary of all of these text files in an Excel spreadsheet for easy searching.  It's sortable and filterable, so I can quickly do things like find which target attempts use compatible gear for combining datasets; find example images for comparisons between telescopes, cameras, techniques, etc.; see when I last imaged a target; see if I need to re-do a target now that I have better skills; all sorts of things.  It's also handy for when I'm at a star party or outreach event and someone asks, "How long was this exposure?" or "What telescope/camera did you use?" and I can quickly look it up from my phone.

Green highlight means "re-process with more skill;" yellow highlight means "need more data."

Processing Files


Inside the "PixInsight" folder in the attempt folder, I have more folders that contain my processing steps.  I number them sequentially so that it's easier to go back and re-do steps if I don't like the result.


In addition, I keep notes in the metadata file with what processing steps I used and some details about them as needed (what type of pixel rejection I used in stacking, how many iterations of deconvolution I did, which subframe I used as the reference frame for registration, etc).  

Deleting Data


I never delete entire datasets, even if they seem like crap.  For one, they might actually be fine, and I just don't have the skill to process them yet.  Or, if they truly are crap, they make useful teaching examples for how to spot clouds and bad tracking, or can even help diagnose problems with your gear, like frost spots or burrs on a gear.  (I do delete bad subframes within a dataset, although sometimes I set them aside for further analysis or to use as examples.)  It's also fun to go back and see how bad my images used to be and how far I've come :)

To keep dataset size down, once I'm done processing, I delete all of the pre-processing data: the calibrated, approved (from SubframeSelector), debayered, and registered subframes.  But I keep the original stacked data to start re-processing from (I don't often have to go back and re-stack data after I've given it a few attempts), and I keep the matching calibration files (master darks and flats) with the dataset so I can easily re-generate the pre-processing frames if needed later on.  This saves enormously on dataset size, especially now that I gather 20-30 hours of total exposure time per target.
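That cleanup step can be scripted, too.  A minimal sketch, assuming each type of intermediate frame lands in its own folder (check your actual folder names before running anything with rmdir!):

:: delete intermediate pre-processing frames -- folder names are examples
rmdir /s /q calibrated
rmdir /s /q approved
rmdir /s /q debayered
rmdir /s /q registered

The /s flag removes the folder and everything inside it, and /q skips the "are you sure?" prompt.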

File Naming Convention


Subframes


I use Sequence Generator Pro to do my data acquisition, and you can program the file naming convention right into the sequencer.  They've even got a little button that shows what all of the reference key symbols mean, and there are a ton of bits of information you can include in the filename.  My personal preference is a filename like "ngc-7662_30s_-20C_CLS_f202.fit," which has the important pieces of information that change from image to image for my setup: target name, exposure time, camera temperature, filter name, and then the frame number.  (I always use the same gain, offset, and binning, and I don't yet have a rotator to need the angle.)  I also like to have the images for a given target stored in a folder of that target name.  So my filename convention in SGP is this: "%tn\%tn_%el_%ct_%fe_f%04" -- the target name as the folder, then target name, exposure length, camera temperature, filter, and frame number in the filename.

Other metadata, such as RA/dec, the gain value, and anything else SGP knows because I've programmed it into the Equipment Profile (pixel scale, focal length, and more), is saved in the FITS header (which can be accessed in programs like PixInsight, FitsLiberator, and others).

Final Images

After I'm all done processing an image, it's time to save it out in a few formats: XISF (PixInsight's preferred format), TIFF (for high-quality digital copies and for printing), and JPG (for posting on social media and keeping a copy on my phone).

The filename I give my final files leads me straight back to where their original data are stored.  For example, Orion Nebula #15's final is named orion_15_1_5.  The convention goes: target-name_attempt_stack_process.  Each new attempt at imaging a target increments the attempt number.  Each different time I stack it (whether in different software, using different stacking settings, or mixing with other data) increments the stack number.  And each post-process (applying different techniques post-stacking) increments the process number.  So orion_15_1_5.jpg is the Orion Nebula, attempt #15, stack #1, process #5.



This way, when I have just the jpg on my phone, I immediately know where to go looking for the image acquisition details (like exposure time, camera, telescope, location, etc.), either in the metadata file or the Excel spreadsheet.  (This saved me after AstroBin's massive data loss event -- I name my images on there with their attempt numbers, like "M42 Orion Nebula #15," so it was easy to figure out which file I needed to re-upload!)

Calibration Libraries


News flash: you don't have to take new darks and flats every night you image.  You can build libraries of calibration files that you can re-use, depending on the circumstances.

Darks


With cooled cameras, it's relatively easy to generate dark libraries, since you can set the camera temperature (within the limits of what your camera can cool to, given the ambient temperature).  To build my dark library, I would set my camera out on the back porch, cover it with a bin and a blanket for greater darkness, run the power and USB cables inside, and then use SGP to run an all-night sequence of various exposure times at my selected temperature and gain.  I've even taken darks in my refrigerator when I needed a set I didn't have and it wasn't cold enough outside to match some recently-acquired data!

For darks, you only need new master darks under the following circumstances:
  • Different camera temperature
  • Different gain/offset
  • Different binning
  • Different exposure time
  • Different camera (even if it's the same model)
  • Periodically, as the electronics and sensor characteristics can change over time (my darks from three years ago no longer match darks I've taken more recently on my ZWO ASI1600MM Pro, so I'm having to re-do them)
In my "Dark Archives" folder on my Stacks drive, I have my dark subframes and master darks organized by camera, then by temperature, then by gain, then by exposure time.  (If I binned, which I do for my CCD camera but not for my CMOS cameras, there would also be a 1x1 or 2x2 set of folders).  Inside of each bottom-level folder (exposure time) is the master dark, as well as the subframes (so I can re-stack if needed).


Thanks to all my upfront effort building my dark library, I haven't had to take new darks on my ZWO ASI1600MM Pro in over a year.

Flats


Flats are a little more complicated -- at least, if you have a non-permanent setup.  Flats need to be re-taken under the following circumstances:
  • Different gain
  • Different filter
  • Different telescope, reducer, or other optic-train component (even non-optics components can change your flat -- like additional vignetting from a new filter wheel, adapter, or off-axis guider)
  • Different camera (even if it's the same model)
  • Every time you either rotate your camera or remove it from the telescope
The main things that flats address are vignetting and dust bunnies.  If you rotate your camera at all, you need a different set of flats because a) the dust bunnies will be in different places (unless they're on your camera sensor or window itself, of course) and b) the location of the vignetting may change since the camera is unlikely to be smack in the middle of your image circle, and because most sensors are rectangular.  

To deal with this, I organize my flats in the following hierarchy: first by camera, then by optics train (for example, "C8, focal reducer, Esatto focuser, ZWO EFW"), then by date, then by filter.
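So a given morning's flats land somewhere like this (the date and filter are just examples):

Flats\
    ZWO ASI1600MM Pro\
        C8, focal reducer, Esatto focuser, ZWO EFW\
            2021-03-07\
                CLS\
                (one folder per filter)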



Unless your telescope lives in a laboratory-grade clean room, yes, you will need new flats every time you set up and tear down, and for each different filter.  And to capture the dust bunnies, you'll need to be in focus -- so I always take my flats the next morning, after having focused on the stars during the night.

Backups, Backups, Backups!!


Imagine if your astrophotography data hard drive failed tomorrow.  How devastating would that be?  Years of data and many, many hours of hard work, gone.  Backing up your data is vitally important.

Local Backup


For local backup, I have several external hard drives that I sync to about once a month using free software (FreeFileSync for me, but there are plenty of options out there).  Each external hard drive mirrors one of my internal hard drives.  They're also handy to bring to star parties for on-site processing when I only have my laptop.  The rest of the time, they live in a fireproof, waterproof safe to help ensure their survival in case of fire.  It's also important that they're not plugged in continuously, so that they're protected from power surges and lightning strikes.
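FreeFileSync gives you a nice GUI for this, but even Windows' built-in robocopy can do a basic mirror.  A sketch, with example drive letters (careful: /MIR also deletes files from the backup that you've deleted from the source):

robocopy "Q:\_Stacks" "X:\_Stacks" /MIR /R:2 /W:5

The /R and /W flags just limit how long it retries on locked or unreadable files.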

I'm eventually going to set up a NAS (network-attached storage): a set of hard drives in an external enclosure, arranged in a RAID configuration that mirrors data across drives to keep it safe from a single drive's failure.  All hard drives eventually fail, especially spinning-disk drives, which typically only have a lifespan of 3-5 years.

Online Backup


Local backup is still problematic; you can't keep a NAS in a fireproof safe, and you might forget to unplug your machine or NAS during a lightning storm (especially if storms are frequent where you live!).  Online backup lets you store a copy of your data offsite, usually distributed across many servers around the country or world.  These services have better data reliability than a NAS or external hard drives you manage yourself, and your data is safe even if your computer or house is destroyed.

These services come at a cost that depends on the provider and the type of service.  Some allow free backup but charge to download your data back; this is known as catastrophic backup.  Some have different pricing tiers for the amount of data you can store and the maximum file size.

I personally use Backblaze.  It's $6/month for unlimited file storage, and it will back up continuously if you set it to.  If you lose your data, you can either re-download it (which will take a long time if you have a lot of it, like I do) or pay to have them ship you a drive -- currently $189 for up to 8 TB (but they'll refund you if you ship it back within 30 days).

The only issue with online backup I've run into so far is butting up against data caps.  Comcast has a data cap of 1.2 TB in most states right now (up from the usual 1 TB due to the pandemic...woo-hoo), which means I can only upload about 800 GB per month, since I use about 400 GB in my daily life (video calls, Netflix, YouTube, etc.).  You do get one courtesy month where they won't charge you extra ($10 per 50 GB, I think), and in that month I did manage to upload about 4 TB, but with 12 TB of astro data plus another 1-2 TB of personal data and images, it's taking a long time to get backed up.  I've been working on it since October (it's now March), and I have to keep an eye on my data usage and stop the backup once I get over 800 GB.  It's a pain, but once the initial upload is complete, I should easily be able to stay under the data cap -- I probably only generate about 100 GB a month, depending on how much processing I do, whether I do timelapses, and whether I've been to a star party.  :)

Back it up!


It's worth your while to pursue both options.  Imagine your life without your hard-won astro data!

Conclusion


Dealing with so much data can be a challenge, but I totally love the system I've developed and it works great for me.  I know pretty much exactly where to find everything, in short order, and I know everything about how I created that final image.  Put some thought into how you want to organize your data and try it out.  A little planning ahead can go a long way.  I'm eventually going to write more scripts to do more automation for me (like grabbing the matching master flats and darks for my datasets!).  

Best of luck!