The Sunshine State Digital Network (SSDN) is a service hub for the Digital Public Library of America (DPLA), and has its administrative home here in FSU’s Special Collections & Archives division. One of SSDN’s main activities is collecting metadata for digital collections around the state and passing it along to DPLA for inclusion in their portal.
Way back in 2018 we had our first harvest of 94,557 records from 4 partners. Since then contributions have more than tripled to 319,310 records from 28 partners. With this increase have come growing pains. Execution time for a harvest has ballooned to around six hours. More burdensome is that the entire harvest runs as a single process. Any errors occurring along the way require restarting the harvest from the beginning.
PA Digital takes the credit for recognizing that harvest tasks look a lot like what the technology sector has identified as Extract, Transform, Load (ETL) processes. There are many ETL software solutions available. PA Digital moved their metadata collection activities into Apache Airflow, and now SSDN is following suit.
Moving into Airflow has many advantages. One is the design philosophy of tasks within it. Airflow manages the execution of steps, so they don’t have to be bundled together into one monolithic process. Ideally the work is separated into steps that are as small as possible, so that an error affects only the step where it occurred and that step can be fixed and rerun while the rest of the workflow continues running. Airflow also provides a web-based dashboard to monitor and control harvest tasks, which makes identifying errors in the process easier.
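To make the idea of small, independently recoverable steps concrete, here is a minimal sketch in plain Python. It is not Airflow code and not SSDN's actual harvester: the `Task` class, the `run_pending` function, and the partner names are all hypothetical, and the "harvest" just appends a name to a list. In Airflow itself, tracking each step's state and rerunning failed steps is handled by the scheduler rather than by hand like this.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    """One small step of the workflow (hypothetical, for illustration only)."""
    name: str
    run: Callable[[], None]
    status: str = "pending"  # "pending", "success", or "failed"


def run_pending(tasks: List[Task]) -> Dict[str, str]:
    """Run every task that has not yet succeeded.

    A failure is recorded on the task that raised it and does not
    stop the remaining tasks from running.
    """
    for task in tasks:
        if task.status == "success":
            continue  # already done, no need to repeat it
        try:
            task.run()
            task.status = "success"
        except Exception as exc:
            task.status = "failed"
            print(f"{task.name} failed: {exc}")
    return {task.name: task.status for task in tasks}


# Demo with three made-up partners; partner_b fails on its first attempt.
harvested: List[str] = []
attempts: Dict[str, int] = {}


def make_harvest(partner: str, fail_first: bool = False) -> Callable[[], None]:
    def harvest() -> None:
        attempts[partner] = attempts.get(partner, 0) + 1
        if fail_first and attempts[partner] == 1:
            raise RuntimeError("feed timed out")
        harvested.append(partner)
    return harvest


tasks = [
    Task(f"harvest_{p}", make_harvest(p, fail_first=(p == "partner_b")))
    for p in ("partner_a", "partner_b", "partner_c")
]

first_pass = run_pending(tasks)   # partner_b fails, the others succeed
second_pass = run_pending(tasks)  # only partner_b is run again
```

On the second pass only the failed step runs again, which is the contrast with a single monolithic process that has to start over from the beginning.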
SSDN is excited about this evolution in its harvesting infrastructure. The first live harvest is scheduled for November of this year.