Data Services Newsletter

Volume 3 : No 2 : June 2001

IRIS DMC Data Transcription

The Process of Implementing New Technologies

“What doesn’t kill you, makes you stronger”

This article explains the data transcription process that the IRIS DMC undertook during the first six months of 2001. Because we operate a very active archive, it should give some insight into how waveform data are transcribed to new media within the new StorageTek Powderhorn mass storage robot, and into what is done to keep the archive in peak condition. This is, after all, the point of doing transcription in the first place.

When data are stored on media (tape, disk or paper), it is a fact of life that these media will become outdated and/or wear out. In our case, this happens about every 4-5 years. For this reason, it is an accepted fact that all data will eventually need to be read back and migrated, or transcribed, to new media. This was necessary in 1992 and 1997, and we have just completed our third transcription.

Our mass storage file system currently holds about 2.5 million files, totaling about 19 terabytes, where each file contains all of the data generated at one station for one day. These are what we call “station/day” files. Since we have archived data from over 100 different networks, spanning the period from 1970 to the present, we take stewardship of the data very seriously. Equally important is the need to preserve a good sort order for the data, so that servicing requests is optimized. The DMC holdings are stored in two sort orders: by time, and by station. This allows us to quickly stage data back both for event-based requests, such as earthquake gathers, and for single-station requests. Not only do these sort orders minimize tape loads into drives, but the second copy also gives us a built-in backup, ensuring that we always have access to the data if media in one copy go bad.
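As a rough illustration of the two sort orders, here is a minimal Python sketch. The `StationDay` record, the helper names, and the sample network and station codes are assumptions made for the example and do not describe the DMC's actual software. The last line works out the average station/day file size implied by the figures above, assuming decimal units (1 terabyte = 10^12 bytes).

```python
from collections import namedtuple
from datetime import date

# Hypothetical record for one station/day file (illustrative only).
StationDay = namedtuple("StationDay", ["network", "station", "day"])

def time_sorted(files):
    """Order files by day first, so one event's data sit together."""
    return sorted(files, key=lambda f: (f.day, f.network, f.station))

def station_sorted(files):
    """Order files by station first, so one station's history sits together."""
    return sorted(files, key=lambda f: (f.network, f.station, f.day))

# Sample values chosen only for illustration.
files = [
    StationDay("IU", "ANMO", date(2000, 1, 2)),
    StationDay("II", "PFO", date(2000, 1, 1)),
    StationDay("IU", "ANMO", date(2000, 1, 1)),
    StationDay("II", "PFO", date(2000, 1, 2)),
]

by_time = time_sorted(files)
by_station = station_sorted(files)

# Average file size implied by about 19 terabytes in about 2.5 million files.
average_mb = 19e12 / 2.5e6 / 1e6
```

In the time-sorted copy, the files for a given day are adjacent, which suits event-based requests. In the station-sorted copy, a single station's files are adjacent, which suits single-station requests. The implied average works out to roughly 7.6 megabytes per station/day file.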

The process of transcription begins with staging back all the data from one network for one year from the old media. These data are then parsed, and the exact file sizes for each channel at each station are compared to the Oracle database, where internal synchronization between the database and waveform data is performed. This is useful in determining any inconsistencies between what the database says we have and what resides in the waveform files. This two-way check verifies that we are internally consistent before we go on to stage 2, synchronizing with the network operators that originally submitted the data. This step is important for verifying that we have all the data that the network operator recorded. (As you might suspect, we also find data that the network operator submitted but didn’t intend to, as this is a two-way synchronization.) Currently, the only data collection centers with which we are able to synchronize our holdings are the IRIS nodes of the Data Management System: Albuquerque Seismic Laboratories, IDA at UCSD, and PASSCAL. We intend to extend the synchronization mechanism to others within the FDSN, to regional network operators, and to anyone else who submits data to the DMC.
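The two-way check described above can be sketched as a comparison of two size tables. This is only an illustration under stated assumptions: the function name `compare_holdings`, the `(station, channel, day)` keys, and the sample sizes are hypothetical, and the sketch does not show how the DMC actually queries its Oracle database.

```python
def compare_holdings(db_sizes, archive_sizes):
    """Compare two {key: size_in_bytes} tables in both directions.

    Keys are assumed to be (station, channel, day) tuples.
    """
    db_keys = set(db_sizes)
    archive_keys = set(archive_sizes)
    shared = db_keys & archive_keys
    return {
        # The database lists the data, but no waveform file was found.
        "missing_from_archive": sorted(db_keys - archive_keys),
        # A waveform file exists, but the database has no record of it.
        "missing_from_database": sorted(archive_keys - db_keys),
        # Both sides have the entry, but the byte counts disagree.
        "size_mismatch": sorted(
            k for k in shared if db_sizes[k] != archive_sizes[k]
        ),
    }

# Hypothetical sample data for one station and day.
db = {
    ("ANMO", "BHZ", "2000-01-01"): 4096,
    ("ANMO", "BHN", "2000-01-01"): 4096,
    ("ANMO", "BHE", "2000-01-01"): 8192,
}
archive = {
    ("ANMO", "BHZ", "2000-01-01"): 4096,
    ("ANMO", "BHE", "2000-01-01"): 4096,
    ("ANMO", "LHZ", "2000-01-01"): 512,
}

report = compare_holdings(db, archive)
```

The same comparison could, in principle, be run a second time with a network operator's inventory in place of the database table, which is why data can turn up on either side of the report.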

Once we have determined that we have all available data for this one-year time period (or, where data volumes are very large, for as little as two months), we begin staging these data to the Powderhorn mass storage machine. There, a UNIX file system called SAM-FS, commercially available from LSC, is configured to associate either time-sorted or station-sorted data with archive_sets, enabling data to be streamed to designated tapes in an efficient manner. Without the ability to control the writes to media, a random sort order would take over, and a request for event data would likely require at least an order of magnitude more tape mounts. As important as this process is to the health of the archive, we remain very aware that we must simultaneously service some 4,000 requests for data per month. By engineering the transcription system the way we have, we have avoided any interruption in service to the community.
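To see why controlling the write order matters, consider the toy model below. It is not SAM-FS configuration and makes no claim about real tape capacities: the station count, the number of files per tape, and all function names are assumptions chosen only to make the effect visible.

```python
import random

def assign_to_tapes(files, files_per_tape):
    """Write files to tapes in the given order; return {file: tape_index}."""
    return {f: i // files_per_tape for i, f in enumerate(files)}

def tapes_needed(tape_of, requested):
    """Count the distinct tapes that must be mounted for a request."""
    return len({tape_of[f] for f in requested})

# Assumed toy archive: 100 stations, 365 days, one file per station/day.
stations = [f"S{n:03d}" for n in range(100)]
all_files = [(day, sta) for day in range(365) for sta in stations]

# Time-sorted copy: files are written in (day, station) order.
time_sorted_tapes = assign_to_tapes(all_files, files_per_tape=500)

# Uncontrolled writes, modeled here as a random order (fixed seed).
shuffled = all_files[:]
random.Random(0).shuffle(shuffled)
random_tapes = assign_to_tapes(shuffled, files_per_tape=500)

# An event-style request: every station for a single day.
event_request = [(42, sta) for sta in stations]

mounts_sorted = tapes_needed(time_sorted_tapes, event_request)
mounts_random = tapes_needed(random_tapes, event_request)
```

In this model the time-sorted layout serves the whole request from a single tape, while the random layout scatters the same 100 files across many tapes. The exact ratio depends entirely on the assumed sizes, so it should not be read as a measurement of the real archive.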

Because the IRIS DMC's holdings continue to grow at a nonlinear rate, we have had to look down the road to the next period of transcription as well. This is one reason we chose StorageTek to help us: they have a migration path built into their product cycle that will allow both old and new tape media to be managed simultaneously.

by Rick Benson (IRIS Data Management Center)
