[webservices] A question of location ID, how to represent empty IDs in XML?

Thu Jul 24 11:43:31 PDT 2014

Hi Chad/Philip,

thanks for reviving this discussion on the appropriate mailing list.

Chad Trabant [07/23/14 19:30]:
> Some background: In the SEED channel naming scheme there is a
> hierarchy of network, station, location and channel identifiers.  Of
> these, it is only the location ID that is commonly accepted to be
> empty.  In the SEED format the location ID is a two-character field,
> where the value is left justified and padded with spaces if needed.
> When the value is empty the field is simply two spaces of padding.
>
> Historically, and presumably to avoid having an empty location ID,
> the DMC has represented “empty” location IDs as a string of two
> spaces.

Note that the padding spaces do not form part of the location code 
string itself, according to the SEED specification, which only allows 
alphanumeric characters.

In fact, the SEED specification treats the location code no differently
from, e.g., a station code, from which trailing spaces are removed by
every piece of software that I know of.

BTW, I think the two spaces are not there to avoid having an empty
location ID, but are a relic from Fortran 77 days. :)

> Following this practice, we express this in StationXML by setting
> the locationCode attribute to a string of two spaces.  We have done
> this so long we sometimes forget that it is not compliant with a
> strict reading of SEED, at best it falls into the vagaries of SEED,
> on the other hand we have been doing it for years with no apparent
> problems (in fact it has helpfully avoided an empty core
> identifier).

On the other hand, even in the IRIS ecosystem the empty location code is
prominently represented as an empty string. Not everywhere, but e.g. the
well-known rdseed program removes the trailing spaces when reading SEED, 
resulting in an empty C string if there are two padding spaces in the 
location code field. A very natural way of dealing with the trailing 
spaces, especially in view of the clear specifications in the SEED 
manual. Also in the IRIS BUD file name convention (e.g. [1]), empty 
location codes become empty strings, with no apparent problems with 
mapping or otherwise.

> There now exists another fdsnws-station implementation that returns
> StationXML with the locationCode attribute set to an empty string
> when the SEED value is empty.  The justification is that this
> follows the SEED rules of trimming the padding spaces from the
> values.
>
> Unfortunately this means there are now flavors of StationXML that
> are incompatible in the core channel name identifiers.  In other
> words, two StationXML documents for the same SEED channel appear,
> without extra field translation, to be different channels.

This depends on how you evaluate the location code. If you simply follow 
the SEED specification and always trim the location code, like e.g. 
ObsPy and rdseed do, the problem you describe is avoided altogether.

Of course, the requirement for removing trailing white space doesn't 
come without the cost of a few more CPU cycles. But if that were an 
issue we wouldn't be using XML, would we? Also, this rule would need to 
be written into the future specification of FDSN StationXML.
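
As a minimal sketch of that rule (Python; the helper names trim_code and
stream_id are made up for illustration and are not taken from ObsPy or
rdseed):

```python
def trim_code(raw: str) -> str:
    """Strip SEED padding: trailing spaces are not part of the code."""
    return raw.rstrip(" ")


def stream_id(net: str, sta: str, loc: str, cha: str) -> str:
    """Build a dotted stream ID from trimmed codes."""
    return ".".join(trim_code(code) for code in (net, sta, loc, cha))


# The same channel as two services might deliver it:
a = stream_id("GE", "UGM", "  ", "BHZ")  # space-space location code
b = stream_id("GE", "UGM", "", "BHZ")    # empty-string location code
print(a == b, a)  # True GE.UGM..BHZ
```

Once every reader trims on input, both flavors compare as equal.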

> As most of you are users of SEED and StationXML metadata (at some
> level) and some of you have written code to parse these formats and
> manage the data returned by the DMC and other FDSN data centers, we
> are asking for your input regarding the potential solutions.
>
> Here are the options being considered for mapping an empty location
> ID in SEED to StationXML:
>
> 1) Set locationCode to two spaces.  While the DMC and users have
> been using this for a long while, it is not precisely the SEED value
> (but the mapping could be formalized).  Also, whitespace in
> attributes does have some theoretical challenges: the wonky rules for
> XML attributes related to whitespace handling require removal of
> spaces in some cases (we have never heard of problems though).
>
> 2) Set locationCode to an empty string.  This would match the strict
> value present in SEED, an empty identifier.

And would be easy to keep compatible with the two spaces.

This representation has also been widely used for a long time, incl.
at IRIS (see above).

> 3) Set locationCode to "--" (two dashes).  This avoids issues with
> whitespace in XML attribute values and avoids issues with an empty
> identifier.  Also, this matches the request mechanisms where "--" is
> accepted as a synonym for an empty location ID.

Let's not mix request mechanisms with the data format. Data formats are 
a holy grail whereas request mechanisms change more frequently.

Suppose we could retrieve full SEED using the web services. Even then it 
would be equally appropriate to use "--" on the request side. But there 
is no justification for breaking data format compatibility just for 
matching particular request mechanisms.

> All of these solutions are viable in that we can make them work in
> code, it is a matter of choosing one for future FDSN metadata, pick
> your poison so to speak.
>
> In my personal opinion, an empty location ID is an unfortunate quirk
> of SEED that we should rectify in StationXML.  An empty identifier
> can be confused for “unknown” if the programmer is not careful,
> which is semantically very different than “set to empty”.  The
> two-space strings that the DMC is currently using are also not ideal,
> they are hard for humans to read and potentially weird with XML
> rules.  The dashed location ID avoids these issues but requires the
> most change. I also think requiring all readers of StationXML to
> translate (e.g. remove padding) is a bad idea, the values in SEED
> should be uniquely mapped to values in StationXML.

I share your view that the empty location code is not optimal. However, 
the world is not perfect and the empty location code is a fact we have 
to live with and have been able to live with for decades. Seismologists 
have learned how to handle it. Existing software libraries make the 
empty location code as painless as possible. Technically it is a non-issue.

The solution to the empty location code is not to break a data format
incompatibly for no technical reason, merely for the sake of aesthetics.
Empty strings are represented in XML without problems, particularly if 
used in XML attributes. In fact, it is an advantage of a modern XML 
format that we don't need the padding spaces etc. any more.

Philip Crotwell [07/23/14 20:37]:
> Years ago we had full SEED. Then because of keeping metadata updated,
> we switched to a separation into dataless SEED + miniseed. Now,
> because of the complexities and limitations of dataless SEED, the
> future looks like StationXML + miniseed. I am all for this change,
> but how the location id is resolved really needs to address not just
> what do we do in StationXML, but what do we do in StationXML +
> miniseed.
>
> I also lean towards "--" for the simple reason that there are so many
> instances where I have been bitten by spaces or nulls. Even though I
> know about this, I still get caught. File names, urls, user gui
> displays, etc all have problems with spaces or nulls and as a
> practical matter it is harder to see something that isn't there than
>  something that is there. Furthermore, using null or space-space is
> really hard as a command line argument in the shell. That said, "--"
>  already means "long option name" in many *nix programs, so if we
> were starting from scratch, underscores like "__" might be a better
> choice. The SEED manual already lists underscore as a separate item
> in the flags section (p32), so maybe worth considering.

In all of the above cases it is the interfaces that have to deal with 
the empty location code. I agree that an empty string is not always easy 
to visualize, but we know how to deal with it. Nothing prevents us from 
using "--" or "__" in GUIs or external formats or input to the fdsnws's. 
I myself use "__" e.g. in pick lists for ease of visualization, 
awk/grep'ing etc.; but that has nothing to do with the XML or SEED 
representation. The same is true for the request formats; as long as the 
user knows how to explicitly specify an empty location code, it's fine.
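
A small sketch of keeping that substitution in the interface layer only
(Python; display_id and its placeholder parameter are hypothetical names,
and the stored location code stays an empty string):

```python
def display_id(net: str, sta: str, loc: str, cha: str,
               placeholder: str = "--") -> str:
    """Format a stream ID for humans, making an empty location visible."""
    shown_loc = loc if loc else placeholder
    return ".".join((net, sta, shown_loc, cha))


print(display_id("GE", "UGM", "", "BHZ"))                    # GE.UGM.--.BHZ
print(display_id("GE", "UGM", "", "BHZ", placeholder="__"))  # GE.UGM.__.BHZ
print(display_id("GE", "UGM", "00", "BHZ"))                  # GE.UGM.00.BHZ
```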

> But if option 3 is chosen, would there be any possibility of
> amending the SEED spec so that "--" is actually valid within the
> location id field, with the caveat that it is synonymous with
> space-space/null, but "--" is the preferred value?

This would mean that GE.UGM.--.BHZ and GE.UGM..BHZ are equivalent, in
fact identical, stream IDs. Technically this is feasible. But are the
downstream software repercussions, let alone the confusion among data
users, a price we are willing to pay? I don't think so.

> I realize that doing a global search and replace on a petabyte of
> miniseed data is probably not going to happen, but it would be
> really nice if whatever location id is in StationXML, it is exactly
> 2 characters and is the exact same 2 characters as in miniseed.

On the other hand, the use of XML is a chance to get rid of fixed-width
field values with padding. This may not be relevant today, but it might
become so in the future.

> Frankly the whole idea of making location ids "optional" was a real
> mistake IMHO. I am sure that anyone that has ever written code to
> deal with location ids has something that looks like: if (locid ==
> null or locid == "" or locid == "  " or locid == "--") then locid =
> "--" which is just a painfully stupid thing to have to do over and
> over and over again. Grumble grumble grumble.:(

But fortunately you do that only once and wrapping this into a library 
function is a no-brainer.
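
Such a function could look roughly like this (a Python sketch;
normalize_loc is a hypothetical name, and mapping every variant to the
empty string rather than to "--" is my assumption about the canonical
form):

```python
from typing import Optional

# Spellings that mean "empty location code" once padding is trimmed.
_EMPTY_ALIASES = {"", "--"}


def normalize_loc(locid: Optional[str]) -> str:
    """Map None, "", "  " and "--" to one canonical empty value."""
    if locid is None:
        return ""
    trimmed = locid.rstrip(" ")
    return "" if trimmed in _EMPTY_ALIASES else trimmed


for raw in (None, "", "  ", "--", "00"):
    print(repr(raw), "->", repr(normalize_loc(raw)))
```

Callers then compare normalized values only and never repeat the chain
of tests.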

On a side note, I am curious to know (technically) under what 
circumstances locid==null would evaluate to true, considering

<xs:attribute name="locationCode" type="xs:string" use="required"/>

from the xsd[2].
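
To illustrate what a parser would hand back, here is a small sketch using
Python's standard xml.etree.ElementTree on a simplified, namespace-free
fragment (an assumption for brevity, not real StationXML). An attribute
that is present but empty comes back as an empty string; None appears
only when the attribute is missing, which the schema line above does not
permit:

```python
import xml.etree.ElementTree as ET

# Simplified fragment; the third Channel would be schema-invalid.
doc = """<Station>
  <Channel code="BHZ" locationCode=""/>
  <Channel code="BHN" locationCode="  "/>
  <Channel code="BHE"/>
</Station>"""

root = ET.fromstring(doc)
codes = {ch.get("code"): ch.get("locationCode") for ch in root.iter("Channel")}

for channel, loc in codes.items():
    print(channel, repr(loc))
```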

> Lastly, as far as I can tell the SEED spec doesn't disallow
> null/empty station or channel codes, so addressing that at the same
> time might be wise.

I haven't come across any of those, but addressing them makes sense. Yet
I don't think we can or should prevent empty location codes. They are a
very common reality.

> My $0.02, please pick one string, and only one string, and use it
> everywhere.

If "only one string" is a requirement, it is probably the strongest 
argument against a change.

"Only one string" will only work without deviation from the current use 
of the SEED location code. We can't recode the archives, let alone the local
archives users have built for their work over the years. Well, 
technically it could be done, but I think we all agree that we don't 
want to, as this would have to involve not only (Mini)SEED waveform data 
but also meta data and parametric data. How about... QuakeML archives? 
Datalogger firmware? We can't change all of that and if we add e.g. "--" 
to the range of *possible* location codes, we still have to continue to 
"forever" support the other representations in order to be backward 
compatible.

Generally speaking, it is good to discuss future possibilities for 
channel naming conventions, not only with respect to the location code. 
But the naming should ideally be independent of the data formats used.
XML is a big step towards becoming less dependent on the limits imposed 
by SEED, but we are not going to get rid of SEED for many years to come.

Actually, what we are currently seeking to solve is a particular
incompatibility between FDSN StationXML documents produced by different
services. Technically that is much, *much* easier to achieve than the
introduction of a new and incompatible channel naming. I would welcome
an intensified discussion on the latter, but not in the context of the
current FDSN StationXML or web services.

It's actually quite strange that already now, early after the
introduction of FDSN StationXML, we are not only choking over minor
incompatibilities, but are discussing "solutions" to problems that
apparently no one had noticed existed before StationXML... Looks
like shooting at sparrows with cannons, IMO.

There used to be an IASPEI working group on station codes that even came
up with a new channel naming "standard"[3], which, however, doesn't seem 
to have gained much acceptance so far. Nevertheless this is the level at 
which changes to channel naming need to be discussed, even though the 
process may be frustratingly slow. But the impact of such a change is 
just too big to be decided ad hoc.

To summarize:

We will not find a future-proof channel naming convention quickly. 
Partial changes, especially if incompatible, should be absolutely avoided.

The particular problem we attempted (and still need) to solve in the
first place is a location code incompatibility caused by differing
strictness in adhering to the SEED specification. Not surprisingly I
prefer the empty-string representation for the empty location code. To
be pragmatic, I propose the following timeline:

* Accept that, at least for a transitional period, space-space and empty
location codes will coexist.

* During a transitional period, don't change the servers that now 
produce space-space location codes, as that would break compatibility 
with some clients. We want to keep compatibility rather than introducing 
new incompatibility.

* Instead update the clients to accept both space-space and empty 
location codes by trimming trailing spaces if present. This is a 
relatively minor change and IIRC this is on IRIS's agenda already, which 
is highly appreciated.

At this point in time, interoperability is restored, even without 
server-side changes. This is important as it may take quite some time 
for the users to actually upgrade their clients; but it doesn't hurt anyone.

* Finally, upgrade the servers where needed. The decision as to when to
upgrade the server side can be made once it is considered appropriate; 
there is absolutely no hurry from the client side.

The needed changes for the above proposal are very small compared to the 
huge changes that would be required at every level to implement a new 
channel naming convention. This may (and hopefully will) take place some 
time in the future, but it requires a lot of preparation and 
coordination. I am pretty sure that we will have a considerable number 
of beers in the meantime. ;)

Besides the beers, we should focus on finalizing the specification of 
FDSN StationXML. There are too many under-defined elements even in the 
xsd and the risk of serious incompatibilities is very high.

Cheers
Joachim

[1] http://www.iris.edu/bud_stuff/bud_dir/GE/UGM/UGM.GE..BHZ.2014.205
[2] http://www.fdsn.org/xml/station/fdsn-station-1.0.xsd
[3] http://www.isc.ac.uk/registries/download/IR_implementation.pdf