Tuesday, January 22, 2013

Do you trust GBIF or the revising author?

I was playing around with GBIF the other day (=procrastinating) and found an interesting case. I was looking to see what records exist in GBIF for the taxa I have revised. It turns out that only two institutions reliably put their records on GBIF [at least for the insects that I care about]: INBio and SEMC (the Entomology Division of the Biodiversity Institute). I searched for records of the neotropical genus Ocyolinus, a genus I revised a few years back. According to GBIF, there is a specimen in Texas!

Looking at the specific data:

The specimen has a barcode label of 72395 and a quick look at my paper (see below) revealed that the specimen is actually from Costa Rica, which makes sense for a taxon with neotropical distribution:

This is obviously a data entry error and I do not mean to pile dirt on my friends at SEMC. However, the problem is this: if I am writing a paper on the distributions of animals (see previous post), I will probably not check the revision of the genus and I will assume that the record is correct.
GBIF does have a Feedback button (and I used it), but I am wondering whether it would be wise to have some sort of control mechanism in place to catch such errors: e.g. do not allow records for terrestrial organisms in the middle of the ocean or, as in this case, flag a neotropical taxon that has a single nearctic record.
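
As a minimal sketch of that second kind of check, assuming each record is just a dictionary with `lat` and `lon` keys: the realm bounding boxes below are deliberately crude placeholders, and the names `REALM_BOXES`, `realm_of`, `flag_realm_outliers` and the 5% threshold are hypothetical choices made for illustration, not anything GBIF provides. The idea is only to flag, not reject, records that fall in a realm holding a tiny share of a taxon's records.

```python
from collections import Counter

# Hypothetical, very coarse boxes: (min_lat, max_lat, min_lon, max_lon).
# Real biogeographic realms are not rectangles; this is only a placeholder.
REALM_BOXES = {
    "Neotropical": (-56.0, 23.5, -118.0, -34.0),
    "Nearctic": (23.5, 84.0, -170.0, -50.0),
}


def realm_of(lat, lon):
    """Return the name of the first box containing the point, or None."""
    for realm, (min_lat, max_lat, min_lon, max_lon) in REALM_BOXES.items():
        # Half-open latitude range so the shared 23.5 boundary is unambiguous.
        if min_lat <= lat < max_lat and min_lon <= lon <= max_lon:
            return realm
    return None


def flag_realm_outliers(records, max_share=0.05):
    """Return records whose realm holds at most `max_share` of all records."""
    realms = [realm_of(r["lat"], r["lon"]) for r in records]
    counts = Counter(realms)
    total = len(records)
    return [
        record
        for record, realm in zip(records, realms)
        if counts[realm] / total <= max_share
    ]
```

With twenty made-up Costa Rican points and one made-up Texan point, only the Texan record would be returned for a human to inspect. A legitimate range extension or an invasive population would be flagged in exactly the same way, so the output is a list of suspects, not a list of errors.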


  1. [Reply from Tim Robertson, GBIF Secretariat, trobertson@gbif.org]

    Thank you for sharing this, and for using the feedback mechanism to alert the data publisher. While we have to rely on the data holder to correct the mistake, it is normally the case that action is taken.

    It is a challenge to provide an accurate data discovery system with the portal, as the quality of data shared through the network varies greatly. This includes all manner of issues and data modeling challenges, ranging from simple data entry issues and abbreviations that require interpretation right through to differing opinions (e.g. taxonomic opinion). The processing that we perform for the portal today includes basic validation of fields (e.g. ensuring a date is viable), scientific name interpretation (more than 12 million "verbatim" names have been observed) and verification that fields do not contradict each other (such as coordinates falling in a stated country [1]).

    This catches many issues, but does not catch the likes of what you highlight here. We don't have complete and accurate distributions for taxa with which to perform verification; ranges change over time, and GBIF share data covering a century; and there will be outliers, such as species invasions, that require the ranges to be updated. Therefore any verification could only be used to flag potentially suspicious cases. That said, distributions, or some other rule-based system, would be an area to explore in quality control routines to improve the service.

    Regarding the feedback mechanism, we anticipate offering an annotation system, allowing users to flag these issues on records directly. Through "peer review" of the annotation, action could be taken on the records to collectively clean the data. Annotated records could then be represented differently so users would be enticed to inspect the content further to ensure it is fit for their uses.

    I hope this helps provide some insights into the quality routines and future directions - it is an area GBIF are committed to improving. GBIF welcome contributions and guidance on all aspects of data sharing, so thanks again for taking the time to write this post.

    [1] http://gbif.blogspot.dk/2011/05/here-be-dragons-mapping-occurrence-data.html

  2. Thanks for taking the time to reply here, Tim. I appreciate your input and your clarifying how the data process works at the GBIF level. Regarding the feedback mechanism, I think the ability to flag records will be great. Probably much more efficient than the current tag-email system.

  3. Stelios-
    I appreciate your interest in/concern with the idea of data vetting (or filtering, or what have you) on the part of GBIF. I myself am concerned with the proliferation of multiple, unlinked, slightly different versions of data served for the same object (some specimen data captured and provided by the SEMC is also served by INBio, but is not necessarily identical or transparent), but that's a different issue.

    As the chagrined data provider in the above example, I have to say this is a simple case of "garbage in, garbage out": the data we provided was just flat out wrong. It may have been a simple data entry or editing error (though conceiving of a plausible scenario/mechanism in this case is difficult) or it may be an artifact of migrating inhumanly large amounts of data across several software platforms.

    Regardless, with a database containing nearly 1,000,000 records accumulated by dozens of people over the last 15 years, a certain amount of error is to be expected. Of course we do our best to provide "clean" data, but as with so many other things, caveat emptor.

    cheers, zack

    1. Zack - I will be the first to acknowledge mistakes in data entry: the above example with terrestrial organisms in the middle of the ocean was not completely random, as I have placed several California beetles there by forgetting the minus sign in front of the coordinates.

      Perhaps I am wrong but my philosophy is to make data available, even if there is a percentage of error in these data. In that respect I am grateful that SEMC is putting data online whereas other comparable institutions do not.

      The issue of multiple slightly different entries for the same data is problematic, but it can be resolved when we all agree on unique specimen identifiers. Immediately after we discover the rainbow-producing unicorns. I think this is not GBIF's fault but really has to do with how the systematic community operates.
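
The missing minus sign mentioned above is the kind of slip a simple rule can catch. The sketch below assumes the record states a region and that a bounding box for that region is available; `CALIFORNIA_BOX` is a rough, illustrative rectangle rather than an authoritative boundary, and `in_box` and `sign_flip_candidates` are hypothetical helper names. If the point falls outside the box, the function tries the three sign flips and reports any that land inside.

```python
# Rough, illustrative box for California: (min_lat, max_lat, min_lon, max_lon).
CALIFORNIA_BOX = (32.5, 42.0, -124.5, -114.1)


def in_box(lat, lon, box):
    """True if the point lies inside the (inclusive) bounding box."""
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def sign_flip_candidates(lat, lon, box):
    """Suggest sign-flipped coordinates that fall inside the expected box.

    Returns an empty list when the point is already inside the box or when
    no sign flip places it there.
    """
    if in_box(lat, lon, box):
        return []
    variants = [(lat, -lon), (-lat, lon), (-lat, -lon)]
    return [variant for variant in variants if in_box(*variant, box)]
```

For a made-up point entered as latitude 36.6, longitude 121.9, the only suggestion is the same point with a negative longitude. A suggestion like this is a hint for the data holder, not an automatic correction.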

  4. Regarding rainbow producing unicorns, an outsider might struggle to realise that "KU:SEMC:72395" and "SM0072395 (1 SEMC)" are the same thing. The lack of a consistent way to refer to the same specimen makes it hard to link all this disparate information together. In an ideal world GBIF would "know" that your paper cites this specimen, so users could go from the GBIF page to the paper and evaluate the evidence for themselves.

    1. Right, and I share some of the blame here as the revising author, but I was simply writing down the human-readable information on the barcode. Not sure why the 0s are gone in front of the number at GBIF or why the institution label is different.
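
To make the identifier problem concrete, here is a minimal sketch of a heuristic that maps the two strings quoted above to the same key. It rests on assumptions that will not hold in general: that the collection code is one of a known set (`KNOWN_COLLECTION_CODES` is a hypothetical list containing only SEMC), that the catalog number is the longest run of digits in the label, and that leading zeros carry no meaning. The function name `specimen_key` is likewise made up for illustration.

```python
import re

# Assumption: the set of collection codes we are willing to recognise.
KNOWN_COLLECTION_CODES = {"SEMC"}


def specimen_key(label, known_codes=KNOWN_COLLECTION_CODES):
    """Reduce a specimen label to (collection_code, catalog_number), or None.

    Heuristic only: picks the first known collection code found in the label
    and the longest run of digits, with leading zeros dropped.
    """
    # Split the label into alphabetic and numeric runs, ignoring punctuation.
    tokens = re.findall(r"[A-Za-z]+|\d+", label)
    codes = [token.upper() for token in tokens if token.upper() in known_codes]
    numbers = [token for token in tokens if token.isdigit()]
    if not codes or not numbers:
        return None
    catalog_number = int(max(numbers, key=len))  # int() strips leading zeros
    return (codes[0], catalog_number)
```

Both `"KU:SEMC:72395"` and `"SM0072395 (1 SEMC)"` reduce to `("SEMC", 72395)` under these assumptions. A heuristic like this can also produce false matches across collections that reuse numbers, which is one reason agreed unique identifiers would be the better fix.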