Monday, February 2, 2009

Summary of MDG Session, 1-29-09

Article read: Chris Freeland, Martin Kalfatovic, Jay Paige, and Marc Crozier. (December 2008). "Geocoding LCSH in the Biodiversity Heritage Library." The Code4Lib Journal 5. http://journal.code4lib.org/articles/52

As with many MDG sessions, this one began with a discussion of unfamiliar terminology in the article read. This article contained a few technical terms that members were interested in hearing more about, likely because the audience for the Code4Lib Journal is primarily programmers rather than catalogers or metadata specialists. One participant wondered where the term "folksonomy" came from. Nobody had an answer, although some thought it had been around a decade or so, and Clay Shirky's "Ontology is Overrated" was mentioned. (The Wikipedia article on folksonomy credits the term to Thomas Vander Wal in the early 2000s.)

The substance of this month's discussion began by addressing the question: Would you catalog differently if you knew the data were to be used in this way? Participants noted that the burden is on the cataloger to verify and provide information that isn't immediately obvious from the resource itself. The limits of MARC/AACR2 practice (missing geographic headings in some cases) described in the article are very real: if the terms aren't there, you can't build this type of service. If you know the data is going to be used in this way, then you make more of an effort to provide it. Participants repeated an often-heard comment about MARC cataloging: populating the fixed fields takes a great deal of effort, but few systems use them. This discourages catalogers from populating them, which discourages system designers from using them... The current environment doesn't make it easy to justify doing the work to create the structured data that's really needed to provide advanced services.

The conversation then turned to where geographic data to support a service like the one described in this article would be in a MARC record, if those records were created with this use in mind. One important point to note is that the level of specificity differs between the coded geographic values (043, country of publication in fixed fields) and what is present in LCSH subdivisions. The former are generally at the continent/country/state level, while the latter can be much more specific. Discussion of these fields led participants to note that they represent different things: the place something is published is of interest in different circumstances than the place something is about. This represents one area (of many) where system designers need to have an in-depth understanding of the data. Building a resource with more consistent geographic data (say, always at the state level) would alleviate some of the challenges described in this article, but would leave out users who are interested in more granular information than an implementation like this could provide.
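
To make the difference between coded values and subject subdivisions concrete, here is a minimal sketch in Python. It assumes a simplified, hypothetical in-memory representation of a MARC record (a list of tags with subfield pairs) rather than any particular MARC library, and the sample record is made up for illustration. It pulls geographic area codes from 043 $a and place names from 6XX $z subdivisions and 651 $a.

```python
# Hypothetical, simplified record: (tag, [(subfield_code, value), ...]).
# The sample data is invented for illustration only.
record = [
    ("043", [("a", "n-us-in")]),
    ("650", [("a", "Botany"), ("z", "Indiana"), ("z", "Brown County")]),
    ("651", [("a", "Wabash River Valley"), ("x", "Description and travel")]),
]


def coded_areas(record):
    """Return 043 $a geographic area codes (coarse: continent/country/state)."""
    return [
        value
        for tag, subfields in record
        if tag == "043"
        for code, value in subfields
        if code == "a"
    ]


def subject_places(record):
    """Return places named in subject headings: any 6XX $z, plus 651 $a."""
    places = []
    for tag, subfields in record:
        if not tag.startswith("6"):
            continue
        for code, value in subfields:
            if code == "z" or (tag == "651" and code == "a"):
                places.append(value)
    return places


print(coded_areas(record))
print(subject_places(record))
```

In this sample the coded value stops at the state level, while the subject headings reach down to a county and a river valley, which is the mismatch in specificity that participants described.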

Some participants advocated that, to promote services like the one described in this article, one should use a vocabulary designed specifically for geographic access, such as the Getty Thesaurus of Geographic Names or GeoNet. One advantage of these types of vocabularies is that they are based on "official" data of some sort (for example, from the US government or the UN), whereas LCSH is based on literary warrant. LCSH therefore might not match up well with the current and detailed places one would ideally want for a resource map interface. Similarly, AACR2 treats some objects with geographic features (for example, buildings) as corporate bodies, which are subject to different rules for cross-references and the like.

Participants noted that there have been significant successes in geographic and user-friendly access in the MARC/AACR2/LCSH stack, however. MARC records for newspapers have a 752 field with semi-structured data listing country-state-county-city. The terms used in this field come from the authority file. The 752 field represents an early example of a field existing in response to user discovery needs. Would it be possible for us to generate this data automatically for other types of resources?
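
As a rough illustration of why the 752 structure is friendly to discovery systems, the sketch below (Python, with made-up sample data and hypothetical function names) maps the 752 subfields $a, $b, $c, and $d to country, state, county, and city, and then builds an access point at every level of the hierarchy. It is one possible way to use the field, not a description of how any existing system does it.

```python
# 752 subfields used here: $a country, $b state, $c county, $d city.
LEVELS = {"a": "country", "b": "state", "c": "county", "d": "city"}
ORDER = ("country", "state", "county", "city")


def parse_752(subfields):
    """Turn a list of (code, value) pairs into a level -> name dictionary."""
    return {LEVELS[code]: value for code, value in subfields if code in LEVELS}


def rollups(place):
    """Return access points from broadest to most specific."""
    path, points = [], []
    for level in ORDER:
        if level in place:
            path.append(place[level])
            points.append(" -- ".join(path))
    return points


# Invented sample field for illustration.
field_752 = [
    ("a", "United States"),
    ("b", "Indiana"),
    ("c", "Monroe"),
    ("d", "Bloomington"),
]

place = parse_752(field_752)
for point in rollups(place):
    print(point)
```

Because each level is in its own subfield, a system can offer browsing at the state level and still lead users down to the city, without parsing free text.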

The conversation at this point moved to user behavior in general. A participant noted that at the recent PCC meeting at ALA Midwinter, Karen Calhoun gave a presentation describing OCLC's recent research on user behavior. Their conclusion is that delivery is becoming more important than discovery. Does this mean libraries should start changing their priorities?

Different types of discovery were then briefly discussed, noting that one wants different things at different times. The subject search serves a different purpose than the keyword search. Especially for scholars, the former is useful for introductory and overview work. When delving deeper, looking for the obscure reference that will serve as a key piece of original research, the latter will be more useful. The "20% rule" for subject cataloging is one reason for this. Are tag clouds of subject headings therefore useful? Participants thought they were for some types of discovery. Other possibilities would be clouds of Table of Contents data and full text. All would have different uses, and for some the cloud presentation might be more effective than others.

A significant proportion of the discussion in the second half of the session revolved around ways to integrate different types of geographic access. The first topic on this theme was one of granularity: how specific should the geographic heading be? Why shouldn't we provide access to a famous neighborhood in a big city? Using the structure of a robust geographic vocabulary as part of a discovery system would help with this issue.

The changing of place names and agreed-upon boundaries over time was raised next. A place with a single name might have different boundaries over time. Political change is ongoing, and one place does not simply turn into another; maps are constantly reorganizing. Curated vocabularies such as LCSH and TGN take time to respond to these changes. Is it necessary to update older records when place names change? Participants settled on the standard answer: it depends. For resources such as biological specimens, current place names are likely to be more useful, to assist the researcher with understanding the relationships between them over time. For works about specific places, the place as it existed during the time described is more important.
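
One way to support both answers to the "it depends" question is to store names with date ranges, so a system can show either the name in use during the period a work describes or the current name. The Python sketch below is a minimal illustration under that assumption; the data structure and function names are hypothetical, and it ignores boundary changes entirely.

```python
# Each entry: (first_year, year_name_changed_or_None, name).
# Ranges are half-open: a name applies from first_year up to, but not
# including, the year it changed.
NAMES = [
    (1703, 1914, "Saint Petersburg"),
    (1914, 1924, "Petrograd"),
    (1924, 1991, "Leningrad"),
    (1991, None, "Saint Petersburg"),
]


def name_in(year):
    """Return the name in use in the given year, or None if not covered."""
    for start, end, name in NAMES:
        if start <= year and (end is None or year < end):
            return name
    return None


def current_name():
    """Return the most recent name in the list."""
    return NAMES[-1][2]


print(name_in(1920))
print(name_in(1950))
print(current_name())
```

A specimen database might display `current_name()`, while a catalog of works about the city in the 1920s might prefer `name_in(1920)`.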

The next issue raised in the session was that geographic places don't exist in a strict hierarchy. National parks, rivers, and lakes, for example, aren’t within single states. LCSH headings exist for these, and for rivers there can be a separate heading for each state the river crosses. Participants were not certain if cross-references existed between the river name and the headings for all states it crossed, which would create a machine-readable link between the two.

It was at this point that GIS technology as a solution came up. By defining everything as a polygon rather than a label with some classification of type ("state," "river," "park"), geometry can be used to retrieve places relevant to a specific point. Effectively connecting all of these overlapping but not exclusive things in traditional library authority files would be a challenge. Many other geographic-type units could be used for retrieval, including zip codes, area codes, and congressional districts. These change over time as well, further complicating the situation.
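
The polygon idea can be sketched in a few lines of Python. The example below uses the standard ray-casting test for whether a point lies inside a polygon; the place names and coordinates are invented, with a "River Park" deliberately straddling two hypothetical states. A real system would more likely rely on a GIS library or spatial database, so treat this only as an illustration of the principle.

```python
def point_in_polygon(point, polygon):
    """Ray casting: count polygon edges crossed by a ray going right from point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Invented places on an arbitrary grid; "River Park" overlaps both states.
PLACES = {
    "State A": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "State B": [(10, 0), (20, 0), (20, 10), (10, 10)],
    "River Park": [(8, 4), (12, 4), (12, 6), (8, 6)],
}


def places_at(point):
    """Return the names of all places whose polygon contains the point."""
    return sorted(
        name for name, polygon in PLACES.items() if point_in_polygon(point, polygon)
    )


print(places_at((9, 5)))
print(places_at((11, 5)))
print(places_at((15, 8)))
```

No hierarchy is needed here: the park is found from points in either state because retrieval depends only on geometry, not on which broader heading the park was filed under.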

The final issue raised in connection with geographic access was the notion of places being referred to with different names in different languages. Libraries are increasingly adding cross-references from multiple scripts and languages into authority files. This is a good thing, certainly. The lack of a 1:1 mapping from historic places makes this difficult. Even for the residents of a place, the dominant language changes over time, and with it the "official" name.

The Virtual International Authority File is attempting to address this issue by linking together names for the same places from multiple national authority files. It's a bit unclear what the status of this project is, though. LC and OCLC consistently report progress but give no clear indication of when it's going to become a production system.