Thursday, November 5, 2009

Summary of MDG Session, 10-15-09

Article read:


This month's Metadata Discussion Group began with a discussion of the tone of the blog post and article, and of the rhetoric in the community at large around the Google Book Search project. Participants expressed support for the idea that discussion needs to be reasoned and civil - neither Google nor libraries are all wrong or all right. It is more important to fix identified problems than to point fingers. One participant noted that the difference in tone between the blog post and the slightly later Chronicle article was telling. Nunberg's interest is clearly for the scholars, but this is more obvious in the Chronicle article than in the blog post. The Chronicle article immediately sets up a "this service is bad" tone by listing Elsevier as the first possible future owner! The Chronicle version doesn't even entertain the possibility of Google keeping the service as its own.


Participants quickly noted that the findings of these articles underscore what we already know: there's a lot of bad cataloging out there! A pre-1900 book might have 20 records in OCLC. The articles suggest the full OCLC or LC catalogs would help this service, and participants noted GBS does in fact have the full OCLC database. But are those really better than what GBS is actually using? An "authoritative" source of metadata is a library-centric view. There is no perfect catalog. There isn't a "better catalog" for Google to get that would easily solve the problems found here. IU itself contributes to the problem: we send unlinked NOTIS item records with our shipments to Google. One participant noted that the "results of this are catastrophic," but we can't feasibly do much better. We're flagging these items to handle on return, but that doesn't help Google.


Thinking about how to solve these problems led to a theme common in the Metadata Discussion Group sessions - what if we were to open up metadata editing to users? Wikipedia certainly isn't consistent - would that approach work here? A participant noted that OCLC itself is a cooperative venture and there are many inconsistencies there. Institutions futz with records locally and don't send them back to OCLC. CONSER had a history of record edit wars, and catalogers decided they just have to grit their teeth and deal with it.


Regarding the date-accuracy issues that received a great deal of space in these articles, a participant noted that expectations for these features in GBS are exceedingly unrealistic. As one blog commenter posited, a user can't assume all search results are relevant - one has to evaluate the results from any information resource oneself.


The discussion then turned to GBS' utility as a source for language usage. A participant noted that the traditional way to learn about when, say, "felicity" changed to "happiness," is to check the OED. But how does the person who wrote the OED entry know? Was it a manual process before, and could this change with GBS? Scholars haven’t LOST anything with the advent of GBS - it's just an additional tool for them.

Participants then noted that scholars aren't the only or perhaps even the primary audience for GBS. But should they be? A great deal of the content (though not all - some comes from publishers too) comes from academic libraries who have built their collections primarily in support of scholarly activities. Shouldn't library partnerships come with some sort of responsibility on Google's part to pay attention to scholarly needs? For IU, the CIC, and other academic libraries, HathiTrust is attempting to fulfill this role, but is that enough?


The next question the group considered was "Is GBS the 'last library'?" The proposed GBS settlement might stifle competition. However, libraries themselves haven’t shown we can really compete in this area. Enhanced cooperation seems to be the only way we might play a realistic role. Participants wondered whether the monopoly that seems to be emerging is the result of Google pushing others out or a lack of interest by potential competitors. Libraries have been wanting to enter into this area but the technology wasn’t there, then we didn’t have the resources. GBS is at an entirely different scale than libraries can realistically achieve. We’re struggling at IU with how to deal with only 6000 Google rejects.


Discussion then turned to some of the statistics presented by the Google engineer in a comment on the blog post, including the claim of "millions" of problems and a BISAC accuracy rate of 90%. Participants guessed we have less than 10% howlers for subject headings in our catalogs, but there certainly are lots of them in there. The considerable redundancy in a MARC record provides more text that could be used to avoid this kind of obvious error. We wondered if Google is using any of this redundant information effectively.


The topic then turned back to whether Google Book Search should spend more effort meeting scholarly needs. What should they do differently to support this kind of user better? First, they should probably not rely on just a single classification scheme. They don't necessarily need to stop using BISAC, but they could also use alternatives - that's what Google is about, more information! They're definitely getting LCSH from MARC records, despite LCSH's limitations. The LCC class number could be used to derive a "primary" subject, and it potentially supplies words that don't appear elsewhere in the record. Participants noted that as the GBS database grows, each individual subject heading will start returning larger and larger result sets.
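As a rough illustration of that last idea, the sketch below maps the first letter of an LCC call number (as it might appear in a MARC 050 $a) to a broad subject label. This is only a sketch of the approach discussed, not anything Google does: the class-to-label table is a tiny, hand-picked subset for demonstration, not a real LCC outline.

```python
# Illustrative sketch: derive a broad "primary" subject from an LCC call
# number. The class-to-label table is a small demonstration subset only.
import re

LCC_BROAD_SUBJECTS = {
    "B": "Philosophy, Psychology, Religion",
    "D": "World History",
    "M": "Music",
    "N": "Fine Arts",
    "P": "Language and Literature",
    "Q": "Science",
    "T": "Technology",
    "Z": "Bibliography, Library Science",
}

def primary_subject_from_lcc(call_number):
    """Map an LCC call number such as 'PR2823 .A2 1998' to a broad subject."""
    match = re.match(r"([A-Z])[A-Z]?\d", call_number.strip().upper())
    if not match:
        return None
    return LCC_BROAD_SUBJECTS.get(match.group(1))

print(primary_subject_from_lcc("PR2823 .A2 1998"))  # Language and Literature
print(primary_subject_from_lcc("QH541 .B57"))       # Science
```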


The session closed with some musings on how the Google and library communities might better learn from one another. The notion of constructive conversation rather than disdain was raised again. Then participants noted that the GBS engineer commenting on the blog post invited comment. Individuals can take advantage of this invitation, and IU as a GBS partner can provide information and start conversations at yet another level.

Thursday, September 24, 2009

Summary of MDG Session, 9-17-09

Article Read: Schaffner, Jennifer. (May 2009). "The Metadata is the Interface: Better Description for Better Discovery of Archives and Special Collections." Report produced by OCLC Research. Published online at: http://www.oclc.org/programs/publications/reports/2009-06.pdf.


An online, user-editable resource list accompanying this report can be found at https://oclcresearch.webjunction.org/mdidresourcelist.


While questions regarding terminology in Metadata Discussion Group sessions often focus on technological terms, this month they focused on terms from the archives sphere not commonly used in libraries. RLIN AMC was explained as an RLIN database of archives-format MARC records (before format integration), ISAD as the archival parallel to ISBD, and fonds as sets of materials organically created by an individual, family, or organization during the course of its regular work.

The group began the primary discussion by considering the third sentence in the report's Introduction, "These days we are writing finding aids and cataloging collections largely to be discovered by search engines." Participants wondered if this statement was accurate, and if so what it meant for our descriptive practices. The first reaction expressed was "So what?" OCLC records are exposed to Google through WorldCat.org - does this mean we're already starting to recognize the importance of search engine exposure? Another participant wondered if this statement were true for all classes of users - we certainly have many different types, and presumably the studies cited in the report refer to different groups as well. Different types of users need different types of discovery tools. Regardless, there is a recognition that recent activities reflect a big paradigm shift for special collections – they're no longer "elite" collections reserved for serious researchers bearing letters of recommendation. In wondering if our descriptive practices need to change to reflect this new user base and new discovery environments, participants noted that there are ongoing efforts to pull more out of library- and archives-generated metadata, including structured vocabularies such as LCSH.

The discussion then turned to the report's presentation of users' interest in the "aboutness" of resources. How do we go about supporting this? If we digitize everything will that help? For textual records, relevancy ranking could definitely have an impact. But we can’t have it both ways – getting some level of description out quickly and describing things robustly seem to be antithetical. Can we do this in two phases – first get it out there, then let the scholar figure out what it’s about? Do archivists and catalogers have the background knowledge to do the “aboutness” cataloging?

User-supplied metadata could certainly be part of this solution. At SAA last month, there was a session on Web 2.0 where one presenting repository touted the importance of user-supplied metadata for some of their materials. The repository reported that the user contributions needed some level of vetting, but overall they were useful. It was noted that just scanning is not enough, though – not all resources are textual, and those that are may be handwritten or in languages other than English, both of which pose challenges for automated transcription (OCR).

The group then wondered what other factors could be used in relevancy ranking algorithms, which libraries are notoriously suspicious of. Participants found intriguing the idea on p. 8 of the report that higher levels in a multi-level description be weighted more heavily. It was noted that perhaps the most common factors for relevance ranking are those that libraries don't traditionally collect - the number of times something is cited, checked out, or clicked on in a search result set. Relationships between texts in print are not as robust as those on the Web, and this might be evidenced by the fact that Google Book Search ranking doesn't seem to be as effective as the Google search engine's ranking. Personal names, place names, and events might be weighted more heavily, as this report suggests those things are of primary interest to users. We could also leverage our existing controlled vocabularies by weighting terms from them more heavily than terms that are not, by "exploding" queries in full-text corpora to also include synonyms, and by mapping users' search terms to the preferred terms in systems whose items are cataloged with controlled vocabularies. Participants debated the degree to which the system should suggest alternatives vs. making changes to queries and telling the user about it afterwards.
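The toy sketch below works through the last two ideas: it expands a query with synonyms drawn from a small, invented cross-reference table, then weights matches in controlled-vocabulary fields more heavily than matches in transcribed full text. The field names, weights, and synonym table are all illustrative assumptions, not anything from the report.

```python
# Minimal sketch: synonym expansion plus field weighting for ranking.
SYNONYMS = {  # illustrative stand-in for thesaurus cross-references
    "felicity": ["happiness"],
    "cars": ["automobiles"],
}

FIELD_WEIGHTS = {"subjects": 3.0, "title": 2.0, "full_text": 1.0}

def expand(query_terms):
    """Add synonyms from the cross-reference table to the query terms."""
    expanded = set(query_terms)
    for term in query_terms:
        expanded.update(SYNONYMS.get(term, []))
    return expanded

def score(record, query_terms):
    """Count term matches per field, weighted by the field's importance."""
    terms = expand([t.lower() for t in query_terms])
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        text = " ".join(record.get(field, [])).lower()
        total += weight * sum(1 for t in terms if t in text)
    return total

record = {
    "title": ["Essays on happiness"],
    "subjects": ["Happiness"],                 # controlled term
    "full_text": ["...felicity of the age..."],
}
print(score(record, ["felicity"]))  # 6.0: synonym hit in weighted fields
```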

The discussion then turned to a frequent topic in "future of libraries" conversations today - getting our resources out to where the users are. Scholars in general make reasonable use of specialized portals, but not everyone knows how to do that. Can we "train" our users to go to IU resources if they're in the IU community? Many present think this approach is nearly hopeless. Could we guide users to appropriate online resources based on their status? Some participants noted that personalization efforts haven't been all that effective. We can't box people into specific disciplines – research is increasingly interdisciplinary. Even if users don't log in, we can capture their search strings and potentially use this data. We could count how many times something was looked at and use this in relevance ranking. This system isn't perfect, of course – what's on the first page of results naturally gets clicked more. A click, or a checkout, isn't necessarily a positive review, though - could we capture negative reviews? We certainly would benefit from knowing more about how our resources are used. How extensive/serious is each use? Were things actually read? Could we put up a pop-up survey on our web site? Users can write reviews in WorldCat Local - should we do this too, or point people to those reviews? Participants noted there is still a role for the librarian/archivist mediator, helping users to understand what tools are available, then letting them use these tools on their own. When we don't have "aboutness" in our data, users can miss things, and the much maligned "omniscient archivist" can fill in the gaps.

The session closed with a discussion of the comprehensiveness issue mentioned in the report. If users don't trust our resources when they believe them to be incomplete, what do we do? The quickest answer is "Never admit it!" No resource is ever truly comprehensive. Libraries certainly have put a positive spin on retroconversion projects, calling them "done" when large pockets of material are still unaddressed.

Saturday, August 8, 2009

Summary of MDG Session, 5-28-09

Article discussed: Smith-Yoshimura, Karen. 2009. "Networking Names." Report produced by OCLC Research.

Terminology issues discussed this month included "cooperative identities hub" (is this OCLC's term? Yes) and API.

The meat of the session began with a discussion of the statement in the report that a preferred form of a name depends on context. Is this a switch for the library community? National authority files tend not to do it this way, but merging international files raises these same issues – they might transliterate Cyrillic differently for example. The VIAF project is having to deal with this issue. The group believes Canada, where this issue likely is raised a great deal due to its bilingual nature, just picks one form and goes with it.

Context here could mean many things: 1) show a different form in different circumstances, 2) include contextual information about the person in the record, etc. For #2, work is going on right now (or at least in the planning stages) to minimize how much language-specific content goes into a record. One could then code each field by the language used in the citation. Library practice includes vernacular forms in 400 fields now – these could then be primary in a catalog in another language. But the coding doesn't yet distinguish which is really a cross-reference and which would be preferred in some other language. So this might not be as useful as it would seem at first.

To achieve the first interpretation of flexible context, a 400 field in an authority record can no longer mean "don't use this one, use the other one." The purposes of the authority file now are for catalogers to justify headings and for systems to automatically map cross-references. Displaying a name form based on context is definitely a new use case for these authority files.
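A tiny sketch of what this new use case might look like: one identity with several language-tagged name forms, where the display form is chosen by the catalog's context rather than fixed by a single preferred heading with "do not use" references. The data structure and example forms are invented for illustration.

```python
# Minimal sketch of context-dependent display of a name, assuming a simple
# identity record with language-tagged forms (invented data).
IDENTITY = {
    "id": "example-0001",
    "forms": [
        {"form": "Tolstoy, Leo, 1828-1910", "lang": "en"},
        {"form": "Толстой, Лев Николаевич, 1828-1910", "lang": "ru"},
        {"form": "Tolstoï, Léon, 1828-1910", "lang": "fr"},
    ],
}

def display_form(identity, catalog_lang, fallback="en"):
    """Pick the form matching the catalog's language, else fall back."""
    by_lang = {f["lang"]: f["form"] for f in identity["forms"]}
    return by_lang.get(catalog_lang, by_lang.get(fallback))

print(display_form(IDENTITY, "ru"))  # Cyrillic form in a Russian catalog
print(display_form(IDENTITY, "de"))  # falls back to the English form
```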

Participants wondered, if we add more information to our authority files to make them useful for other purposes, how do we ensure they still fulfill their primary purpose? Should our authority records become biographies? The group reached basic consensus that adding new material to these records won't substantively take away from the current disambiguation function.

The group then turned to privacy issues raised by the expanding functions of authority files. One individual noted that in the past, researchers went to reference books to find information on people. Has this information moved into the public sphere? An author's birth date is often on the book jacket. Notable people are in Wikipedia. The campus phone book is not private information. In the archival community, context is everything. Overall, the group felt we didn't need to worry too much about the privacy issue – for the most part functionality trumps the privacy concerns. We still need to be careful, but it looks like we're watching an evolution of what we think of as privacy. We no longer equate privacy with information that is public but hard to get to. Privacy and access control by obscurity is no longer a viable practice. One solution would be to keep some data private for some period of time. Some authors don't want to give a birth date or middle initial for privacy reasons, and might respond better to a situation where this is stored but not openly public. Participants noted that this additional data is generally not needed for justification of headings or cross-references. But with expanded functionality, we'll need expanded data.

One participant wondered almost rhetorically if the authority file should be a list of your works, or also a list of your DUIs? In the archival community especially, the latter helps understand the person. How far should we go?

The Institutional Repository use case was the next topic of discussion. When gathering faculty publications, it would help to expand the scope of the authority file at the national level. But at the local level, many institutions are already struggling with these issues. Do A&I firms do name collocation? Participants don't think so.

Participants then wondered about the implications of opening up services to contributions from non-catalogers. Some felt we needed to just do it. Others thought opening to humans was a good idea but buggy machine processes could cause havoc. Even OCLC has a great deal of trouble with batch processing (duplicate detection, etc.), and they're better at this than anyone else in our sphere. For human edits, the same issues apply as with Wikipedia, but our systems don't get as many eyes. What is the right model for vetting and access control? Who is an authorized user?

Participants believed we need to keep the identities hub separate from the main name authority file for a while to work out issues before expanding the scope of the authority file significantly. The proposed discussion model in the report (p. 7) will help with the vandalism issue. The proposal flips the authority file model on its head, with lots of people adding data rather than just a few highly trained individuals. A participant wondered if the NAF in the end becomes the identities hub. Maybe the NAF feeds the identities hub instead.

Discussion then moved to the possibility of expanding this model to other things beyond names. Geographic places might benefit, but probably not subjects - this process contradicts the very idea of a controlled vocabulary. One participant noted a hub model could be used to document current linguistic practice with regards to subjects.

The session concluded with participants noting that authority control is the highest value activity catalogers do. The data that’s created by this process is the most useful of our data beyond libraries. We need to coordinate work and not duplicate effort.

Summary of MDG Session, 4-23-09

Article discussed: Nellhaus, Tobin. "XML, TEI, and Digital Libraries in the Humanities." portal: Libraries and the Academy, Vol. 1, No. 3 (2001), pp. 257-277.

The conversation began by addressing terminology issues, as is typical of our Metadata Discussion Group sessions. "Stylesheet" and "boilerplate" were among the unfamiliar terms. One participant noted “expanded linking” is like the “paperless society” - a state much discussed but never remotely achieved.

Discussion of the points in the article began with a participant noting that the TEI is a set of guidelines rather than an official "standard." Is this OK? Can we feel safe using it? The group believed that if there is no "standard" for something, then adopting guidelines is OK, with the idea that something is better than nothing. Is there a standard that would compete with TEI? DocBook is the only real other option, and it hasn't been well adopted in the cultural heritage community. Participants wondered if the library community should push TEI towards standardization.

An interesting question then arose: do the TEI's roots in the humanities make it less useful for other types of material? The problems with drama described in this article would extend to other formats too. What about music? How much should TEI expand into this and other areas?

Discussion at this point moved to how to implement TEI locally. Participants noted that local guidelines are necessary, and should be influenced by other projects. Having a standard or common best practices is powerful, but that still leaves lots of room for local interpretation. Local practice is a potential barrier to interoperability - for example, a display stylesheet won't work any more if you start using tags that aren't in the stylesheet. Local implementations have to plan ahead of time for how the TEI will be used. In the library community, we create different levels of cataloging – encoding could follow the same model. Participants noted that we should do user studies to guide our local implementations.

The group performed an interesting thought experiment examining the many different ways TEI could be implemented, considering Romeo & Juliet. Begin with a version originally in print. Then someone typed it into Project Gutenberg so it was on the Web. Then someone figured out they needed scene markers so someone had to go back and encode for that. New uses mean we need new encoding. How do we balance adding more value to core stuff rather than doing new stuff? A participant noted that this is not a new problem - metadata has always been dynamic. The TEI tags for very detailed work are there, which makes it very tempting to do more encoding than a project specifies. Take the case of IU presidents' speeches. Do these need TEI markup or is full-text searching enough? It would be fascinating and fun to pull together all sorts of materials – primary, secondary, sound recordings of his speeches. But where is this type of treatment in our overall priorities?

A participant asked the group to step back and consider what TEI can do that full-text searching can't. Some answers posed by the group were collocation and disambiguation of names, date searching for letters, and pulling out and displaying just the stage directions from a dramatic text.
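As a small illustration of that last capability, the sketch below pulls the stage directions out of a fragment of TEI-encoded drama using only the standard library. The markup fragment is invented for the example, though it uses standard TEI elements (<sp>, <speaker>, <l>, <stage>) and the TEI namespace.

```python
# Extract <stage> elements from a TEI fragment - something plain full-text
# search cannot isolate.
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

sample = f"""
<div xmlns="{TEI_NS}" type="scene" n="2">
  <stage>Enter Romeo.</stage>
  <sp><speaker>Romeo</speaker>
    <l>He jests at scars that never felt a wound.</l>
  </sp>
  <stage>Juliet appears above at a window.</stage>
  <sp><speaker>Romeo</speaker>
    <l>But, soft! what light through yonder window breaks?</l>
  </sp>
</div>
"""

root = ET.fromstring(sample)
for stage in root.iter(f"{{{TEI_NS}}}stage"):
    print(stage.text)
# Enter Romeo.
# Juliet appears above at a window.
```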

We then returned to the notion of drama. It's hard to deal with plays both as literature and as performance. Is this like treating sheet music bibliographically vs. archivally? Here the cover could be a work of art, the music notation marked up in MusicXML, and the text marked up in TEI. Nobody knew of an implementation that deals with this multiplicity of perspectives well. Something text-based has trouble dealing with time, for example. Participants noted that TEI is starting to deal with this issue now, but it's certainly a difficult problem.

A participant wondered what would happen if we were to just put images (or dirty OCR for typewritten originals, certainly not all of our stuff) up and mark it up later? This would be the “more product less process” approach currently in favor in the archival world. It would also be in keeping with current efforts to focus on unique materials and special collections rather than mass-produced and widely held materials.

Participants wondered if Google Book Search and HathiTrust do TEI markup. Nobody in the room knew for absolute certain, but we didn't think so.

The session concluded with a final thought, echoing many earlier conversations by this group: could crowdsourcing (user contributed efforts) be used as a means to help get the markup done?

Summary of MDG Session, 3-28-09

Article discussed: Allinson, Julie, Pete Johnston and Andy Powell. (January 2007) "A Dublin Core Application Profile for Scholarly Works." Ariadne 50.

As usual, the discussion group session began with time to talk about unfamiliar terminology or acronyms/terms from the article that were not fully explained. This month, JISC, UKOLN, and OAI-PMH were covered in this question period.

We then moved to a discussion of whether the list of functional requirements described by this article is really the right one. Topics covered included:
  • Should preservation be on the list? Governments and libraries generally have this as a requirement, but do researchers? The group believed that overall, yes, researchers need preservation but don't understand it as such - they just want to be able to find something they've seen again later.
  • Multiple versions. Preprint, edited version, and publisher PDF are all available and need to be managed. But maybe we don't need to keep them all - just tell users which one they're looking at. The NISO author version standard out there (in draft, maybe?) is setting up a common terminology for us to use. It's important to archive these things. One possible solution: keep the final version, and track earlier versions more as personal papers.
  • What about earlier work products like data sets and Excel spreadsheets? How much in-process work can/do we want to save? Data sets could be used by many different publications. We would need to make sure users can get to the final writeup easily without getting bogged down in the preliminary material. Managing these earlier work products would shift the focus from the writeup to the researcher. Both are important - maybe we deal with them in separate systems. One member brought up Darwin's Origin of Species – the text is online and you can see earlier drafts and how the work evolved over time. The work process has long been an interesting area of research that we could promote more. However, it raises issues of rights management and author control. Should we allow the researcher to be in control of deciding what to deposit? We'd have to have them choose while they're alive.
  • Unpublished works have different copyright durations. Are things in the IR published or not? Is a dissertation published or unpublished? Does placing it in a library “publish” it? AACR2 thinks dissertations are unpublished. But does copyright law?
  • What about peer-reviewed status? Is the peer reviewed/non peer reviewed vocabulary we tend to use now good enough? Are there things in the middle? Early versions of a paper won't have gone through the peer review process, and we need to track that one version is peer reviewed and another is not. Individual journal titles are peer reviewed or not, so we can guess a paper's status based on that. But in general we would probably want to get this information from the author - there are columns, etc. in peer-reviewed journals that aren't actually peer reviewed.
  • Participants noted the requirement to facilitate search and browse - not many of our IR systems now do this all that well!
  • A participant asked whether we should be providing access to these types of material by journal or conference. It duplicates work that others do. But for the preservation function this information is important.
  • Our functional requirements discussion wrapped up with participants noting that “cataloging” in these repositories doesn’t look like cataloging in our OPACs. Is this difference going to bite us later? This article describes data an author could never create. The authors obviously have decided cataloger-created data is worth the time and effort. It would be interesting to hear the rationale behind this decision.
We then turned to discussing the minimum data requirements described in the article.
  • Some of the minimum requirements seem very high end and difficult to know.
  • Participants wondered which of the listed attributes don't apply to large numbers of scholarly works. The following were identified: has translation, grant number, references.
  • The group then wondered if the authors had to be so flexible with minimum metadata requirements to allow authors to deposit their own material. Why wouldn’t authors want to do this? Time and effort seem to be big barriers. Even figuring out what version to deposit takes more time than most researchers care to spend.
  • Participants wondered how effective OA mandates are. In discussion, it was noted that they don't make it any easier to deposit, and researchers might think it's still not worth their time. A participant quoted data from one scientific conference that said if you publish with them you have to provide all your data. 50% provided the full data. 20% uploaded an empty file just to meet the "upload something" requirement!
  • Conclusion: better systems are a key to actually collecting and saving this stuff.
The discussion moved to pondering how author involvement in the archiving process is a fundamentally new requirement. We never asked researchers to deposit papers in the University Archives before. How do we decide what’s worth keeping? Should we really preserve all of this stuff? How do we get people to the right stuff? Is this a selection/appraisal issue or a metadata issue? Our final conclusions were that the model described in this article helps with creating more functional systems but doesn’t help with making the system easier to use. Minimum requirements for deposit might just be a first step, but to achieve our greater goals that data would likely need to be enhanced later.

Saturday, March 14, 2009

Summary of MDG Session, 2-12-09

Article discussed: Alexander, Arden and Tracy Meehleib (2001). "The Thesaurus for Graphic Materials: Its History, Use, and Future." Cataloging & Classification Quarterly 31(3/4): 189-212.

February's Metadata Discussion Group session was a lively one. The topic of subject vocabularies beyond LCSH sparked a great deal of interest. The session began with discussion of why a separate subject vocabulary for graphic materials was needed, especially in the Library of Congress. Some participants had even cataloged pictures or posters with LCSH, not knowing that other options existed. Participants realized the need for subject terms that were not in LCSH for describing photographic materials, but recognized the potential to add these terms to LCSH rather than starting a new subject vocabulary. The primary reason for needing a separation of subject vocabularies identified during this discussion was a difference in the level of specificity needed for cataloging visual material as opposed to textual material.

Participants then noted that LCSH and TGM I are structured differently; LCSH is a subject heading list while TGM I is a true thesaurus. While this is an important distinction to understand, the group was uncertain as to the specific implications for practice. Both are standardized vocabularies and are applied in a similar fashion. In the last 15 years LCSH has become more thesaurus-like in standardizing cross-reference structure and describing narrower, broader, and related terms instead of see and see also references.

Overall the discussion group thought that the existence of TGM has struck a reasonable balance between one big general vocabulary and lots of little specific ones. While TGM is specifically focused on graphic images, that is a big space and TGM can be applied in many ways. For a big image collection, a graphic materials-specific vocabulary is a great deal more useful than LCSH would be. The group expected image cataloging (and TGM use) to continue to grow as libraries focus more and more on special collections.

From here, the discussion moved on briefly to comparing the top-down design approach of TGM II with the bottom-up (literary warrant) approach of TGM I. A significant issue with the bottom-up approach was identified - that it is difficult and time-consuming to maintain a robust reference structure for a vocabulary that is constantly growing.

The topic of whether or not to subdivide TGM was a main focus of this month's discussion. A participant noted that the precoordinated approach has its origins in the printed card catalog, where it was necessary. Now that we are in online systems, this approach can be rethought. The subdivided approach takes more time to apply (this was consensus, but nobody knew of data to cite), and it's not possible to be as specific with geographic locations in subdivisions as it is with postcoordinated geographic headings. Postcoordinated approaches allow the user to decide which feature is of primary interest, rather than having one selected ahead of time. Subdivisions also introduce redundancy, as the same subdivisions are often applied to many main headings. But are there cases when they would be different? Would a TGM I heading ever have a different time period subdivision than a TGM II heading on the same record? Perhaps in the case of a contemporary poster of a historic event? It would be more difficult to make this distinction in a postcoordinated approach. A potential benefit of a precoordinated (subdivided) approach is the creation of a browse index. This is achieved in a different way via faceted browsing with the postcoordinated approach. The group felt strongly that the most important goal was to produce a product that is easy and understandable for our users. More user studies are needed to learn more about this issue.
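To make the precoordinated/postcoordinated contrast concrete, here is a rough sketch that splits an existing subdivided heading string into facets that could drive a faceted browse. The parsing rules (topic first, then place, time, and form subdivisions) and the tiny form-term list are simplifying assumptions; real heading strings are not this regular.

```python
# Sketch: turn a precoordinated heading string into postcoordinated facets.
import re

FORM_TERMS = {"posters", "photographs", "maps"}  # toy list for illustration

def facet_heading(heading):
    parts = [p.strip() for p in heading.split("--")]
    facets = {"topic": parts[0], "place": [], "time": [], "form": []}
    for part in parts[1:]:
        if re.search(r"\d{2,4}", part) or "century" in part.lower():
            facets["time"].append(part)
        elif part.lower() in FORM_TERMS:
            facets["form"].append(part)
        else:
            facets["place"].append(part)
    return facets

print(facet_heading("Agriculture--United States--1940-1950--Posters"))
# {'topic': 'Agriculture', 'place': ['United States'],
#  'time': ['1940-1950'], 'form': ['Posters']}
```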

The group then wondered what the literature on precoordinated vs. postcoordinated vocabularies looks like. Is there anything recent? Thomas Mann wrote recently on this topic, but no one was aware offhand of other recent work other than Lois Chan describing FAST.

At this point, the discussion turned to the type of training that would be necessary for someone to effectively apply the TGM. For TGM II (genre), the individual would need some level of background with formats of graphic materials. But for topic, participants thought that the same training to perform subject analysis on textual works would apply to graphical works. For some image materials, it is necessary to become familiar with important buildings and people likely to be in the collection, for example buildings on the IU campus and IU presidents for photographs in the University Archives.

The discussion wrapped up with thoughts on the relative lack of information inherent in the resource itself to help with the cataloging process for graphic materials as opposed to textual materials. Generally images come with at least some accompanying information that helps identify the content and its origin. Given at least a small amount of information, a cataloger would apply the same type of research techniques, including those applied for authority work, that are already in place in many cataloging units. Image description could be portrayed as an extension of existing work rather than a departure.

Monday, February 2, 2009

Summary of MDG Session, 1-29-09

Article read: Chris Freeland, Martin Kalfatovic, Jay Paige, and Marc Crozier. (December 2008). "Geocoding LCSH in the Biodiversity Heritage Library." The Code4Lib Journal 5. http://journal.code4lib.org/articles/52

As with many MDG sessions, this one began with a discussion of unfamiliar terminology in the article read. This article contained a few technical terms that members were interested in hearing more about, likely due to the fact that the audience for the Code4Lib Journal is primarily programmers rather than catalogers or metadata specialists. One participant wondered where the term "folksonomy" came from. Nobody had an answer, although some thought it had been around a decade or so, and Clay Shirky's "Ontology is Overrated" was mentioned. (The Wikipedia article on folksonomy credits the term to Thomas Vander Wal in the early 2000's.)

The substance of this month's discussion began by addressing the question: Would you catalog differently if you knew the data were to be used in this way? Participants noted that the burden is on the cataloger to verify and provide information that isn't immediately obvious from the resource itself. The limits of MARC/AACR2 practice (missing geographic headings in some cases) described in the article are very real – if the terms aren't there, you can't build this type of service. If you know the data is going to be used in this way, then you make more of an effort to provide it. Participants repeated an often-heard comment about MARC cataloging - that populating the fixed fields takes a great deal of effort, but few systems use them. This discourages catalogers from populating them, which discourages system designers from using them... The current environment doesn't make it easy to justify doing the work to create the structured data that's really needed to provide advanced services.

The conversation then turned to where geographic data to support a service like the one described in this article would be in a MARC record, if those records were created with this use in mind. One important point to note is that the level of specificity is different between the coded geographic values (043, country of publication in fixed fields) and what is present in LCSH subdivisions. The former are generally continent/country/state level, while the latter can be much more specific. Discussion of these fields led participants to note that these fields represent different things - the place something is published is of interest in different circumstances than the place something is about. This represents one area (of many) where system designers need to have an in depth understanding of the data. Building a resource with more consistent geographic data (say, always at the state level) would alleviate some of the challenges described in this article, but leave out users who are interested in more granular information than an implementation like this could provide.
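The contrast can be made concrete with a small sketch. It uses a plain dictionary as a stand-in for a parsed MARC record rather than any particular MARC library; the 043 code shown is a real MARC geographic area code for Missouri, while the headings and subdivisions are invented for illustration.

```python
# Coarse coded geography (043) vs. more specific LCSH geographic
# subdivisions ($z on 650) and geographic headings (651).
record = {
    "043": [{"a": ["n-us-mo"]}],                                   # coded area: Missouri
    "650": [{"a": ["Botany"], "z": ["Missouri", "Saint Louis Region"]}],
    "651": [{"a": ["Ozark Mountains"]}],                           # geographic heading
}

def coded_areas(rec):
    return [code for f in rec.get("043", []) for code in f.get("a", [])]

def specific_places(rec):
    places = [z for f in rec.get("650", []) for z in f.get("z", [])]
    places += [a for f in rec.get("651", []) for a in f.get("a", [])]
    return places

print(coded_areas(record))      # ['n-us-mo'] - country/state level at best
print(specific_places(record))  # ['Missouri', 'Saint Louis Region', 'Ozark Mountains']
```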

Some participants advocated that to promote services like the one described in this article, one should use a vocabulary that's designed specifically for geographic access only for this purpose, such as the Getty Thesaurus of Geographic Names or GeoNet. One advantage of these types of vocabularies is that they are based on "official" data of some sort (for example, the US government, the UN), whereas LCSH is based on literary warrant. LCSH therefore might not match up well with current and detailed places such as those one would ideally want for a resource map interface. Similarly, AACR2 treats some objects with geographic features (for example, buildings) as corporate bodies, which are subject to different rules for cross-references and the like.

Participants noted that there have been significant successes in geographic and user-friendly access in the MARC/AACR2/LCSH stack, however. MARC records for newspapers have a 752 field with semi-structured data listing country-state-county-city. The terms used in this field come from the authority file. The 752 field represents an early example of a field existing in response to user discovery needs. Would it be possible for us to generate this data automatically for other types of resources?

The conversation at this point moved to user behavior in general. A participant noted that at the
recent PCC meeting at ALA Midwinter, Karen Calhoun gave a presentation describing OCLC's recent research on user behavior. Their conclusion is that delivery is becoming more important than discovery. Does this mean libraries should start changing our priorities?

Different types of discovery were then briefly discussed, noting that one wants different things at different times. The subject search serves a different purpose than the keyword search. Especially for scholars, the former is useful for introductory and overview work. When delving deeper, looking for the obscure reference that will serve as a key piece of original research, the latter will be more useful. The "20% rule" for subject cataloging is one reason for this. Are tag clouds of subject headings therefore useful? Participants thought they were for some types of discovery. Other possibilities would be clouds of Table of Contents data and full text. All would have different uses, and for some the cloud presentation might be more effective than others.

A significant proportion of the discussion in the second half of the session revolved around ways to integrate different types of geographic access. The first topic on this theme was one of granularity - how specific should the geographic heading be? Why shouldn't we provide access to a famous neighborhood in a big city? Using the structure of a robust geographic vocabulary as part of a discovery system would help with this issue.

The changing of place names and agreed-upon boundaries over time was raised next. A place with a single name might have different boundaries over time. Political change is ongoing, and one place does not simply turn into another; maps are constantly being reorganized. Curated vocabularies such as LCSH and TGN take time to respond to these changes. Is it necessary to update older records when place names change? Participants settled on the standard answer: it depends. For resources such as biological specimens, current place names are likely to be more useful, to assist the researcher with understanding the relationships between them over time. For works about specific places, the place as it existed during the time described is more important.

The next issue raised in the session was that geographic places don't exist in a strict hierarchy. National parks, rivers, and lakes, for example, aren't contained within single states. LCSH headings exist for these, and for rivers there can be separate headings for each state the river crosses. Participants were not certain whether cross-references exist between the river name and the headings for all the states it crosses, which would create a machine-readable link between the two.

It was at this point that GIS technology came up as a solution. By defining everything as a polygon rather than a label with some classification of type ("state," "river," "park"), geometry can be used to retrieve places relevant to a specific point. Effectively connecting all of these overlapping but not exclusive things in traditional library authority files would be a challenge. Many other geographic-type units could be used for retrieval, including zip codes, area codes, and congressional districts. These change over time as well, further complicating the situation.
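A toy illustration of the polygon idea, using the third-party Shapely library (assumed to be available): the coordinates are rough invented rectangles, not real boundaries, but they show how several overlapping, non-hierarchical places can all be retrieved for a single point.

```python
# Point-in-polygon retrieval across overlapping "places" with Shapely.
from shapely.geometry import Point, Polygon

places = {
    "State A":         Polygon([(0, 0), (10, 0), (10, 10), (0, 10)]),
    "River X":         Polygon([(8, -2), (9, -2), (9, 12), (8, 12)]),   # crosses the state line
    "National Park Y": Polygon([(7, 7), (12, 7), (12, 12), (7, 12)]),
}

def places_containing(x, y):
    pt = Point(x, y)
    return [name for name, shape in places.items() if shape.contains(pt)]

print(places_containing(8.5, 8.5))  # ['State A', 'River X', 'National Park Y']
```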

The final issue raised in connection with geographic access was the notion of places being referred to with different names in different languages. Libraries are increasingly adding cross-references from multiple scripts and languages into authority files. This is a good thing, certainly. The lack of a 1:1 mapping from historic places makes this difficult. Even for the residents of a place, the dominant language changes over time and therefore the "official" name.

The Virtual International Authority File is attempting to address this issue by linking together names for the same places from multiple national authority files. It's a bit unclear what the status of this project is, though. LC and OCLC consistently report progress but no clear indication of when it's going to become a production system.

Wednesday, January 14, 2009

Summary of MDG Session, 12-18-08

Article discussed: Kurth, Marty, David Ruddy, and Nathan Rupp. (2004). "Repurposing MARC metadata: using digital project experience to develop a metadata management design." Library Hi Tech 22(2): 153-165.

The discussion group felt that while it was desirable that the work described in this article was based on theoretical work on metadata management, the explanation of the metadata management theory, including the concept of enterprise, was not extensive enough to fully understand the connection. It was clear, however, that to do management, you have to do mapping and transformation. Management allows you to rethink and retool. Our group was interested to know what has happened since this article was written. Have they put this into production? What has changed? It appears there is a follow-up article to this one that would be interesting to read.

The article claims that MARC mapping work is representative of the metadata management task as a whole. Choosing metadata standards based on specific project needs is good, and the projects described here demonstrate how to do that. It's easy to imagine a project where you can start with MARC. But what do you do when no MARC already exists? At IU we have experience in many library departments with projects that re-use existing MARC metadata.

The group identified three possible cases for metadata management for a digital project: have existing MARC, have existing non-MARC, have no existing structured metadata. Are the strategies outlined in this article useful in all three of these cases? We didn't come to a strong conclusion on this issue.

An interesting discussion grew up around the topic of how to deal with legacy (pre-AACR2) MARC records. Institutional memory is likely the best bet, as documentation comparing older practices and current ones is sparse. Political boundaries change, and places of publication become no longer correct. Some legacy data is easier to deal with, however. An institution could use an authority vendor to update name headings with death dates. Certain data elements should be updated over time, but others shouldn't. The group noted that most metadata work is bibliographic-record based and doesn't do enough with authority records. Making the full authority structure available to metadata creation staff is sorely needed.

A substantial amount of discussion time was spent on the topic of collection-specific mappings. The benefit, of course, is that these get it done the way you want it. The drawbacks are potentially reduced shareability and interoperability. One has to take the whole scope of the project into account to make good decisions and worry about what's really important. We have to keep the user in mind. This is difficult to do, though. We think "the user needs this information" when we should think "how can the user use this system?" One participant noted that we worry too much about the specialized discovery case to the detriment of the generalized one. How much tweaking of metadata mapping is of use? The community seems to swing back and forth over time between the generalized and specialized approaches.

The discussion then turned more theoretical, with thoughts on the changing roles of libraries – specifically, to what degree should we be the intermediary? If the user is on his or her own, should this change the way we provide access to information? We do see a great deal of evidence that libraries have moved to a model where users interact directly with information with no active intermediation from us. The system provides the intermediation that staff once did. We expect better technologies to automatically enrich our records in the future to help with this. For us, participants felt it was more important to get something out than to get it perfect. We need to make a better effort to integrate authority control into non-MARC environments. Automated methods will rely on authority records a great deal. It therefore follows that we should spend less time on bibliographic records and more on authority work. The MARC world is certainly moving in this direction, with professional catalogers doing more high-value activity, leaving the lower-value tasks to machines or lower-level staff. Mapping activities are an example of the higher-value activity, as seen in this article.

This article describes the most common transformation as MARC to simple DC. To make sure information gets into the right DC fields, one needs to understand DC. Those doing the mapping must ask: what is the essential information to go in DC? What really identifies rather than just describes? The role of the cataloger would be to oversee the transformation process, to make sure it works correctly. This would need to happen both on the content end and the technical end.
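For concreteness, here is a bare-bones sketch of such a crosswalk, again using a plain dictionary as a stand-in for a parsed MARC record rather than any particular MARC library. The field choices (245 to title, 100 to creator, 650 to subject, 260 $c to date) follow common crosswalk practice, but the record content is invented and a production mapping would handle far more fields and edge cases.

```python
# Minimal MARC-to-simple-DC crosswalk sketch over a dict-based record.
marc = {
    "100": [{"a": ["Austen, Jane,"], "d": ["1775-1817."]}],
    "245": [{"a": ["Pride and prejudice /"], "c": ["Jane Austen."]}],
    "260": [{"b": ["T. Egerton,"], "c": ["1813."]}],
    "650": [{"a": ["Courtship"]}, {"a": ["Social classes"]}],
}

def subfield_values(rec, tag, code):
    return [v for field in rec.get(tag, []) for v in field.get(code, [])]

def marc_to_dc(rec):
    return {
        "dc:title":   [t.rstrip(" /") for t in subfield_values(rec, "245", "a")],
        "dc:creator": [" ".join(f.get("a", []) + f.get("d", [])).rstrip(".")
                       for f in rec.get("100", [])],
        "dc:subject": subfield_values(rec, "650", "a"),
        "dc:date":    [d.strip(".") for d in subfield_values(rec, "260", "c")],
    }

print(marc_to_dc(marc))
# {'dc:title': ['Pride and prejudice'],
#  'dc:creator': ['Austen, Jane, 1775-1817'],
#  'dc:subject': ['Courtship', 'Social classes'], 'dc:date': ['1813']}
```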

What should the relationship of metadata staff to technical staff be? Metadata staff understand both the source and the target data. They would still have to correct things in the output in the end. It certainly helps if the technical staff understand the data as well. Similarly, metadata staff need to have technical skills. For metadata staff, understanding non-standard source data can be a big challenge. The Bradley films are an example of these challenges here at IU. Each set of materials will have a different balance of effort spent on it, based on perceived importance and use. Mapping often unearths mistakes in the original metadata. We must get the best bang for our buck by spending more time on the information that's really important for the users, and leave the rest alone. Effective projects will also need the involvement of collection development staff.

Summary of MDG Session, 11-19-08

Article read: Cundiff, Morgan V. (2004). "An Introduction to the Metadata Encoding and Transmission Standard (METS)." Library Hi Tech 22(1): 52-64.

The session began with a question raised: is allowing arbitrary descriptive and administrative metadata formats inside METS documents a good idea? The obvious advantage is that it makes METS very versatile. But this could also limit its scope – does that make METS only for digitized versions of physical things, excluding born digital material? The group as a whole didn't believe this was an inherent limitation. The ability to add authorized extension schema over time seems to be a good thing, and necessary for the external schema allowance to work.

The flexibility of METS allows it to be used beyond its textual origins – for scores, sound recordings, images, etc. It could potentially be useful beyond libraries, especially to archives and museums. Given this flexibility, is knowing that some sort of structured metadata is being presented enough to ensure a reasonable level of interoperability?

The discussion then turned to the TYPE attribute on <div>, a topic much discussed in the METS community. How does a METS implementer know what values to use? An organization will presumably develop its own practice, but the practices won't be the same across institutions. A clever name for this was suggested: "plantation" metadata – each place can develop its own.

Are there lessons from library cataloging that could help with this problem? Institutions dealing with the same types of material could join together and harmonize practices. METS Profiles provide the means for documenting this, but they don’t really encourage collaboration. Perhaps the expectation is that the metadata marketplace will converge, and those going their own way will lose out some significant benefits, and see it in their best interest to collaborate.

This line of thought led to the question - How did OCLC/LC/the library community get standardized in the first place? Probably because individuals would write up their own rules, then share them. Eventually these rules became shared practice. Maybe this same shift will happen when sharing really becomes a priority. Diverse practices will converge when people really want them to.

A question was then raised about when METS should be used instead of MARC. When is MARC not enough? A participant made the analogy that this was like comparing a plantation to a video arcade. The two are for different purposes, and METS can include descriptive metadata in any format, including MARC. If you want to allow a certain type of searching - say, a user wants to find a recording by a certain group - saying METS is better than MARC doesn't make sense. The descriptive metadata schema used within METS is what is going to make the difference in this case, not the use of METS itself. An implementer will still need good descriptive information.
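To make that point concrete, here is a minimal, illustrative METS document built with the standard library: a dmdSec wrapping a scrap of simple Dublin Core (it could just as easily wrap MARCXML or MODS), and a structMap whose <div> TYPE values are exactly the kind of locally chosen vocabulary discussed earlier. The identifiers, TYPE values, and title are invented for the example.

```python
# Build a tiny METS document: one dmdSec (wrapping DC) and one structMap.
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("mets", METS)
ET.register_namespace("dc", DC)

mets = ET.Element(f"{{{METS}}}mets")

# Descriptive metadata section, wrapping simple Dublin Core.
dmd = ET.SubElement(mets, f"{{{METS}}}dmdSec", ID="dmd1")
wrap = ET.SubElement(dmd, f"{{{METS}}}mdWrap", MDTYPE="DC")
xml_data = ET.SubElement(wrap, f"{{{METS}}}xmlData")
ET.SubElement(xml_data, f"{{{DC}}}title").text = "Sample pamphlet"

# Structural map; the TYPE values here are a locally chosen vocabulary.
struct = ET.SubElement(mets, f"{{{METS}}}structMap", TYPE="physical")
item = ET.SubElement(struct, f"{{{METS}}}div", TYPE="pamphlet", DMDID="dmd1")
ET.SubElement(item, f"{{{METS}}}div", TYPE="page", ORDER="1")
ET.SubElement(item, f"{{{METS}}}div", TYPE="page", ORDER="2")

print(ET.tostring(mets, encoding="unicode"))
```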

Participants then noted that we had been talking about systems, but we need to talk more about people. Conversations between communities with different practices will help improve interoperability. Can we standardize access points? To do this we would need to develop vocabularies collaboratively between communities, and talk more so that we understand each other’s point of view.

One participant made an extremely astute observation that the structure of METS makes it seem that it wasn't designed to be used directly by people. While metadata specialists often need to look at METS, and plan for what METS produced by an institution should look like, the commenter is correct that for the most part, METS is intended for machine consumption. A developer present noted that we could write an application that does a lot of what METS does without actually storing it in XML/METS – but the benefit of METS is abstracting out one more layer. Coming full circle to the flexibility issue from earlier in the discussion, it was noted that it is difficult to make standard METS tools (including parsers and generators) due to the almost infinite practices that must be accommodated. This led to the thought that perhaps METS could go much farther in being machine-friendly than it already is. That's a scary thought to metadata specialists who work with it!