Thursday, September 24, 2009

Summary of MDG Session, 9-17-09

Article Read: Schaffner, Jennifer. (May 2009). "The Metadata is the Interface: Better Description for Better Discovery of Archives and Special Collections." Report produced by OCLC Research. Published online at: http://www.oclc.org/programs/publications/reports/2009-06.pdf.


An online, user editable resource list accompanying this report can be found at https://oclcresearch.webjunction.org/mdidresourcelist.


While questions regarding terminology in Metadata Discussion Group sessions often focus on technological terms, this month they focused on terms from the archives sphere not commonly used in libraries. RLIN AMC was explained as an RLIN database of archives-format MARC records (before format integration), ISAD as the archival parallel to ISBD, and fonds as sets of materials organically created by an individual, family, or organization during the course of its regular work.

The group began the primary discussion by considering the third sentence in the report's Introduction: "These days we are writing finding aids and cataloging collections largely to be discovered by search engines." Participants wondered whether this statement was accurate, and if so, what it meant for our descriptive practices. The first reaction expressed was "So what?" OCLC records are exposed to Google through WorldCat.org – does this mean we're already starting to recognize the importance of search engine exposure? Another participant wondered whether the statement was true for all classes of users – we certainly have many different types, and presumably the studies cited in the report refer to different groups as well. Different types of users need different types of discovery tools. Regardless, there was a recognition that recent activities reflect a big paradigm shift for special collections – they are no longer "elite" materials open only to serious researchers bearing letters of recommendation. In wondering whether our descriptive practices need to change to reflect this new user base and these new discovery environments, participants noted there are ongoing efforts to pull more out of library- and archives-generated metadata, including structured vocabularies such as LCSH.

The discussion then turned to the report's presentation of users' interest in the "aboutness" of resources. How do we go about supporting this? If we digitize everything will that help? For textual records, relevancy ranking could definitely have an impact. But we can’t have it both ways – getting some level of description out quickly and describing things robustly seem to be antithetical. Can we do this in two phases – first get it out there, then let the scholar figure out what it’s about? Do archivists and catalogers have the background knowledge to do the “aboutness” cataloging?

User-supplied metadata could certainly be part of this solution. At SAA last month, there was a session on Web 2.0 in which one presenting repository touted the importance of user-supplied metadata for some of its materials. The repository reported that the user contributions needed some level of vetting but were useful overall. It was noted that just scanning is not enough, though – not all resources are textual, and those that are may be handwritten or in languages other than English, both of which pose challenges for automated transcription (OCR).

The group then wondered what other factors could be used in relevancy ranking algorithms, of which libraries are notoriously suspicious. Participants found intriguing the idea on p. 8 of the report that higher levels in a multi-level description be weighted more heavily. It was noted that perhaps the most common factors for relevance ranking are those that libraries don't traditionally collect: number of times cited, checked out, or clicked on in a search result set. Relationships between texts in print are not as robust as those on the Web, which may be evidenced by the fact that Google Book Search ranking doesn't seem to be as effective as Google's Web search ranking. Personal names, place names, and events might be weighted more heavily, as the report suggests those are of primary interest to users. We could also leverage our existing controlled vocabularies by weighting terms that appear in them more heavily than terms that do not, by "exploding" queries over full-text corpora to include synonyms, and, in systems where items are cataloged with controlled vocabularies, by mapping users' search terms to the terms in those vocabularies. Participants debated the degree to which a system should suggest alternatives versus changing queries and telling the user about it afterward. A rough sketch of these ideas follows.
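None of this was pinned down in the session, but a toy sketch may make the weighting and query-"explosion" ideas concrete. Everything below is illustrative – the level weights, the boost factor, and the two-entry stand-in for a vocabulary like LCSH are assumptions, not anything prescribed by the report:

```python
# Illustrative sketch of the ranking ideas above. All weights, field
# names, and vocabulary entries are hypothetical, not from the report.

# Weight hits at higher levels of a multi-level description more heavily.
LEVEL_WEIGHTS = {"collection": 3.0, "series": 2.0, "file": 1.5, "item": 1.0}

# A two-entry stand-in for a controlled vocabulary (e.g., LCSH),
# mapping variant terms to preferred terms.
SYNONYMS = {
    "wwii": "world war, 1939-1945",
    "shoah": "holocaust, jewish (1939-1945)",
}
CONTROLLED_TERMS = set(SYNONYMS.values())

def expand_query(terms):
    """'Explode' a query: keep each term and add its preferred
    controlled-vocabulary form, so both are searched."""
    expanded = set()
    for term in terms:
        term = term.lower()
        expanded.add(term)
        if term in SYNONYMS:
            expanded.add(SYNONYMS[term])
    return expanded

def score(record, terms):
    """Toy relevance score for a single descriptive record."""
    text = record["text"].lower()
    total = 0.0
    for term in expand_query(terms):
        if term in text:
            weight = LEVEL_WEIGHTS.get(record["level"], 1.0)
            if term in CONTROLLED_TERMS:
                weight *= 1.5  # boost controlled-vocabulary matches
            total += weight
    return total

records = [
    {"level": "item", "text": "Letter home, World War, 1939-1945"},
    {"level": "collection", "text": "Papers on World War, 1939-1945"},
]
# The collection-level record outranks the item-level one for "WWII".
ranked = sorted(records, key=lambda r: score(r, ["WWII"]), reverse=True)
```

In keeping with the debate above, a real system might surface the expanded terms to the user as suggestions rather than silently rewriting the query.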

The discussion then turned to a frequent topic in today's "future of libraries" conversations: getting our resources out to where the users are. Scholars in general make reasonable use of specialized portals, but not everyone knows how to do that. Can we "train" our users to go to IU resources if they're in the IU community? Many present think this approach is nearly hopeless. Could we guide users to appropriate online resources based on their status? Some participants noted that personalization efforts haven't been all that effective. We can't box people into specific disciplines – research is increasingly interdisciplinary. Even if users don't log in, though, we can capture their search strings and potentially use this data. We could count how many times something was looked at and use that in relevance ranking. This approach isn't perfect, of course – what's on the first page of results naturally gets clicked more. A click, or a checkout, isn't necessarily a positive review, though – could we capture negative reviews? We would certainly benefit from knowing more about how our resources are used. How extensive or serious is each use? Were things actually read? Could we put up a pop-up survey on our web site? Users can write reviews in WorldCat Local – should we do this too, or point people to those reviews? Participants noted there is still a role for the librarian/archivist as mediator, helping users understand what tools are available, then letting them use those tools on their own. When we don't have "aboutness" in our data, users can miss things, and the much-maligned "omniscient archivist" can fill in the gaps.
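As a rough illustration of the click-counting idea and its first-page bias, one common mitigation (not discussed in the session itself) is to discount each click by how often that result position attracts clicks regardless of what sits there. The baseline rates below are invented for the example:

```python
# Hypothetical sketch of click counts as a relevance signal, discounted
# for position bias so first-page items don't win automatically. The
# baseline click-through rates per result position are invented.
BASELINE_CTR = {1: 0.30, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05}

def adjusted_click_scores(click_log):
    """click_log: iterable of (item_id, position_shown) pairs.
    Each click is weighted by the inverse of its position's baseline
    click-through rate, so a click deep in the result list counts
    for more than a click at the top."""
    scores = {}
    for item_id, position in click_log:
        weight = 1.0 / BASELINE_CTR.get(position, 0.05)
        scores[item_id] = scores.get(item_id, 0.0) + weight
    return scores

log = [("finding-aid-42", 1), ("finding-aid-42", 1), ("diary-7", 5)]
print(adjusted_click_scores(log))
# {'finding-aid-42': ~6.67, 'diary-7': 20.0} -- the deep click weighs more
```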

The session closed with a discussion of the comprehensiveness issue mentioned in the report. If users don't trust resources they believe to be incomplete, what do we do? The quickest answer is "Never admit it!" No resource is ever truly comprehensive. Libraries certainly have put a positive spin on retroconversion projects, calling them "done" when large pockets of material remain unaddressed.