Wednesday, August 17, 2011

We've moved!

The IUB Metadata Discussion Group blog has been migrated to the IUB Libraries' blog service.

You may now find us at: https://blogs.libraries.iub.edu/metadata/.

Be sure to subscribe to the new RSS feed: https://blogs.libraries.iub.edu/metadata/feed/.

The archive will remain here for the time being, but all future posts will appear at the link above.

Thanks!

Monday, March 28, 2011

Summary of MDG Session, 03-10-11

Articles discussed: Doctorow, Cory. (2001) "Metacrap: Putting the torch to seven straw-men of the meta-utopia." and Ardö, Anders. (2010) "Can We Trust Web Page Metadata?" Journal of Library Metadata, vol. 10, no. 1.
Moderator: Dot Porter, Associate Director for Digital Library Content & Services, Digital Library Program

March's discussion began with an observation: both articles discuss pulling metadata from webpages, whereas libraries are typically in the business of pushing metadata. Ardö's article confirms the kinds of webpage metadata problems Doctorow identified nine years earlier. A non-cataloger seemed surprised that the quality of web metadata even needed to be studied at all: of course it's horrible! Search engines have had to develop ever-smarter search algorithms and "did you mean __" services to combat poor or misleading metadata, and most search engine algorithms now ignore embedded metadata completely because it is so unreliable.
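To make the "pulling" side concrete, here is a minimal sketch, in Python, of harvesting whatever meta tags a page self-reports (the URL is a placeholder). Everything such a harvester returns is self-asserted and unverified, which is precisely the reliability problem both articles describe.

```python
# Harvest <meta> tags with the standard library only.
from html.parser import HTMLParser
from urllib.request import urlopen

class MetaTagHarvester(HTMLParser):
    """Collects name/content pairs from <meta> tags."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr_map = dict(attrs)
            name = attr_map.get("name") or attr_map.get("property")
            if name and "content" in attr_map:
                self.meta[name.lower()] = attr_map["content"]

# Placeholder URL; substitute any page you want to inspect.
with urlopen("https://example.org/") as response:
    html = response.read().decode("utf-8", errors="replace")

harvester = MetaTagHarvester()
harvester.feed(html)
# Whatever comes back is self-reported and unverified -- exactly the
# problem Doctorow and Ardö describe.
print(harvester.meta)
```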

The discussion then turned to the subject of crosswalking. A participant mentioned the evolution of the way library folks define the term metadata (who wasn't tired of hearing the "data about data" definition parroted ad nauseam?) into something much richer. The change seems to have coincided with the cataloging world's increasing familiarity with XML and XML technologies: a better understanding of XML changed the cataloging world's perception of how library metadata might become more web-ready and accessible.
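As a toy illustration of what a crosswalk does, the sketch below renames a handful of simple Dublin Core elements to MODS-like paths. The mapping table is illustrative only; a real crosswalk (such as the Library of Congress's DC-to-MODS mapping) must also handle repeatable fields, attributes, and lossy correspondences.

```python
# Map flat Dublin Core element names onto MODS-like element paths.
# The mapping and the flat-dictionary record model are illustrative.
DC_TO_MODS = {
    "title": "titleInfo/title",
    "creator": "name/namePart",
    "date": "originInfo/dateIssued",
    "subject": "subject/topic",
}

def crosswalk(dc_record: dict) -> dict:
    """Rename fields per the mapping; unmapped fields are (lossily) dropped."""
    return {DC_TO_MODS[k]: v for k, v in dc_record.items() if k in DC_TO_MODS}

print(crosswalk({"title": "Metacrap", "creator": "Doctorow, Cory",
                 "format": "text/html"}))  # "format" is silently lost
```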

One participant wondered: if library catalogs had more Google-like search capabilities, would discovery be improved? Most participants thought so, although one worried that language would be a barrier to search. The web was once a largely English-language space, but that is increasingly no longer the case, and search engines have a hard time determining a page's language, especially when multiple languages appear on the same page. Another participant warned that websites often contain misleading metadata planted to game search engine rankings, the dark side of search engine optimization (SEO).
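A naive sketch hints at why language determination is hard: counting stopword hits per language yields ambiguous scores on mixed-language pages. The tiny stopword lists here are illustrative, not a real detector.

```python
# Score text against per-language stopword lists; mixed-language input
# scores for several languages at once, which is the ambiguity problem.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in"},
    "fr": {"le", "la", "et", "de", "un"},
    "de": {"der", "die", "und", "das", "ein"},
}

def guess_language(text: str) -> dict:
    words = text.lower().split()
    return {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}

# A bilingual snippet produces hits for English and French simultaneously.
print(guess_language("the catalog de la bibliothèque and le portail"))
```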

Does buying into the Google model and integrating our resources more deeply with the web mean that libraries will need to accept that search engines impose value judgments on the content of the web? Some argued that the library catalog doesn't make value judgments in the way that Google Scholar does when it tells you who else cited the article you're reading, or that Amazon does when it tells you what other customers bought alongside the product you're purchasing. As a participant summed up: it is increasingly a world ruled by the "good-enough" principle. Searchers may not find the right thing, but they find something, and that's good enough.

A participant wondered how on-demand acquisition impacts collections and metadata. Another participant explained that on-demand materials are often accompanied by records of very poor quality. OCLC exacerbates the problem by merging records and retaining data (whether good or bad) in one big, messy record. A definite downside of on-demand purchasing was illustrated by a pilot project IU Libraries embarked on with NetLibrary: an ebook was bought by the library after three patrons had clicked the link to view it. Within six months, the $100,000 budget for on-demand ebook acquisition was exhausted. The participant admitted that some of the ebooks bought this way would not have been selected by a collection manager; it is likely that patrons clicked the link merely to browse and may have spent all of thirty seconds with the resource before clicking away to something else. As one participant asked, how many on-demand purchases could have been avoided if the metadata on the title splash page had included a table of contents?
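The pilot's trigger rule is simple enough to sketch. The three-patron threshold comes from the discussion above; the function and field names are hypothetical.

```python
# Buy an ebook once a set number of distinct patrons have opened it.
from collections import defaultdict

PURCHASE_THRESHOLD = 3          # per the NetLibrary pilot described above
views = defaultdict(set)        # ebook id -> patron ids who opened it
purchased = set()

def record_view(ebook_id: str, patron_id: str) -> None:
    """Log a view; trigger a purchase at the threshold of distinct patrons."""
    views[ebook_id].add(patron_id)
    if ebook_id not in purchased and len(views[ebook_id]) >= PURCHASE_THRESHOLD:
        purchased.add(ebook_id)
        print(f"Triggered purchase of {ebook_id}")

# Three thirty-second browses are indistinguishable from three real uses.
for patron in ("p1", "p2", "p3"):
    record_view("ebook-42", patron)
```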

The expectation among web users that all information should be free was also discussed. Good metadata isn't free, nor is the hosting of resources. This led to the question: how do young information seekers prefer to read? Do they read research papers and novels online? Do they seek out concise articles and skim for the bullet points? Are our assumptions about the information habits of undergraduate students correct?

The discussion moved on to a topic prompted by the statement: the internet has filters that users are often unaware of. Participants wondered about the polarizing effect this might have on users' perception of information. Search engines learn what you search for and display related ads; how might broad understanding of a topic be skewed if search engines put blinders on search results in a similar way? One must go out of one's way to seek out an opposing viewpoint, because one isn't likely to see it while idly browsing the web.

In an attempt to bring the discussion back on topic, the moderator asked: what are the minimum metadata requirements for a resource? A participant cited the CONSER Standard Record for Serials as an example of an effort to establish minimum metadata requirements. That standard was founded not on the traditional cataloging principle of "describe" but rather on the FRBR principle of "identify": what metadata does a user need to find and identify a resource? Implementing the CONSER Standard Record increased production in the IUL serials cataloging unit. It was conceded that minimum metadata requirements may differ depending upon the collection, material, owning institution, etc. One need was apparent whatever the standard: a universal name authority file.
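A minimum-requirements profile is easy to enforce mechanically once it is agreed upon. The sketch below checks a record against a required field set; the field list is hypothetical and is not the actual CONSER Standard Record specification.

```python
# Validate that a record carries the fields a user needs to find and
# identify the resource. The required set is an illustrative stand-in.
REQUIRED_FIELDS = {"title", "identifier", "publisher", "dates"}

def missing_fields(record: dict) -> set:
    """Return whichever minimally required fields the record lacks."""
    return {f for f in REQUIRED_FIELDS if not record.get(f)}

record = {"title": "Journal of Library Metadata", "identifier": "1938-6389"}
print(missing_fields(record))  # reports the fields still needed
```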

Tuesday, March 8, 2011

Summary of MDG Session, 02-03-11

Article discussed: Ascher, James P. (Fall 2009) "Progressing toward Bibliography; or, Organic Growth in the Bibliographic Record." RBM: a Journal of Rare Books, Manuscripts, and Cultural Heritage vol. 10, no. 2. Available online at http://progressivebibliography.org/wp-content/uploads/2010/06/95.pdf.
Moderators: Lori Dekydtspotter, Rare Books and Special Collections Cataloger, Lilly Library; and Whitney Buccicone, Literature Cataloger, Lilly Library

The discussion began with a consideration of traditional cataloging models and how they measure up to assumptions made in the article. Ascher asserts that the cataloging of an item occurs once, at full cataloging standards, making the cataloging process very time-intensive on the front end. However, in a shared cataloging utility such as OCLC, even full-level PCC records are often further revised by other member institutions, and many types of items are cataloged at a minimal level, perhaps because of a mounting backlog, perhaps due to the nature of the collection. A cataloger with a government documents background pointed out the importance of controlling corporate body names and other headings in OCLC records, headings that are often left uncontrolled by PCC record creators and thus require later enhancement.

Mention of OCLC turned the conversation to a lamentation of the limits of cataloging tools. The rich copy-specific information required in special collections cataloging is handled poorly by OCLC, which is perhaps the price special libraries must pay for coming into the fold. Participants discussed the problems that often arise when machines do the work of record cleanup and enhancement: OCLC's policy is to accept records from everyone and then merge them as needed, which makes a mess of item-level description. Clearly, we can't rely solely upon machines for metadata enrichment. It remains unclear how applying the FRBR model to record creation might aid (or hinder) item-level metadata creation in a shared cataloging environment such as OCLC.
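A small sketch of why blind merging is hostile to item-level description, using illustrative field names rather than real MARC: copy-specific notes from two holdings collapse into one record, and nothing preserves which copy each note describes.

```python
# Naively merge two records for the "same" title. List-valued fields are
# concatenated, so copy-specific notes pile up without provenance.
master = {"title": "Leaves of grass",
          "copy_notes": ["Author's presentation copy"]}
incoming = {"title": "Leaves of grass",
            "copy_notes": ["Lacks front wrapper"]}

def naive_merge(a: dict, b: dict) -> dict:
    merged = dict(a)
    for key, value in b.items():
        if isinstance(value, list):
            merged[key] = merged.get(key, []) + value  # notes pile up
        else:
            merged.setdefault(key, value)
    return merged

# Both notes survive, but the record no longer says which copy has which.
print(naive_merge(master, incoming))
```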

Not all of the discussion of OCLC as a tool for progressive bibliography was negative. It was observed that the infrastructure to support progressive bibliography seems to be in place; however, catalogers are not in the habit of using OCLC as a progressive cataloging tool. As one participant observed, technical services units routinely leave books sitting on frontlog shelves for six months, allowing time for full-level records to appear in OCLC. Another participant theorized that this “must-catalog-at-full-level” attitude in technical services departments, from acquisitions to cataloging to processing, may be linked to the fact that technical services still envisions records as paper files. We aren't printing cards anymore—a workflow in which cataloging would have to be full and complete—so why do we feel compelled to treat cataloging as a touch-it-once operation?

This isn't to say that progressive cataloging simply doesn't happen in technical services departments. As one participant pointed out, format often drives the need for progressive cataloging: serials catalogers constantly touch and retouch serial records because of the transitory nature of continuing resources, and the cataloging of government documents sometimes requires retouching fully cataloged records without having the items in hand. Other times, collection management concerns trigger the need to retouch records, for example, moving collections to an auxiliary off-site storage facility. Adding contents notes to the records of older multi-volume reference works makes requesting specific volumes possible, as sketched below. It seems that specific issues relating to material type and collection management lend themselves to progressive cataloging workflows.
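Here is a minimal sketch of that kind of enrichment: appending a formatted contents note (MARC field 505) to an existing record so that individual volumes become requestable. The record is a simplified in-memory stand-in, not a real MARC library's API.

```python
# A record modeled as tag -> list of subfield dicts, simplified from MARC.
record = {
    "245": [{"a": "Dictionary of national biography"}],
    "300": [{"a": "63 v."}],
}

def add_contents_note(rec: dict, contents: list) -> None:
    """Append a formatted contents note (505) listing each volume."""
    rec.setdefault("505", []).append({"a": " -- ".join(contents)})

# A later pass, item in hand or working from an inventory, enriches the
# record without disturbing the original description.
add_contents_note(record, ["v. 1. Abbadie-Anne", "v. 2. Annesley-Baird"])
print(record["505"])
```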

The discussion turned to the topic of using non-catalogers—students, researchers and subject specialists—to create and enhance bibliographic records. Catalogers pointed out that, to a degree, this happens already. Examples included: cataloging units hiring graduate students with language expertise to create and edit records; the Library of Congress's crowd-sourcing of photograph tagging in the Flickr Commons; and LibraryThing’s crowd-sourcing of name authority work (among many other things) in the Common Knowledge system.

Living in an increasingly linked world, it is perhaps inevitable that participants were interested in how bibliographic records might be enhanced with URLs linking to relevant information, such as inventories or blog posts about the item. While the group didn't seem opposed to the idea, there were a couple of caveats. In our records, we must be very clear about what those links point to and how they relate to the resource being described. And what happens when links break? Are reporting tools in place to notify catalogers when a PURL is broken?
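A link-audit report of the sort the group asked about can be sketched with the Python standard library alone: try each URL stored in records and flag the ones that fail. The URLs here are placeholders, and a production checker would also need to handle servers that reject HEAD requests.

```python
# Flag unreachable URLs so catalogers can be notified of broken links.
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def check_links(urls):
    broken = []
    for url in urls:
        try:
            req = Request(url, method="HEAD")  # cheap reachability probe
            with urlopen(req, timeout=10):
                pass  # reachable; nothing to report
        except (HTTPError, URLError, TimeoutError):
            # Some servers reject HEAD; a real checker would retry with GET.
            broken.append(url)
    return broken

print(check_links(["https://blogs.libraries.iub.edu/metadata/"]))
```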

We can't be sure how researchers will use our metadata in the future. From this idea arose an interesting discussion of expected versus unanticipated usage of collections. Progressive bibliography helps bibliographic records remain relevant to current research trends. Researchers themselves are changing: a participant described an evolution in the type of researchers who access special collections, with fewer and fewer patrons being the elite, seasoned researchers adept at navigating complex discovery portals. Being able to serve both experienced and inexperienced researchers is paramount in a world in which collections are increasingly visible on the open web.

Another idea explored was "on demand" progressive cataloging. A serials cataloger described how PCC/CONSER provides a list of available journals that have not yet been cataloged as a way of bolstering cooperative cataloging efforts. Another participant described two distinct queues for archives processing: one needing immediate, full-level processing and description, and another, minimally processed, awaiting full description and access. Into the former queue falls "on demand" record enrichment driven by events, exhibitions, and the like, for example, editing records relating to a well-known director's work shortly before that director appears at an event on campus. Following up on this idea, a rare books cataloger noted that "on demand" progressive bibliography often occurs at the request of donors, who may want additional access points for family members included in records.

The meeting closed with a discussion of how catalogers of both general and special collections might implement a progressive bibliography/cataloging workflow. Everyone agreed that there was a need for more studies of the cost-effectiveness of current MARC metadata production. One participant observed that non-MARC metadata project managers seem more amenable to using students and researchers to enhance metadata, leaving the initial processing and organization of materials to archivists and catalogers. Cited examples included the Victorian Women Writers Project, the Chymistry of Sir Isaac Newton, and much of the archival description and EAD creation in a number of archival units on the IUB campus.

Are poorly designed tools the chief barrier to non-catalogers creating MARC metadata? Another participant responded that it's not just that the MARC metadata community needs better tools; cataloging needs simpler rules. Standards such as AACR2 and LCSH, and indeed the MARC format itself, are incredibly difficult to master. Is RDA an improvement over past content standards from a progressive bibliography perspective? Maybe. RDA's insistence on exact transcription of title page elements may give someone enhancing the record at a later date important clues about the author or publication that would be lost under AACR2's transcription rules, making it easier to enhance a record or do NACO work later without having the item in hand.