Wednesday, August 17, 2011
We've moved!
You may now find us at: https://blogs.libraries.iub.edu/metadata/.
Be sure to subscribe to the new RSS feed: https://blogs.libraries.iub.edu/metadata/feed/.
The archive will remain here for the time being but all future posts will appear at the link above.
Thanks!
Monday, March 28, 2011
Summary of MDG Session, 03-10-11
- Doctorow, Cory. (2001) “Metacrap: putting the torch to seven straw-men of the meta-utopia.” Available online at http://www.well.com/~doctorow/metacrap.htm
- Ardö, Anders. (2010) “Can We Trust Web Page Metadata?” Journal of Library Metadata, 10: 1, 58-74. Available online at http://dx.doi.org/10.1080/19386380903547008.
March's discussion began with an observation: both articles discuss pulling metadata from webpages, while libraries are typically in the business of pushing metadata. Ardö's article confirms the kinds of webpage metadata problems Doctorow identified nine years earlier. A non-cataloger seemed surprised that the quality of web metadata even needed to be studied at all: of course it's horrible! Search engines have had to develop smarter search algorithms and "did you mean __" services in order to combat poor or misleading metadata. Most search engine algorithms ignore metadata completely because web page metadata is unreliable.
The discussion diverted to the subject of crosswalking. A participant mentioned how library folks' definition of the term metadata (who wasn't tired of hearing the "data about data" definition parroted ad nauseam?) has evolved into something much more. The change seems to have coincided with the cataloging world's increasing familiarity with XML and XML technologies. A better understanding of XML changed the cataloging world's perception of how library metadata might become more web-ready and accessible.
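As a concrete illustration of what crosswalking looks like in practice, here is a minimal sketch that maps a few simplified MARC-style fields to Dublin Core XML. The field tags and the mapping are illustrative assumptions, not any institution's actual crosswalk.

```python
# Minimal crosswalk sketch: mapping a few simplified MARC-style fields to
# Dublin Core XML. The tags and mapping are illustrative only, not a
# complete or authoritative MARC-to-DC crosswalk.
import xml.etree.ElementTree as ET

MARC_TO_DC = {
    "245": "title",      # title statement -> dc:title
    "100": "creator",    # main entry, personal name -> dc:creator
    "650": "subject",    # topical subject heading -> dc:subject
    "260": "publisher",  # publication info -> dc:publisher (simplified)
}

def crosswalk(marc_fields):
    """Build a flat Dublin Core record from {tag: [values]} MARC-style input."""
    DC_NS = "http://purl.org/dc/elements/1.1/"
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")
    for tag, values in marc_fields.items():
        dc_element = MARC_TO_DC.get(tag)
        if dc_element is None:
            continue  # drop unmapped fields (a real crosswalk would log these)
        for value in values:
            el = ET.SubElement(record, f"{{{DC_NS}}}{dc_element}")
            el.text = value
    return ET.tostring(record, encoding="unicode")

print(crosswalk({
    "245": ["Metadata basics"],
    "100": ["Doe, Jane"],
    "650": ["Metadata", "Cataloging"],
}))
```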
One participant wondered: if library catalogs had more Google-like search capabilities, would discovery be improved? Most participants thought so, although one participant worried that language would be a barrier to search. The web was once a largely English-speaking entity but increasingly, that isn't the case. Search engines have a hard time determining language, especially if there are multiple languages represented on the page. Another participant warned that websites often contain misleading metadata in an attempt to drive up search engine optimization (SEO).
Does buying into the Google model and integrating our resources more with the web mean that libraries will need to accept that search engines impose value judgments on the content of the web? Some argued that the library catalog doesn't make value judgments in the way that Google Scholar can tell you who else cited the article you're reading, or in the way that Amazon can tell you what other customers bought in addition to the product you're currently purchasing. As a participant summed up: it is increasingly a world ruled by the "good-enough" principle. Searchers may not find the right thing, but they find something, and that's good enough.
A participant wondered how on-demand acquisition impacts collections and metadata. Another participant explained that on-demand materials are often accompanied by records of very poor quality. OCLC exacerbates the problem by merging records and retaining data (whether good or bad) in one big, messy record. A definite downside of on-demand purchasing was illustrated in a pilot project IU Libraries embarked on with NetLibrary. Ebooks were bought by the library after three patrons clicked on the link to view the ebook. Within six months, the $100,000 budget for on-demand ebook acquisition was gone. The participant admitted that some of the ebooks that were bought would not have been selected by a collection manager. It is likely that patrons clicked on the link to browse through the book and may have spent all of 30 seconds using the resource before clicking away to something else. As one participant pointed out, how many on-demand purchases could have been avoided if the accompanying metadata on the title splash page had included a table of contents?
The fact that users of the web expect all information to be free was also discussed. Good metadata isn't free, nor is the hosting of resources. This led to the question: how do young information seekers prefer to read? Do they read research papers and novels online? Do they seek out concise articles and skim for the bullet points? Are our assumptions about the information habits of undergraduate students correct?
The discussion moved on to a topic prompted by the statement: the internet has filters that users are often unaware of. Participants wondered about the polarizing effect this might have on users' perception of information. Search engines learn what you search for and display related ads. How might broad understanding of a topic be skewed if search engines put blinders on search results in a similar way? One must go out of one's way to seek out an opposing viewpoint because one isn't likely to see it while idly browsing the web.
In an attempt to end the discussion back on topic, the moderator asked, what are the minimum metadata requirements for a resource? A participant cited the CONSER Standard Record for Serials as an example of an effort to establish minimum metadata requirements. This standard was founded not on the traditional cataloging principle of "describe" but rather on the FRBR principle of "identify." What metadata is needed for a user to find and identify a resource? Implementing the CONSER Standard Record increased production in the IUL serials cataloging unit. It was conceded that minimum metadata requirements may differ depending upon the collection, material, owning institution, etc. One thing that was apparent, whatever the standard, is the need for a universal name authority file.
Tuesday, March 8, 2011
Summary of MDG Session, 02-03-11
Article discussed: Ascher, James P. (Fall 2009) "Progressing toward Bibliography; or, Organic Growth in the Bibliographic Record." RBM: a Journal of Rare Books, Manuscripts, and Cultural Heritage vol. 10, no. 2. Available online at http://progressivebibliography.org/wp-content/uploads/2010/06/95.pdf.
Moderators: Lori Dekydtspotter, Rare Books and Special Collections Cataloger, Lilly Library and Whitney Buccicone, Literature Cataloger, Lilly Library
The discussion began with a consideration of traditional cataloging models and how they measure up to assumptions made in the article. Ascher asserts that the cataloging of an item occurs once at full cataloging standards, making the cataloging process very time-intensive on the front end. However, in a shared cataloging utility such as OCLC, even full-level PCC records are often further revised by other member institutions. Many types of items are cataloged at a minimal level, perhaps because there is a mounting backlog, perhaps due to the nature of the collection. A cataloger with a government documents background pointed out the importance of controlling corporate body names and other headings in OCLC records—headings that are often left uncontrolled by PCC creators, thus requiring enhancement.
Mention of OCLC turned the conversation to a lamentation of the limits of cataloging tools. Rich copy-specific information that is required of special collections cataloging is handled poorly in OCLC, which is perhaps the price special libraries must pay for coming into the fold. Participants discussed problems that often arise when machines do the work of record cleanup and enhancement. OCLC's policy is to accept records from everyone and then merge records as needed—this makes a mess of item-level description. Clearly, we can't rely solely upon machines for metadata enrichment. It is unclear how applying the FRBR model to record creation might aid (or hinder) item-level metadata creation in a shared cataloging environment such as OCLC.
Not all of the discussion of OCLC as a tool for progressive bibliography was negative. It was observed that the infrastructure to support progressive bibliography seems to be in place; however, catalogers are not in the habit of using OCLC as a progressive cataloging tool. As one participant observed, technical services units routinely leave books sitting on frontlog shelves for six months, allowing time for full-level records to appear in OCLC. Another participant theorized that this “must-catalog-at-full-level” attitude in technical services departments, from acquisitions to cataloging to processing, may be linked to the fact that technical services still envisions records as paper files. We aren't printing cards anymore—a workflow in which cataloging would have to be full and complete—so why do we feel compelled to treat cataloging as a touch-it-once operation?
This isn't to say that progressive cataloging simply doesn't happen in technical services departments. As one participant pointed out, format often drives the need for progressive cataloging. For instance, serials catalogers constantly touch and retouch serial records due to the transitory nature of continuing resources. The cataloging of government documents sometimes requires retouching fully cataloged records without having the items in hand. Other times, there are collection management concerns that trigger the need to retouch records, for example, moving collections to an auxiliary off-site storage facility. Adding contents notes to the records of older multi-volume reference works makes requesting specific volumes possible. It seems that specific issues relating to material type and collection management lend themselves to progressive cataloging workflows.
The discussion turned to the topic of using non-catalogers—students, researchers and subject specialists—to create and enhance bibliographic records. Catalogers pointed out that, to a degree, this happens already. Examples included: cataloging units hiring graduate students with language expertise to create and edit records; the Library of Congress's crowd-sourcing of photograph tagging in the Flickr Commons; and LibraryThing’s crowd-sourcing of name authority work (among many other things) in the Common Knowledge system.
In an increasingly linked world, it is perhaps inevitable that participants were interested in how bibliographic records might be enhanced to include URLs linking to relevant information such as inventories, relevant blog posts about the item, etc. While the group didn't seem opposed to the idea, there were a couple of caveats. In our records, we must be very clear about what those links point to and how they relate to the resource being described. And what happens when links break? Are reporting tools in place to notify catalogers when a PURL is broken?
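To make the broken-link question concrete, here is a minimal sketch of the kind of reporting tool the group had in mind: a script that checks a batch of URLs harvested from records and flags the ones that no longer resolve. The example URL is hypothetical, and a production PURL-checking service would need to be considerably more robust.

```python
# Minimal link-checking sketch: given URLs pulled from bibliographic records,
# report the ones that appear broken. The example URL below is hypothetical.
import urllib.request
import urllib.error

def check_links(urls, timeout=10):
    """Return a list of (url, problem) pairs for links that appear broken."""
    broken = []
    for url in urls:
        request = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                if response.status >= 400:
                    broken.append((url, f"HTTP {response.status}"))
        except (urllib.error.URLError, OSError) as exc:
            broken.append((url, str(exc)))
    return broken

if __name__ == "__main__":
    for url, problem in check_links(["https://example.org/purl/12345"]):
        print(f"BROKEN: {url} ({problem})")
```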
We can't be sure how researchers will use our metadata in the future. From this idea, an interesting discussion arose regarding expected versus unanticipated usage of collections. Progressive bibliography ensures that bibliographic records remain relevant to current research trends. Researchers themselves are changing. A participant described an evolution in the type of researchers who are accessing special collections: patrons are less and less likely to be elite, serious researchers adept at navigating often complex discovery portals. Being able to serve both experienced and inexperienced researchers is paramount in a world in which collections are increasingly visible on the web.
Another idea that was explored was "on demand" progressive cataloging. A serials cataloger described how PCC/CONSER provides a list of available journals that have not yet been cataloged as a way of bolstering cooperative cataloging efforts. Another participant described two distinct queues for archives processing: one queue needing immediate, full-level processing and description, and another, minimally processed queue awaiting full description and access. Into the former queue falls "on demand" record enrichment driven by events, exhibitions, etc., for example, editing records relating to a well-known director's work shortly before that director appears at an event on campus. Following up on this idea, a rare books cataloger noted that "on demand" progressive bibliography often occurs at the request of donors who may want additional access points included in records for family members.
The meeting closed with a discussion of how catalogers of both general and special collections might implement a progressive bibliography/cataloging workflow. Everyone agreed that there was a need for more qualitative studies on the cost effectiveness of current MARC metadata production. One participant observed that non-MARC metadata project managers seem to be more amenable to using students and researchers to enhance metadata, leaving the initial processing and organization of materials to archivists and catalogers. Cited examples included the Victorian Women Writers Project, the Chymistry of Sir Isaac Newton, and much of the archival description and EAD creation in a number of archival units at the IUB campus. Are poorly designed tools the chief barrier to non-catalogers creating MARC metadata? Another participant responded that it's not just that the MARC metadata community needs better tools—cataloging needs simpler rules. Cataloging content standards such as AACR2, LCSH, and indeed, MARC itself are incredibly difficult to master. Is RDA an improvement over past content standards from a progressive bibliography perspective? Maybe. RDA's insistence on exact transcription of title page elements may provide someone who is enhancing the record at a later date with important clues about the author or publication that would otherwise be lost with AACR2 transcription rules. This RDA rule makes it easier to enhance a record or do NACO work at a later time without having the item in hand.
Thursday, November 5, 2009
Summary of MDG Session, 10-15-09
Article read:
- Nunberg, Geoff. "Google Books: A Metadata Train Wreck" Language Log blog, August 29, 2009. http://languagelog.ldc.upenn.edu/nll/?p=1701. Be sure to read through some of the comments, specifically the second comment left by Jon Orwant (Google engineer on the GBS team) on September 1, 2009 @ 1:51 am.
- Nunberg, Geoff. "Google's Book Search: A Disaster for Scholars," The Chronicle of Higher Education, August 31, 2009. http://chronicle.com/article/Googles-Book-Search-A/48245/
This month's Metadata Discussion Group began with a discussion of the tone of the blog post and article, and of the rhetoric in the community at large around the Google Book Search project. Participants expressed support for the idea that discussion needs to be reasoned and civil - neither Google nor libraries are all wrong or all right. It is more important to fix identified problems than to point fingers. One participant noted that the difference in tone between the blog post and the slightly later Chronicle article was telling. Nunberg's interest is clearly for the scholars, but this is more obvious in the Chronicle article than in the blog post. The Chronicle article immediately sets up a "this service is bad" tone by listing Elsevier as the first possible future owner! The Chronicle version doesn't even give Google a chance of keeping this service as its own.
Thinking about how to solve these problems led to a theme common in the Metadata Discussion Group sessions - what if we were to open up metadata editing to users? Wikipedia isn't consistent, surely - would that approach work here? A participant noted that OCLC itself is a cooperative venture and there are many inconsistencies there. Institutions futz with records locally and don't send them back to OCLC. CONSER had a history of record edit wars, and catalogers decided they just had to grit their teeth and deal with it.
Participants then noted that scholars aren't the only or perhaps even the primary audience for GBS. But should they be? A great deal of the content (though not all; some comes from publishers) comes from academic libraries that have built their collections primarily in support of scholarly activities. Shouldn't library partnerships come with some sort of responsibility on Google's part to pay attention to scholarly needs? For IU, the CIC, and other academic libraries, HathiTrust is attempting to fulfill this role, but is that enough?
The group next considered some of the statistics presented by the Google engineer in a comment on the blog post, including the claim of "millions" of problems and a BISAC accuracy rate of 90%. Participants guessed that fewer than 10% of the subject headings in our own catalogs are howlers, but there certainly are plenty of them in there. The redundancy in a MARC record provides additional text that could be used to avoid this kind of obvious error. We wondered whether Google is using any of this redundant information effectively.
The topic then turned back to whether Google Book Search should spend more effort meeting scholarly needs. What should they do differently to better support this kind of user? First, they should probably not rely on a single classification scheme. They don't necessarily need to stop using BISAC, but they could also use alternatives - that's what Google is about, more information! They're definitely getting LCSH from MARC records, despite LCSH's limitations. The LCC class number could be used to derive a "primary" subject, and could supply terms that don't appear elsewhere in the record. Participants noted that as the GBS database grows, each individual subject heading will start retrieving larger and larger result sets.
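A rough sketch of the LCC idea mentioned above: deriving a broad "primary" subject from the class letters of an LCC call number. The mapping table is a tiny illustrative subset of LCC's classes, not a complete scheme, and the function is an assumption about how such a derivation might work rather than anything Google or the group actually implemented.

```python
# Sketch: derive a broad "primary" subject from an LCC call number's class
# letters. The mapping is a small illustrative subset of LCC, not complete.
import re

LCC_CLASSES = {
    "PS": "American literature",
    "QA": "Mathematics",
    "Z":  "Bibliography. Library science",
    "ML": "Literature on music",
}

def primary_subject(call_number):
    """Return a broad subject for an LCC call number, longest prefix first."""
    match = re.match(r"[A-Z]+", call_number.strip().upper())
    if not match:
        return None
    letters = match.group(0)
    for length in range(len(letters), 0, -1):
        subject = LCC_CLASSES.get(letters[:length])
        if subject:
            return subject
    return None

print(primary_subject("QA76.9 .D3"))    # Mathematics
print(primary_subject("PS3545 .I345"))  # American literature
```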
The session closed with some musings on how the Google and library communities might better learn from one another. The notion of constructive conversation rather than disdain was raised again. Then participants noted that the GBS engineer commenting on the blog post invited comment. Individuals can take advantage of this invitation, and IU as a GBS partner can provide information and start conversations at yet another level.
Thursday, September 24, 2009
Summary of MDG Session, 9-17-09
Article Read: Schaffner, Jennifer. (May 2009). "The Metadata is the Interface: Better Description for Better Discovery of Archives and Special Collections." Report produced by OCLC Research. Published online at: http://www.oclc.org/programs/publications/reports/2009-06.pdf.
An online, user editable resource list accompanying this report can be found at https://oclcresearch.webjunction.org/mdidresourcelist.
While questions regarding terminology in Metadata Discussion Group sessions often focus on technological terms, this month they focused on terms from the archives sphere not commonly used in libraries.
The group began the primary discussion by considering the third sentence in the report's Introduction: "These days we are writing finding aids and cataloging collections largely to be discovered by search engines." Participants wondered if this statement was accurate, and if so, what it meant for our descriptive practices. The first reaction expressed was "So what?" OCLC records are exposed to Google through WorldCat.org - does this mean we're already starting to recognize the importance of search engine exposure? Another participant wondered if this statement were true for all classes of users - we certainly have many different types, and presumably the studies cited in the report refer to different groups as well. Different types of users need different types of discovery tools. Regardless, there is a recognition that recent activities reflect a big paradigm shift for special collections – they're no longer "elite" resources available only to serious researchers bearing letters of recommendation. In wondering whether our descriptive practices need to change to reflect this new user base and these new discovery environments, participants noted that there are ongoing efforts to pull more out of library- and archives-generated metadata, including structured vocabularies such as LCSH.
User-supplied metadata could certainly be part of this solution. At SAA last month, there was a session on Web 2.0 in which one presenting repository touted the importance of user-supplied metadata for some of their materials. The repository reported that the user contributions needed some level of vetting but overall were useful. It was noted that just scanning is not enough, though – not all resources are textual, and those that are may be handwritten or in languages other than English, both of which pose challenges for automated transcription (OCR).
The group then wondered what other factors could be used in relevancy ranking algorithms, which libraries are notoriously suspicious of. Participants found intriguing the idea on p. 8 of the report that higher levels in a multi-level description be weighted more heavily. It was noted that perhaps the most common factors for relevance ranking are those that libraries don't traditionally collect - the number of times an item is cited, checked out, or clicked on in a search result set. Relationships between texts in print are not as robust as those on the Web, which may be why Google Book Search ranking doesn't seem to be as effective as Google's web search ranking. Personal names, place names, and events might be weighted more heavily, as this report suggests those things are of primary interest to users. We could also leverage our existing controlled vocabularies by weighting terms from them more heavily than terms that are not, "exploding" queries against full-text corpora to include synonyms, and, in systems whose items are cataloged with controlled vocabularies, mapping search terms to the terms in those vocabularies. Participants debated the degree to which the system should suggest alternatives versus silently changing the query and telling the user about it afterward.
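A hedged sketch of two of the ideas above: boosting matches on controlled-vocabulary terms relative to free-text matches, and expanding a query with synonyms before scoring. The weights, synonym table, and toy records are all illustrative assumptions, not a real ranking formula.

```python
# Toy relevance ranking: weight controlled-vocabulary matches (e.g., subject
# headings) above free-text matches, and expand queries with synonyms first.
SYNONYMS = {"car": {"automobile", "automobiles"}}
CONTROLLED_TERM_WEIGHT = 3.0   # e.g., an LCSH heading match
FREE_TEXT_WEIGHT = 1.0

def expand_query(terms):
    expanded = set(terms)
    for term in terms:
        expanded |= SYNONYMS.get(term, set())
    return expanded

def score(record, query_terms):
    terms = expand_query({t.lower() for t in query_terms})
    controlled = {t.lower() for t in record.get("subjects", [])}
    free_text = set(record.get("description", "").lower().split())
    return (CONTROLLED_TERM_WEIGHT * len(terms & controlled)
            + FREE_TEXT_WEIGHT * len(terms & free_text))

records = [
    {"title": "Automobiles", "subjects": ["automobiles"],
     "description": "a history of the car"},
    {"title": "Trains", "subjects": ["railroads"],
     "description": "not about cars at all"},
]
for r in sorted(records, key=lambda r: score(r, ["car"]), reverse=True):
    print(score(r, ["car"]), r["title"])
```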
The session closed with a discussion of the comprehensiveness issue mentioned in the report. If users don't trust our resources when they believe them to be incomplete, what do we do? The quickest answer is "Never admit it!" No resource is ever truly comprehensive. Libraries certainly have put a positive spin on retroconversion projects, calling them "done" when large pockets of material remain unaddressed.
Saturday, August 8, 2009
Summary of MDG Session, 5-28-09
Terminology issues discussed this month included "cooperative identities hub" (is this OCLC's term? Yes) and API.
The meat of the session began with a discussion of the statement in the report that a preferred form of a name depends on context. Is this a switch for the library community? National authority files tend not to do it this way, but merging international files raises these same issues – they might transliterate Cyrillic differently for example. The VIAF project is having to deal with this issue. The group believes Canada, where this issue likely is raised a great deal due to its bilingual nature, just picks one form and goes with it.
Context here could mean many things: 1) showing a different form in different circumstances, 2) including contextual information about the person in the record, etc. For #2, work is going on right now (or at least in the planning stages) to minimize how much language-specific content goes into a record. One could then code each field by the language used in the citation. Library practice already includes vernacular forms in 400 fields – these could then serve as the primary form in a catalog in another language. But the coding doesn't yet distinguish which form is really a cross-reference and which would be preferred in some other language, so this might not be as useful as it would seem at first.
To achieve the first interpretation of flexible context, a 400 field in an authority record can no longer mean "don't use this one, use the other one." The authority file currently serves two purposes: allowing catalogers to justify headings and allowing systems to map cross-references automatically. Displaying a name form based on context is definitely a new use case for these authority files.
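As a minimal sketch of the "preferred form depends on context" use case, the snippet below stores several language-tagged forms for one identity and selects a display form based on the catalog's language, falling back to a default. The data structure and language codes are illustrative assumptions, not an actual authority-record coding practice.

```python
# Illustrative authority entry with language-tagged name forms; display
# picks the form matching the catalog's language, else a default form.
AUTHORITY_ENTRY = {
    "id": "n000000",  # hypothetical identifier
    "forms": {
        "eng": "Tolstoy, Leo, 1828-1910",
        "rus": "Толстой, Лев Николаевич, 1828-1910",
    },
    "default": "eng",
}

def display_form(entry, catalog_language):
    forms = entry["forms"]
    return forms.get(catalog_language, forms[entry["default"]])

print(display_form(AUTHORITY_ENTRY, "rus"))  # vernacular form for a Russian-language catalog
print(display_form(AUTHORITY_ENTRY, "fre"))  # falls back to the default (English) form
```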
Participants wondered, if we add more information to our authority files to make them useful for other purposes, how do we ensure they still fulfill their primary purpose? Should our authority records become biographies? The group reached basic consensus that adding new material to these records won't substantively take away from the current disambiguation function.
The group then turned to privacy issues raised by the expanding functions of authority files. One individual noted that in the past, researchers went to reference books to find information on people. Has this information moved into the public sphere? An author's birth date is often on the book jacket. Notable people are in Wikipedia. The campus phone book is not private information. In the archival community, context is everything. Overall, the group felt we didn't need to worry too much about the privacy issue – for the most part, functionality trumps privacy concerns. We still need to be careful, but it looks like we're seeing an evolution in what we think of as privacy. We no longer equate privacy with information that is technically public but hard to get to; privacy and access control by obscurity is no longer a viable practice. One solution would be to keep some data private for some period of time. Some authors don't want to give their birth date or middle initial for privacy reasons, and might respond better to a situation where this information is stored but not made openly public. Participants noted that this additional data is generally not needed to justify headings or cross-references. But with expanded functionality, we'll need expanded data.
One participant asked, almost rhetorically: should the authority file be a list of your works, or also a list of your DUIs? In the archival community especially, the latter helps in understanding the person. How far should we go?
The Institutional Repository use case was the next topic of discussion. When collecting faculty publications, it would help to expand the scope of the authority file at the national level, but at the local level many are already struggling with these issues. Do A&I firms do name collocation? Participants didn't think so.
Participants then wondered about the implications of opening up services to contributions from non-catalogers. Some felt we needed to just do it. Others thought opening to humans was a good idea but that buggy machine processes could cause havoc. Even OCLC has a great deal of trouble with batch processing (duplicate detection, etc.), and they're better at this than anyone else in our sphere. For human edits, the same issues apply as with Wikipedia, but our systems don't get as many eyes. What is the right model for vetting and access control? Who is an authorized user?
Participants believed we need to keep the identities hub separate from the main name authority file for a while to work out issues before expanding the scope of the authority file significantly. The proposed discussion model in the report (p. 7) will help with the vandalism issue. The proposal flips the authority file model on its head, with lots of people adding data rather than just a few highly trained individuals. A participant wondered if the NAF in the end becomes the identities hub. Maybe the NAF feeds the identities hub instead.
Discussion then moved to the possibility of expanding this model to other things beyond names. Geographic places might benefit, but probably not subjects - this process contradicts the very idea of a controlled vocabulary. One participant noted a hub model could be used to document current linguistic practice with regards to subjects.
The session concluded with participants noting that authority control is the highest value activity catalogers do. The data that’s created by this process is the most useful of our data beyond libraries. We need to coordinate work and not duplicate effort.
Summary of MDG Session, 4-23-09
The conversation began by addressing terminology issues, as is typical of our Metadata Discussion Group sessions. "Stylesheet" and "boilerplate" were among the unfamiliar terms. One participant noted “expanded linking” is like the “paperless society” - a state much discussed but never remotely achieved.
Discussion of the points in the article began with a participant noting that the TEI is a set of guidelines rather than an official "standard." Is this OK? Can we feel safe using it? The group believed that if there is no "standard" for something, then adopting guidelines is fine, with the idea that something is better than nothing. Is there a standard that would compete with TEI? DocBook is the only real alternative, and it hasn't been well adopted in the cultural heritage community. Participants wondered if the library community should push TEI toward standardization.
An interesting question then arose: do the TEI's roots in the humanities make it less useful for other types of material? The problems with drama described in the article would extend to other formats too. What about music? How much should TEI expand into this and other areas?
Discussion at this point moved to how to implement TEI locally. Participants noted that local guidelines are necessary and should be informed by other projects. Having a standard or common best practices is powerful, but that still leaves lots of room for local interpretation. Local practice is a potential barrier to interoperability - for example, a display stylesheet won't work anymore if you start using tags that aren't in the stylesheet. Local implementations have to plan ahead of time for how the TEI will be used. In the library community, we create different levels of cataloging – encoding could follow the same model. Participants noted that we should do user studies to guide our local implementations.
The group performed an interesting thought experiment examining the many different ways TEI could be implemented, using Romeo and Juliet as an example. Begin with a version originally in print. Then someone typed it into Project Gutenberg so it was on the Web. Then someone figured out they needed scene markers, so someone had to go back and encode for that. New uses mean we need new encoding. How do we balance adding more value to existing core materials against taking on new material? A participant noted that this is not a new problem - metadata has always been dynamic. The TEI tags for very detailed work are there, which makes it very tempting to do more encoding than a project specifies. Take the case of IU presidents' speeches. Do these need TEI markup, or is full-text searching enough? It would be fascinating and fun to pull together all sorts of materials – primary, secondary, sound recordings of the speeches themselves. But where does this type of treatment fall in our overall priorities?
A participant asked the group to step back and consider what TEI can do that full-text searching cannot. Answers posed by the group included collocation and disambiguation of names, date searching for letters, and pulling out and displaying just the stage directions from a dramatic text.
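To make the stage-direction example concrete, here is a small illustration of the kind of query TEI markup enables and plain full-text search does not: extracting the stage elements from an encoded dramatic text. The fragment is a toy example, not a complete TEI document.

```python
# Pull stage directions out of a tiny TEI-encoded dramatic fragment.
# The fragment below is a toy example for illustration only.
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
fragment = f"""
<sp xmlns="{TEI_NS}">
  <speaker>Juliet</speaker>
  <stage>Enters above, at a window.</stage>
  <l>O Romeo, Romeo! wherefore art thou Romeo?</l>
</sp>
"""

root = ET.fromstring(fragment)
for stage in root.iter(f"{{{TEI_NS}}}stage"):
    print(stage.text.strip())
```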
We then returned to the notion of drama. It's hard to deal with plays both as literature and as performance. Is this like treating sheet music bibliographically versus archivally? Here the cover could be a work of art, the music notation marked up in MusicXML, and the text marked up in TEI. Nobody knew of an implementation that deals with this multiplicity of perspectives well. Something text-based has trouble dealing with time, for example. Participants noted that TEI is starting to deal with this issue now, but it's certainly a difficult problem.
A participant wondered what would happen if we were to just put up images (or dirty OCR for typewritten originals - certainly not all of our material) and mark them up later. This would be the "more product, less process" approach currently in favor in the archival world. It would also be in keeping with current efforts to focus on unique materials and special collections rather than mass-produced and widely held materials.
Participants wondered whether Google Book Search and HathiTrust do TEI markup. Nobody in the room knew for certain, but we didn't think so.
The session concluded with a final thought, echoing many earlier conversations by this group: could crowdsourcing (user contributed efforts) be used as a means to help get the markup done?