Saturday, August 8, 2009

Summary of MDG Session, 5-28-09

Article discussed: Smith-Yoshimura, Karen. 2009. "Networking Names." Report produced by OCLC Research.

Terminology issues discussed this month included "cooperative identities hub" (is this OCLC’s term? Yes) and API.

The meat of the session began with a discussion of the report's statement that a preferred form of a name depends on context. Is this a switch for the library community? National authority files tend not to work this way, but merging international files raises these same issues – different files might transliterate Cyrillic names differently, for example. The VIAF project is having to deal with this issue. The group believed that Canada, where this issue likely arises a great deal due to its bilingual nature, simply picks one form and goes with it.

Context here could mean many things: 1) show a different form in different circumstances, 2) include contextual information about the person in the record, etc. For #2, work is going on right now (or at least in the planning stages) to minimize how much language-specific content goes into a record. One could then code each field by which language is used in the citation. Library practice already includes vernacular forms in 400 fields – these could then serve as the primary form in a catalog in another language. But the coding doesn't yet distinguish which is really a cross-reference and which would be preferred in some other language, so this might not be as useful as it seems at first.

To achieve the first interpretation of flexible context, a 400 field in an authority record can no longer mean "don't use this one, use the other one." The purposes of the authority file now are for catalogers to justify headings and for systems to automatically map cross-references. Displaying a name form based on context is definitely a new use case for these authority files.
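To make that new use case concrete, here is a minimal sketch in Python of context-dependent display. The record structure, language tags, and the preferred_form function are invented for illustration – this is not actual MARC or VIAF data, just the general idea that variant forms carry a language tag and the catalog picks whichever form matches the user's context:

# Illustrative only: an invented authority record with language-tagged name forms.
AUTHORITY = {
    "id": "example-0001",   # hypothetical identifier
    "forms": [
        {"name": "Tolstoy, Leo, 1828-1910", "lang": "en"},
        {"name": "Tolstoi, Lev Nikolaevich, 1828-1910", "lang": "en-translit"},
        {"name": "Толстой, Лев Николаевич, 1828-1910", "lang": "ru"},
    ],
}

def preferred_form(record, context_lang, fallback="en"):
    """Return the name form tagged for the requested language context,
    falling back to a default when no tagged form matches."""
    for form in record["forms"]:
        if form["lang"] == context_lang:
            return form["name"]
    return preferred_form(record, fallback) if context_lang != fallback else record["forms"][0]["name"]

print(preferred_form(AUTHORITY, "ru"))  # Russian-language catalog display
print(preferred_form(AUTHORITY, "fr"))  # no French form: falls back to English

The point is simply that the variant forms stop being "do not use" references and become alternatives a system can choose among based on the catalog's context.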

Participants wondered, if we add more information to our authority files to make them useful for other purposes, how do we ensure they still fulfil their primary purpose? Should our authority records become biographies? The group reached basic consensus that adding new material to these records won't substantively take away from the current disambiguation function.

The group then turned to privacy issues raised by the expanding functions of authority files. One individual noted that in the past, researchers went to reference books to find information on people. Has this information moved into the public sphere? An author's birth date is often on the book jacket. Notable people are in Wikipedia. The campus phone book is not private information. In the archival community, context is everything. Overall, the group felt we didn't need to worry too much about the privacy issue – for the most part, functionality trumps the privacy concerns. We still need to be careful, but it looks like we're watching an evolution of what we think of as privacy. We no longer assume information is private just because it is public but hard to get to; access control by obscurity is no longer a viable practice. One solution would be to keep some data private for some period of time. Some authors don't want to give a birth date or middle initial for privacy reasons, and might respond better to a situation where this information is stored but not openly public. Participants noted that this additional data is generally not needed to justify headings or cross-references. But with expanded functionality, we'll need expanded data.

One participant wondered, almost rhetorically, whether the authority file should be a list of your works or also a list of your DUIs. In the archival community especially, the latter helps in understanding the person. How far should we go?

The Institutional Repository use case was the next topic of discussion. When gathering faculty publications, it would help to expand the scope of the authority file at the national level. But at the local level, many are already struggling with these issues. Do A&I firms do name collocation? Participants don't think so.

Participants then wondered about the implications of opening up services to contributions from non-catalogers. Some felt we needed to just do it. Others thought opening to humans was a good idea but that buggy machine processes could cause havoc. Even OCLC has a great deal of trouble with batch processing (duplicate detection, etc.), and they're better at this than anyone else in our sphere. For human edits, the same issues apply as with Wikipedia, but our systems don't get as many eyes. What is the right model for vetting and access control? Who is an authorized user?

Participants believed we need to keep the identities hub separate from the main name authority file for a while to work out issues before expanding the scope of the authority file significantly. The proposed discussion model in the report (p. 7) will help with the vandalism issue. The proposal flips the authority file model on its head, with lots of people adding data rather than just a few highly trained individuals. A participant wondered if the NAF in the end becomes the identities hub. Maybe the NAF feeds the identities hub instead.

Discussion then moved to the possibility of expanding this model to other things beyond names. Geographic places might benefit, but probably not subjects - this process contradicts the very idea of a controlled vocabulary. One participant noted a hub model could be used to document current linguistic practice with regards to subjects.

The session concluded with participants noting that authority control is the highest-value activity catalogers do. The data created by this process is the part of our data most useful beyond libraries. We need to coordinate work and not duplicate effort.

Summary of MDG Session, 4-23-09

Article discussed: Nellhaus, Tobin. "XML, TEI, and Digital Libraries in the Humanities." portal: Libraries and the Academy, Vol. 1, No. 3 (2001), pp. 257-277.

The conversation began by addressing terminology issues, as is typical of our Metadata Discussion Group sessions. "Stylesheet" and "boilerplate" were among the unfamiliar terms. One participant noted “expanded linking” is like the “paperless society” - a state much discussed but never remotely achieved.

Discussion of the points in the article began with a participant noting that the TEI is a set of guidelines rather than an official "standard." Is this OK? Can we feel safe using it? The group believed that if there is no "standard" for something, then adopting guidelines is OK, with the idea that something is better than nothing. Is there a standard that would compete with TEI? DocBook is the only real other option, and it hasn't been well adopted in the cultural heritage community. Participants wondered if the library community should push TEI toward standardization.

An interesting question then arose: do the TEI's roots in the humanities make it less useful for other types of material? The problems with drama described in this article would extend to other formats too. What about music? How much should TEI expand into this and other areas?

Discussion at this point moved to how to implement TEI locally. Participants noted that local guidelines are necessary and should be influenced by other projects. Having a standard or common best practices is powerful, but that still leaves lots of room for local interpretation. Local practice is a potential barrier to interoperability – for example, a display stylesheet won't work any more if you start using tags that aren't in the stylesheet. Local implementations have to plan ahead of time for how the TEI will be used. In the library community, we create different levels of cataloging – encoding could follow the same model. Participants noted that we should do user studies to guide our local implementations.

The group performed an interesting thought experiment examining the many different ways TEI could be implemented, considering Romeo & Juliet. Begin with a version originally in print. Then someone typed it into Project Gutenberg so it was on the Web. Then someone figured out they needed scene markers, so someone had to go back and encode for that. New uses mean we need new encoding. How do we balance adding more value to core materials against doing new things? A participant noted that this is not a new problem – metadata has always been dynamic. The TEI tags for very detailed work are there, which makes it very tempting to do more encoding than a project specifies. Take the case of the IU presidents' speeches. Do these need TEI markup, or is full-text searching enough? It would be fascinating and fun to pull together all sorts of materials – primary sources, secondary sources, sound recordings of the speeches. But where is this type of treatment in our overall priorities?

A participant asked the group to step back and consider what TEI can do that full-text searching can't. Some answers posed by the group were collocation and disambiguation of names, date searching for letters, and pulling out and displaying just the stage directions from a dramatic text.
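That last answer lends itself to a small sketch. The Python below pulls stage directions out of a TEI-encoded fragment; the fragment itself is invented for illustration, though the namespace and the stage element are the ones TEI P5 actually uses:

# Sketch: extract just the stage directions from a (made-up) TEI fragment.
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div type="scene">
    <stage>Enter Romeo and Juliet.</stage>
    <sp><speaker>Romeo</speaker><l>But, soft! what light through yonder window breaks?</l></sp>
    <stage>Juliet appears above at a window.</stage>
  </div></body></text>
</TEI>"""

root = ET.fromstring(sample)
for stage in root.iter(f"{{{TEI_NS}}}stage"):
    print(stage.text)

A full-text index could find the word "Enter," but only the markup lets us ask for stage directions as a category.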

We then returned to the notion of drama. It's hard to deal with plays both as literature and as performance. Is this like us treating sheet music bibliographically vs. archivally? Here the cover could be a work of art, the music notation marked up in MusicXML, and the text marked up in TEI. Nobody knew of an implementation that deals with this multiplicity of perspectives well. Something text-based has trouble dealing with time, for example. Participants noted that TEI is starting to deal with this issue now, but it's certainly a difficult problem.

A participant wondered what would happen if we were to just put images up (or dirty OCR for typewritten originals – certainly not all of our material) and mark them up later. This would be the "more product, less process" approach currently in favor in the archival world. It would also be in keeping with current efforts to focus on unique materials and special collections rather than mass-produced and widely held materials.

Participants wondered if Google Book Search and HathiTrust do TEI markup. Nobody in the room knew for absolute certain, but we didn't think so.

The session concluded with a final thought, echoing many earlier conversations by this group: could crowdsourcing (user contributed efforts) be used as a means to help get the markup done?

Summary of MDG Session, 3-28-09

Article discussed: Allinson, Julie, Pete Johnston and Andy Powell. (January 2007) "A Dublin Core Application Profile for Scholarly Works." Ariadne 50.

As usual, the discussion group session began with time to talk about unfamiliar terminology or acronyms/terms from the article that were not fully explained. This month, JISC, UKOLN, and OAI-PMH were covered in this question period.

We then moved to a discussion of whether the list of functional requirements described by this article is really the right one. Topics covered included:
  • Should preservation be on the list? Governments and libraries generally have this as a requirement. Do researchers? The group believed that overall, yes, researchers need preservation but don't understand it as such; they just want to be able to find something they've seen again later.
  • Multiple versions. The preprint, edited version, and publisher PDF are all available and need to be managed. But maybe we don't need to keep them all, just tell users which one they're looking at. The NISO author version standard (in draft maybe?) is setting up a common terminology for us to use. It's important to archive these things. One possible solution: keep the final version, and track earlier versions more as personal papers.
  • What about earlier work products like data sets and Excel spreadsheets? How much in-process work can/do we want to save? Data sets could be used by many different publications. We would need to make sure users can get to the final writeup easily without getting bogged down in the preliminary material. Managing these earlier work products would shift the focus from the writeup to the researcher. Both are important; maybe we deal with them in separate systems. One member brought up Darwin's On the Origin of Species – the text is online and you can see earlier drafts and how the work evolved over time. The work process has long been an interesting area of research that we could promote more. However, it raises issues of rights management and author control. Should we allow the researcher to be in control of deciding what to deposit? We'd have to have them choose while they're alive.
  • Unpublished works have different copyright durations. Are things in the IR published or not? Is a dissertation published or unpublished? Does placing it in a library “publish” it? AACR2 thinks dissertations are unpublished. But does copyright law?
  • What about peer-reviewed status? Is the peer reviewed/non peer reviewed vocabulary we tend to use now good enough? Are there things in the middle? Early versions of a paper won't have gone through the peer review process, and we need to track that one version is peer reviewed and another is not (see the sketch after this list). Individual journal titles are peer reviewed or not, so we can guess a paper's status based on that. But in general we would probably want to get this information from the author – there are columns, etc. in peer-reviewed journals that aren't actually peer reviewed.
  • Participants noted the requirement to facilitate search and browse - not many of our IR systems now do this all that well!
  • A participant asked whether we should be providing access to these types of material by journal/conference. It would duplicate work that others do, but for the preservation function this information is important.
  • Our functional requirements discussion wrapped up with participants noting that “cataloging” in these repositories doesn’t look like cataloging in our OPACs. Is this difference going to bite us later? This article describes data an author could never create. The authors obviously have decided cataloger-created data is worth the time and effort. It would be interesting to hear the rationale behind this decision.
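Before moving on, here is a small Python sketch of the version and peer-review tracking raised in the list above. The field names and version labels are invented for illustration; they are not the actual application profile's vocabulary:

# Illustrative sketch: one scholarly work with several deposited versions,
# each flagged for peer-review status, so a repository can tell users
# which version they are looking at.
from dataclasses import dataclass, field

@dataclass
class Version:
    label: str            # e.g. "author's draft", "accepted manuscript" (assumed labels)
    peer_reviewed: bool
    file_name: str

@dataclass
class ScholarlyWork:
    title: str
    creators: list
    versions: list = field(default_factory=list)

    def latest_peer_reviewed(self):
        """Return the most recently deposited version that went through peer review, if any."""
        reviewed = [v for v in self.versions if v.peer_reviewed]
        return reviewed[-1] if reviewed else None

# Invented example record
work = ScholarlyWork(
    title="An Example Paper",
    creators=["Researcher, A."],
    versions=[
        Version("author's draft", False, "draft-v1.pdf"),
        Version("accepted manuscript", True, "accepted.pdf"),
    ],
)
print(work.latest_peer_reviewed().label)   # prints: accepted manuscript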
We then turned to discussing the minimum data requirements described in the article.
  • Some of the minimum requirements seem very high end and difficult to know.
  • Participants wondered which of the listed attributes don't apply to large numbers of scholarly works. The following were identified: has translation, grant number, references.
  • The group then wondered if the authors had to be so flexible with minimum metadata requirements to allow authors to deposit their own material. Why wouldn’t authors want to do this? Time and effort seem to be big barriers. Even figuring out what version to deposit takes more time than most researchers care to spend.
  • Participants wondered how effective OA mandates are. In discussion, it was noted that they don't make it any easier to deposit, and researchers might think it's still not worth their time. A participant quoted data from one scientific conference that said if you publish with them you have to provide all your data. 50% provided the full data; 20% uploaded an empty file just to meet the "upload something" requirement!
  • Conclusion: better systems are a key to actually collecting and saving this material (a minimal sketch of what a deposit check might look like follows below).
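As a minimal illustration of what "better systems" might mean at deposit time, the Python sketch below checks a deposit against a small set of required fields and defers richer description until later. The required-field list is an assumption for the example, not the profile's actual mandatory elements:

# Assumed minimum fields, for illustration only.
REQUIRED_FIELDS = ["title", "creator", "date", "version_label"]

def missing_fields(deposit: dict) -> list:
    """Return the required fields the depositor has not filled in."""
    return [f for f in REQUIRED_FIELDS if not deposit.get(f)]

deposit = {"title": "An Example Paper", "creator": "Researcher, A.", "date": "2009"}
problems = missing_fields(deposit)
if problems:
    print("Deposit incomplete, please supply:", ", ".join(problems))
else:
    print("Deposit accepted; richer metadata can be added by staff later.")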
The discussion moved to pondering how author involvement in the archiving process is a fundamentally new requirement. We never asked researchers to deposit papers in the University Archives before. How do we decide what’s worth keeping? Should we really preserve all of this stuff? How do we get people to the right stuff? Is this a selection/appraisal issue or a metadata issue? Our final conclusions were that the model described in this article helps with creating more functional systems but doesn’t help with making the system easier to use. Minimum requirements for deposit might just be a first step, but to achieve our greater goals that data would likely need to be enhanced later.