Thursday, June 5, 2008

Summary of MDG session, 5-27-08

Article discussed: Hagedorn, Kat, Suzanne Chapman, and David Newman. (July/August 2007) "Enhancing search and browse using automated clustering of subject metadata." D-Lib Magazine 13, no. 7/8. http://www.dlib.org/dlib/july07/hagedorn/07hagedorn.html

The session began with a brief explanation of the methodology employed by this experiment and the OAI-PMH protocol, as it may not have been clear to those who don’t deal with this sort of technology on a regular basis. After this introduction, discussion turned to why the Michigan “high-level browse list” was chosen for grouping clusters rather than a more standard list. The group realized the value of a short, extremely general list for this purpose, and noted that our own Libraries use a similar locally-developed list. Most standard library controlled vocabularies and classification schemes have far too many top terms to be effective for this sort of use. It was noted that choosing cluster labels, if not the high-level grouping, from a library standard controlled vocabulary would promote interoperability of this enhanced data.
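For readers who don't work with OAI-PMH regularly: a harvest request is an ordinary HTTP GET with a few query parameters. Here is a minimal sketch in Python of how such a request URL is put together. The `ListRecords` verb, the `oai_dc` metadata prefix, and the `resumptionToken` argument come from the OAI-PMH specification; the base URL and the function name are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (illustrative helper)."""
    if resumption_token is not None:
        # resumptionToken is an exclusive argument: it is sent with the verb only
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


# Hypothetical repository address, not a real data provider
print(list_records_url("https://repository.example.edu/oai"))
```

An aggregator would issue a request like this to each data provider, follow any resumption tokens returned, and collect the resulting Dublin Core records for later processing such as clustering.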

The question of quality control then arose: the article described one person performing a quality check on the cluster labels – this must have been an enormous task! The article mentioned mis-assigned categories that would have been found with a more formal quality review process. Have they thought about how they would fix things on the fly – features like “click here to tell us this is wrong”? Did the experiment designers talk to catalogers or faculty as part of the cluster labeling process? Who were the colleagues they asked to do the labeling?

Is their proposal to not label the clusters at all, but to just connect to the high-level browse categories, a good one? The group posited that the high-level browse used the campus structure of majors, rather than the organizational structure of the university. (This is the way the IU Libraries web site is structured.) In this case, the subcategories are more meaningful than the main categories, so at least this level would likely be needed.

The discussion group noted evidence of campus priorities in the high-level browse list, for example that the arts and humanities seemed to be under-represented and lumped together while the sciences received more specific attention. Did this make a difference in the clustering too? As noted in the article, titles in the humanities can be less straightforward than in other disciplines, making greater use of metaphors. What do the science records have that humanities records don’t? Abstracts, probably – anything else? Perhaps it’s just that the titles were more specific. Do science subject headings contain more information? Might description in humanities collections be more varied than the language in the sciences? Many possibilities were presented but the group wasn’t sure which would really affect the clustering methodology.

The group then wondered whether the humanities/sciences differences noted in this article would show up in a single institution, or whether they appear in OAIster only because different data providers tend to focus on one or the other, so that the difference is really between data providers rather than between disciplines. The group noted (as a gross generalization) that the humanities tend to be more interested in time period, people, and places, whereas the sciences are more interested in topic.

Would the clustering strategy work locally and not just on aggregations? The suggestion in the article that results might improve if run on just one discipline at a time suggests it might. In this case, clusters would likely be more specific. Perhaps an individual institution could employ this method on full text, and leave running it on metadata records alone to the aggregators. It would be interesting to find out if there’s a difference in effectiveness of this methodology on metadata records for different formats, for example, image vs. text.

The group noted the clustering technique would only be as good as the records from the original site. What if context were missing? (the “on the horse” problem) Garbage in, garbage out, as they say. We understood why the experiment only used English-language records, but it would be interesting to extend this.

The clustering experiment was run using only the data from the title, subject, and description fields. Should they use more? Why not creator? This is useful information. Was it because clusters would then form around creators, which could be collocated using existing creator information? The stopword list was interesting to the group. It made sense why terms such as library and copyright were on it, but there are resources about these things, so we don’t want to artificially exclude them. What if the stopword list were not applied to the title field?
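The question about exempting the title field from the stopword list can be made concrete with a small sketch. This is not the article's code: the word lists, the sample record, and the function names are all hypothetical. It assumes two kinds of stopwords: function words that are always dropped, and repository "boilerplate" terms (such as library and copyright) that are dropped from subject and description but optionally kept in the title.

```python
import re

# Hypothetical lists; the article's actual stopword list is not reproduced here.
FUNCTION_WORDS = {"the", "of", "and", "a", "by"}
BOILERPLATE = {"library", "copyright", "digitized"}


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def record_tokens(record, keep_boilerplate_in_title=True):
    """Collect clustering input from the title, subject, and description fields."""
    tokens = []
    for field in ("title", "subject", "description"):
        words = [w for w in tokenize(record.get(field, "")) if w not in FUNCTION_WORDS]
        if not (field == "title" and keep_boilerplate_in_title):
            # Outside the title, boilerplate terms are treated as stopwords
            words = [w for w in words if w not in BOILERPLATE]
        tokens.extend(words)
    return tokens


# Made-up record about copyright and libraries
record = {
    "title": "Copyright and the Library",
    "subject": "Copyright law",
    "description": "Digitized by the library. Copyright 2007.",
}
print(record_tokens(record))                                   # ['copyright', 'library', 'law']
print(record_tokens(record, keep_boilerplate_in_title=False))  # ['law']
```

With the title exempted, a resource that is actually about copyright keeps its defining terms, while the same words appearing as rights statements in the description are still filtered out.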

The discussion group wondered how these techniques relate to those operating in the commercial world. Amazon uses “statistically improbable phrases” which seems to be the opposite of this technique – identifying terminology that’s different rather than the same between resources. What about studies comparing these automatic methods to user tagging? No participants knew of such a study in the library literature, but it was noted there might be information on this topic in the information retrieval literature. It would be interesting to compare data from this process to the tags from users generated as part of the LC Flickr project.

The article described the overall approach as attempting to create simple interfaces to complex resources. Is this really our goal? We definitely want to collocate like resources. The interface in the screenshots didn’t seem “Google-style” simple. The group noted that in the library field many believe simple interfaces can only yield simple answers and that people searching with simple techniques are generally just looking for something rather than pursuing a comprehensive research goal. This article doesn’t have in its scope a discussion as to whether this is true. One big problem is that the article never defines its user base, and different user bases employ different search techniques.

The discussion group believed that browseability, as promoted by the clustering technique, is a key idea. With a good browse, the interface can provide more ways to get at resources, and then they are more findable. Hierarchical information can be a good way to get users to resources. With the experiment described in this article, the hierarchy is discipline/genre. Would retrieval improve if we pulled in other data from the record to do faceted browsing? Would this work better for humanities rather than science? Do we need to treat the disciplines differently?

Discussion group participants noted that “this isn’t moonwalking,” meaning that this technique looks promising. It needs some tweaking, but the technique hasn’t promised the moon – it’s not purporting to be a be-all, end-all solution. It’s just something we can do, as one part of the many other techniques we use. Can a simple, Google-style interface eventually work for intensive research needs on this data? Or should it? Should the search just lead them to a seminal article and then they citation chase from there? These are interesting questions.

The group then wondered if the proposal to recluster only every few years was a good one. They would certainly need to do it when getting big new chunks of data that are dissimilar to what’s already in the repository. A possible method would be to randomly test once per month to see if clusters are working out well.
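The monthly spot check could be as simple as drawing a random sample of clusters for a person to review. A minimal sketch, assuming clusters are identified by integer IDs; the function name, the sample size, and the use of the month as a random seed are illustrative choices, not anything described in the article:

```python
import random


def sample_clusters_for_review(cluster_ids, k=20, seed=None):
    """Pick k distinct clusters at random for a manual quality check."""
    rng = random.Random(seed)  # seeding makes a given month's draw reproducible
    ids = list(cluster_ids)
    return sorted(rng.sample(ids, min(k, len(ids))))


# Hypothetical setup: 500 clusters numbered 0-499, reviewed in June 2008
june_sample = sample_clusters_for_review(range(500), k=20, seed="2008-06")
print(len(june_sample))
```

If reviewers find that a large share of the sampled clusters no longer have sensible labels, that would be a signal to recluster sooner than the every-few-years schedule.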

The session ended with some more philosophical questions. Why should services like OAIster exist at all if Google can pick these resources up? Is this type of service beneficial for resources that will never get to the top of a Google search in their native environments? What would happen if one were to apply these techniques to a repository with a more resource-based rather than subject-based collection development policy?

1 comment:

Kat said...

I'm thrilled that our paper was a topic for discussion!
I thought I'd provide some comments, more explanatory than anything else.

We chose the High-Level Browse because it is a local instantiation of a classification, as you noted. One simple classification used locally means we can integrate and share more easily in-house. That was the hope at least-- that if we used this classification, we'd be able to integrate our system better inside the library. We haven't gotten a chance to test this to date for multiple reasons, one of which is that HLB will likely be modified soon.

The quality checking we did was not in fact an enormous task. There were 500 clusters generated, and many of these were very clearly "junk" topics, i.e., topics for which we couldn't generate a clear word/phrase label. I pulled together 5 folks from our digital library department to help me "junk" topics and come up with labels for the rest. In point of fact, this was so much fun for my colleagues, I didn't even have to nudge them once to finish! My colleagues all had different subject backgrounds, so we covered the basic large subject disciplines. We didn't have a chance to talk to catalogers, much as we would have wanted to (and if we didn't say that in the paper explicitly, we should have) because our time limit was extraordinarily short. We performed everything you read in literally 2 months' time.

We did label the clusters, plus we connected the clusters to both the top level HLB categories and the second-level HLB categories. Apologies if this wasn't clear-- see Figure 9 for more description.

We thought about using creator as a field for use in the experiment, but the clustering around creator, as you mentioned, was going to be problematic for us so we decided against it.

I believe that "simple interface" was our vain hope that OAIster could someday become that, whether that is Google-like or not. (Perhaps we're on our way now, but more on that later...) Clustering was an attempt to create a simple way into a complex system. In terms of our user base, I believe we've always assumed it is for researchers and scholars, but that those researchers and scholars could be tweens. It has been extremely difficult to gather this information to date.

And last, but not least, we just did some experiments to determine how much of our data is in the Google search index. Not as much as you'd expect-- about 30%. We're refining our experiments before we publish (likely in DLib this fall), but I think this points to the need for aggregations like OAIster.

Cheers,
-Kat