The Catalog Under Scrutiny – Part 1, a look at the OPAC
In a recent article in his online column — CNN Money, Business 2.0: Future Boy — Chris Taylor looked at computer scientist/inventor/futurist Ray Kurzweil's predictions. Kurzweil, inventor of the flatbed scanner, believes that, at the rate technology is accelerating, 20,000 years of progress will be packed into this century. We are on the verge of true artificial intelligence. You can read the entire article here: No aging, robot cars – and radical business plans
So why am I starting this blog post about the catalog with a reference to an article in which someone says that we will have robot-driven cars, nanotechnology and artificial intelligence by the 2020s? As I've been reading about the state of the library catalog, I have been wondering if I will see a significant change in the way we deliver information before I have to hang up my system administrator hat. Conclusion: cautious pessimism with a modicum of wistfulness. Perhaps the technological cavalry is just over the hill.
A lot has been written in blogs and institutional reports about the state of the library catalog and its public face, the OPAC. This post came about in part because I was having difficulty remembering what I read and where I read it, so I decided to start a review where I would have one place to go for links.
Part 1, Relevance Rank or the Lack of It, discusses the importance of relevance ranking in search results.
Our Unicorn ILS does support relevance ranking. It calculates relevance using term frequency — the number of instances of the search term in the indexed fields of the record. The higher the number of occurrences of the search term, the higher the relevance ranking. The main reason we have it turned off is that the sorting of the set of retrieved records is done on the server, and sorting is restricted by the search sort setting: if the number of hits exceeds the limit, relevance ranking is disabled. The search sort limit is designed to avoid a performance hit on the server. So, if we were sure that our users would have results of 800 or fewer records, we'd have no problem. Alternatives: (1) a big honking server that can handle the load; (2) performing the sort on a separate server.
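To make the behavior concrete, here is a minimal sketch of term-frequency ranking gated by a sort limit. This is not Unicorn's actual implementation — the record structure, scoring, and the `SORT_LIMIT` value are all simplified assumptions for illustration.

```python
from collections import Counter

# Hypothetical sort limit; the real Unicorn setting is site-configurable.
SORT_LIMIT = 800

def relevance_score(term, record_fields):
    """Count occurrences of the search term across a record's indexed fields."""
    words = " ".join(record_fields).lower().split()
    return Counter(words)[term.lower()]

def search(term, records):
    """Return matching records. They are relevance-ranked only when the
    hit count stays within the sort limit; otherwise they come back unranked,
    mimicking the server-side restriction described above."""
    hits = [r for r in records if relevance_score(term, r["fields"]) > 0]
    if len(hits) <= SORT_LIMIT:
        hits.sort(key=lambda r: relevance_score(term, r["fields"]), reverse=True)
    return hits
```

With a small result set, records with more occurrences of the term sort to the top; past the limit, the order is simply whatever the server returned.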
Part 2 of the series, The Checklist of Shame, looks at key features common to search engines but often missing from online catalogs. It is a pretty good list. Can you pick out the features that Unicorn does support?
Part 3, The Big Picture, takes a broad, conceptual look at the library catalog. Karen looks at the catalog in terms of once-valid literalisms that are less sustainable now. For example, the first literalism is that the OPAC is still a citation index when users expect full text.
Karen and others point to NCSU’s new catalog interface as an example of how a user’s search experience can be improved. You can see the catalog here: NCSU's Unicorn with Endeca.
NCSU is a Unicorn site, and instead of using iLink or the new web interface, EPS, they worked with the company Endeca to build a new front end to the catalog. It is undeniably nifty. The catalog data is exported to the Endeca interface nightly, so you are not seeing a real-time view of the catalog, but that hardly matters. Putting the catalog data in a product like Endeca allows them to massage the information in ways not possible with the vendor interface. For example, they put a lot of work into relevance ranking and have something like you would expect from Google.
There is one aspect of comparing a search in a library catalog to a search in Google that bothers me. Compare the metadata from your average catalog record with what Google indexes and the library record comes up short. Especially in older cataloging, the information isn’t easily extracted. Here is an example. A history professor is interested in identifying items in the library in his areas of interest that might be suitable for scanning. He gave us the following topics for items published before 1923:
regional history, southern
economic and business history
history: labor and laboring classes
Being the senior system administrator, I delegated the project to the junior system admin, Julie. She had just completed API training, and we figured that this project was going to require command line data extraction. I felt "it would be a good learning experience." Besides, she is a cataloger and thus understands the MARC records. The following paragraphs are based on Julie's four-page after-action report. Not having worn a cataloger hat in many years, I found her report enlightening.
It turned out to be a challenging project, and it made clear the evolution that has occurred in descriptive cataloging. There were hardly any descriptive notes or tables of contents in the records, and fixed field codes (e.g. autobiography) that would have been helpful were not consistently applied. The non-fiction items did contain subject headings, so those were judged the most reliable means of extracting records. Of course, most of these items were cataloged when the average number of subject headings was three.
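The selection logic Julie ended up with can be sketched roughly as follows. The real project ran Unicorn API/command-line extractions against full MARC records; here plain dicts stand in for records, with "year" approximating the 008 fixed-field date and "subjects" the 650 headings. Field names and the matching rule are illustrative assumptions, not her actual scripts.

```python
def select_candidates(records, heading_terms, cutoff_year=1923):
    """Keep records published before the cutoff year whose subject headings
    contain any of the given terms (case-insensitive substring match).

    records: list of dicts with "year" (int) and "subjects" (list of str),
    a simplified stand-in for parsed MARC records."""
    picks = []
    for rec in records:
        if rec["year"] >= cutoff_year:
            continue  # professor wanted pre-1923 (public domain) items only
        subjects = [s.lower() for s in rec["subjects"]]
        if any(term.lower() in s for term in heading_terms for s in subjects):
            picks.append(rec)
    return picks
```

Even this toy version shows why subject headings were the linchpin: with no notes or contents fields to search, the 650s were the only place the professor's topics could reliably surface.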
The project required a careful analysis of LCSH to match the professor's terms to the headings. Consider the term "southern." What is southern? A scope note for Southern States does identify 13 states. Unfortunately, when counties and cities are referred to, the state abbreviation is used, while in other headings the state name is spelled out. The queries had to be run for each variation. We were amused to see that the abbreviation for Arkansas, Ark., would also retrieve records referring to the Ark of the Covenant.
The historical term Reconstruction posed another unexpected problem: it is not unique to the United States, or even to the idea of post-war history. The term had to be combined with other terms to eliminate results about the construction industry and the like.
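The variant-and-exclusion problem above can be sketched in a few lines: every place name had to be queried in each form it takes in the headings, and ambiguous terms like Reconstruction needed NOT-style filters. The state forms and exclusion words below are illustrative examples, not the full lists from the project.

```python
# Illustrative map of heading forms; LCSH spells states out in some headings
# and abbreviates them in county/city headings.
STATE_FORMS = {
    "Arkansas": ["Arkansas", "Ark."],
    "Mississippi": ["Mississippi", "Miss."],
}

def heading_queries(state):
    """Return every form of a state name that must be searched separately."""
    return STATE_FORMS.get(state, [state])

def matches(heading, term, exclude=()):
    """True if the heading contains the term and none of the excluded phrases,
    mimicking the AND NOT filtering used to screen out false hits."""
    h = heading.lower()
    return term.lower() in h and not any(x.lower() in h for x in exclude)
```

The same naive substring matching also explains the Ark of the Covenant surprise: an abbreviation query has no way of knowing which "Ark" it has found.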
Each of the terms that the professor gave us had to be tested to find the best matches in LCSH. It is safe to say that without a working knowledge of LCSH, it would not have been possible to build the bibliography for the professor.
The best search engine in the world won't help if the information isn't there or if it is represented in terms with which the average searcher is unfamiliar. This will be an interesting challenge for libraries as the ILS gets more sophisticated, with better searching and relevance ranking.
In the next installment, I'll take a look at what is being written about the catalog itself, starting with the concept of an open-source ILS.