After a bit more testing by some folks on the vufind-tech list, I think the consensus is that we’re going to work with Solr’s DirectUpdateHandler with a DocumentBuilder to construct entries for the index in memory. Once I got some of the more annoying bugs out of the way, folks were quite pleased with the speed with which they were able to create the index. Now, on to the business of writing JUnit tests, field customization, and some refactoring.
Posted tagged ‘vufind’
This is the first of a couple of posts I’ve been meaning to write. There has been a flurry of posts on the Vufind lists about errors when creating the Solr index and about how long indexing takes. I did about 1.8 million records in 10 hours using the PHP script; considering these are getting sent across an HTTP connection, I thought this was pretty good.
Anyway, I had had the thought that using the EmbeddedSolr class to write directly to the Solr index would be faster, but before the thread developed, I hadn’t put much work into it. This week I got motivated and started working on the implementation.
Essentially, this program uses marc4j to skip the separate yaz-marcdump step that converts marc records to marcxml, while also making the creation of the index faster. The essential flow is to first read in a marc file, open a direct connection to the Solr instance, write a marcxml record to disk, then write the same record to the index. I first did this with the EmbeddedSolr and essentially mapped each field in the marcxml file to its corresponding index field for the Solr index. While not 100% finished, I was really pleased with the speed results. I was able to index 10,100 records (I wanted at least one autocommit from Solr in there) in less than 2:00 (I averaged about 1:45).
However, there are some differences in how marc has been implemented, as noted by the folks on the list. I thought that the easiest way to deal with this would be to just use the XSLT stylesheet as the “rules” for transforming the marcxml. This way, if you needed to change the unique id for your resources, you would just go to the XSLT and change which field is getting called out. I figured this would be a bit slower, but I wasn’t prepared for how MUCH slower it was!
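To make the "stylesheet as rules" idea concrete, here is a minimal sketch using only the JDK's built-in XSLT support. The stylesheet, the sample record, and the class and method names are all made up for illustration; this is not the stylesheet that ships with Vufind, and the field names "id" and "title" are assumptions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class MarcXsltSketch {
    // Hypothetical rules: the 001 control field becomes the unique id,
    // and 245$a becomes the title. Changing the id means editing one select.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
      + " xmlns:marc=\"http://www.loc.gov/MARC21/slim\""
      + " exclude-result-prefixes=\"marc\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"/marc:record\">"
      + "<add><doc>"
      + "<field name=\"id\"><xsl:value-of select=\"marc:controlfield[@tag='001']\"/></field>"
      + "<field name=\"title\"><xsl:value-of"
      + " select=\"marc:datafield[@tag='245']/marc:subfield[@code='a']\"/></field>"
      + "</doc></add>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Made-up marcxml record used only to exercise the sketch.
    static final String SAMPLE_RECORD =
        "<record xmlns=\"http://www.loc.gov/MARC21/slim\">"
      + "<controlfield tag=\"001\">12345</controlfield>"
      + "<datafield tag=\"245\" ind1=\"0\" ind2=\"0\">"
      + "<subfield code=\"a\">Example title</subfield>"
      + "</datafield>"
      + "</record>";

    /** Compiles the stylesheet once so it can be reused for every record. */
    static Templates compile(String xslt) {
        try {
            return TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(xslt)));
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not compile stylesheet", e);
        }
    }

    /** Applies the compiled rules to one marcxml record and returns the Solr add XML. */
    static String apply(Templates rules, String marcXml) {
        StringWriter result = new StringWriter();
        try {
            rules.newTransformer().transform(
                new StreamSource(new StringReader(marcXml)), new StreamResult(result));
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not transform record", e);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        Templates rules = compile(STYLESHEET);
        System.out.println(apply(rules, SAMPLE_RECORD));
    }
}
```

Compiling the stylesheet into a Templates object once and creating a cheap Transformer per record avoids re-parsing the XSLT for every record, which is one of the first things worth checking when a transform-per-record loop feels slow.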
First, a note about how I did this…
I used the DirectSolrConnection (at the bottom of the EmbeddedSolr page) and a requestHandler mapped to solr.XmlUpdateRequestHandler in the solrconfig.xml file:
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />
Unfortunately, marc4j’s conversion process requires an OutputStream to write to, so I created a ByteArrayOutputStream to hold the generated XML and used its toString() method to create a new request to solr to add the record to the index.
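The buffering trick looks roughly like the sketch below. To keep it runnable with nothing but the JDK, the StreamSerializer interface is a hypothetical stand-in for whatever insists on an OutputStream (marc4j's writer, in my case), and the XML in main is a made-up payload rather than real converter output.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class XmlCaptureSketch {
    /** Hypothetical stand-in for a writer that only accepts an OutputStream. */
    interface StreamSerializer {
        void writeTo(OutputStream out) throws IOException;
    }

    /** Lets the serializer write into memory, then returns the bytes as a String. */
    static String captureAsString(StreamSerializer serializer) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            serializer.writeTo(buffer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Decode explicitly as UTF-8; a bare toString() uses the platform
        // default charset, which can mangle non-ASCII characters in records.
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String requestBody = captureAsString(out -> out.write(
            "<add><doc><field name=\"id\">12345</field></doc></add>"
                .getBytes(StandardCharsets.UTF_8)));
        // requestBody is what would be handed to Solr as the update request.
        System.out.println(requestBody);
    }
}
```

Note that this builds a fresh buffer and a full String copy of the XML for every record, so it adds some overhead of its own on top of the transform.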
For the same 10,100 records, using this second method, the time hovered around 22:00 to index! I was a little shocked that it was this different. Because of this difference, I’m thinking I need to come up with a better method to allow folks to customize which fields in their marc records map to the different fields in the index.
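For a sense of scale, here is the back-of-the-envelope arithmetic on those two timings. Both are rough averages from my runs (about 1:45 and about 22:00), so treat the results as approximate; the class and method names are just for illustration.

```java
import java.util.Locale;

class IndexThroughput {
    /** Records indexed per second for a run that took minutes:seconds. */
    static double recordsPerSecond(long records, long minutes, long seconds) {
        return records / (double) (minutes * 60 + seconds);
    }

    public static void main(String[] args) {
        double direct = recordsPerSecond(10_100, 1, 45); // direct field mapping
        double xslt = recordsPerSecond(10_100, 22, 0);   // XSLT via DirectSolrConnection
        System.out.println(String.format(Locale.ROOT,
            "direct: %.1f records/s, xslt: %.1f records/s, slowdown: %.1fx",
            direct, xslt, direct / xslt));
    }
}
```

That works out to roughly 96 records per second for the direct mapping versus under 8 for the XSLT route, a slowdown of about 12 to 13 times.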