After a bit more testing by some folks on the vufind-tech list, I think the consensus is that we’re going to work with Solr’s DirectUpdateHandler and a DocumentBuilder to construct entries for the index in memory. Once I got some of the more annoying bugs out of the way, folks were quite pleased with the speed with which they were able to create the index. Now, on to the business of writing JUnit tests, field customization, and some refactoring.
This is the first of a couple of posts I’ve been meaning to write. There has been a flurry of posts on the VuFind lists about errors when creating the Solr index and about indexing speed. I indexed about 1.8 million records in 10 hours using the PHP script; considering these were getting sent across an HTTP connection, I thought that was pretty good.
Anyway, I had had the thought that using the EmbeddedSolr class to write directly to the Solr index would be faster, but before the thread developed, I hadn’t put much into it. This week I got motivated and started working on an implementation.
Essentially, this program uses marc4j to skip the separate conversion from a MARC record to MARCXML via yaz-marcdump, while making creation of the index faster. The essential flow is to read in a MARC file, open a direct connection to the Solr instance, write a MARC XML record to disk, then write the same record to the index. I first did this with EmbeddedSolr, essentially mapping each field in the MARCXML file to its corresponding field in the Solr index. While not 100% finished, I was really pleased with the speed results: I was able to index 10,100 records (I wanted at least one autocommit from Solr in there) in less than 2:00, averaging about 1:45.
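The read side of that flow looks roughly like this — a sketch, not the finished program. It assumes marc4j is on the classpath and a records.mrc file of raw MARC21 data; the field-mapping and Solr-write steps are elided since they depend on your schema:

```java
import java.io.FileInputStream;
import java.io.InputStream;

import org.marc4j.MarcStreamReader;
import org.marc4j.marc.Record;

public class MarcIndexer {
    public static void main(String[] args) throws Exception {
        // Read raw MARC21 records directly -- no separate yaz-marcdump step
        InputStream in = new FileInputStream("records.mrc");
        MarcStreamReader reader = new MarcStreamReader(in);
        int count = 0;
        while (reader.hasNext()) {
            Record record = reader.next();
            // ...map record fields to Solr index fields and add to the index here...
            count++;
        }
        in.close();
        System.out.println("Read " + count + " records");
    }
}
```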
However, there are some differences in how MARC has been implemented, as noted by the folks on the list. I thought the easiest way to deal with this would be to use the XSLT stylesheet as the “rules” for transforming the MARCXML. This way, if you need to change the unique id for your resources, you just go to the XSLT and change which field is called out. I figured this would be a bit slower, but I wasn’t prepared for how MUCH slower it was!
First, a note about how I did this…
I used the DirectSolrConnection class (described at the bottom of the EmbeddedSolr page) and mapped a request handler to solr.XmlUpdateRequestHandler in the solrconfig.xml file:
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />
Unfortunately, marc4j’s conversion process requires an OutputStream to write to, so I created a ByteArrayOutputStream to hold the generated XML and used its toString() method to create a new request to Solr to add the record to the index.
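The capture pattern itself can be sketched with nothing but the JDK: build the Solr `<add><doc>` update message into a ByteArrayOutputStream through a stream writer, then pull it out as a String to hand to the update request. In the real program, marc4j’s MarcXmlWriter plays the role of the XMLStreamWriter here, and the field names are made up for illustration:

```java
import java.io.ByteArrayOutputStream;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class SolrUpdateXml {
    // Build a minimal Solr <add><doc> update message in memory
    // and return it as a String, the same way the indexer captures
    // marc4j's XML output before posting it to Solr.
    public static String buildAddDoc(String id, String title) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        XMLStreamWriter w = XMLOutputFactory.newInstance()
                .createXMLStreamWriter(out, "UTF-8");
        w.writeStartElement("add");
        w.writeStartElement("doc");
        w.writeStartElement("field");
        w.writeAttribute("name", "id");
        w.writeCharacters(id);
        w.writeEndElement(); // field
        w.writeStartElement("field");
        w.writeAttribute("name", "title");
        w.writeCharacters(title);
        w.writeEndElement(); // field
        w.writeEndElement(); // doc
        w.writeEndElement(); // add
        w.flush();
        // toString() yields the body handed to the Solr update handler
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildAddDoc("12345", "Flashman"));
    }
}
```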
For the same 10,100 records, using this second method, the time hovered around 22:00! I was a little shocked that it was that different. Because of this, I’m thinking I need to come up with a better method to let folks customize which fields in their MARC records map to the different fields in the index.
Stumbled across a great resource on the different options for JVM tuning the other day: A Collection of JVM Options. Definitely worth bookmarking if you ever need to do some Java tuning!
There was some discussion on the VuFind list about moving from Tomcat to Jetty. I first wanted to see if it was even possible, so I got the latest nightly build of Solr to see which packages were needed to run the server. I then grabbed the latest Jetty (6.1.5), since the version in Solr’s build was 6.1.3. I packaged the same files that were in Solr’s distribution, dropped VuFind’s schema and config file into Jetty, and fired it up. Voila… it worked like a champ.
The thing I really wanted to know was whether this Jetty version would perform in a similar fashion to Tomcat. To test, I set up two virtualized servers on the same box. Each was set up with the exact same hardware (2GB RAM, 1 processor, bridged 1GB network, running Ubuntu 7.04 server). I also used the same Java tuning on both machines (“-server -Xmx1024m -Xms1024m -XX:+UseParallelGC -XX:+AggressiveOpts“). The only difference between the two was that one ran Tomcat and the other Jetty.
For the test, I indexed our library’s 1.8+ million catalog records on both machines; both chewed through the records in about 9 hours. For the actual testing, I used JMeter to query both systems at the same time using a few scenarios that I thought might plausibly be “real.”
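For anyone repeating this, JMeter’s non-GUI mode is the way to run a saved test plan from the command line (the .jmx and .jtl file names here are placeholders for your own):

```shell
# -n = non-GUI mode, -t = test plan to run, -l = file to log sample results to
jmeter -n -t search-load.jmx -l results.jtl
```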
In the first test, I sent 10 users with 100 queries for the book title “Flashman” to see what happened. I was pretty impressed with the results:
You know, we might get a few more users than just 10 at a time, so I ramped it up to 100 users doing 10 queries. Again, there wasn’t much of a difference.
Now, to really ramp things up: 100 users doing 100 queries.
And, just for kicks, 1000 users with 10 queries.
With median results within a millisecond of each other, Andrew went ahead and swapped out Tomcat in favor of Jetty for its smaller footprint. I have to say that any time I’ve needed to do anything with JSP, I’ve opted for Tomcat, mostly because I know the name, but I think I’m going to keep Jetty on my list from now on! I want to take a closer look at its Ant and Eclipse plugins!
Had a few more notes on running VuFind.
Something that is generally overlooked when setting up a Java application is tuning Java. This can be a daunting endeavor, as the tutorials you find tend to reference things like interpreting p-values and power analysis. If you just want to set an application up, that is a much larger investment of time and effort than is really needed. So, here are some things you probably want to do.
To set the Java ergonomics for server applications, you simply set a new environment variable. For Tomcat, this is CATALINA_OPTS. For development boxes, I tend to make these global variables, but as long as the user account that runs VuFind’s Tomcat instance has CATALINA_OPTS defined, you’ll see the performance boost.
For those who can’t wait, this is what I set for my instance in a virtualized instance of Ubuntu server (Feisty) that runs with 2 GB RAM and a dedicated dual-core x86_64 processor:
CATALINA_OPTS="-server -Xmx1024m -Xms1024m -XX:+UseParallelGC -XX:+AggressiveOpts"
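One common place to put this, assuming a standard Tomcat layout (older Tomcats without a bin/setenv.sh can export it from the startup user’s profile instead):

```shell
# $CATALINA_HOME/bin/setenv.sh -- sourced by catalina.sh at startup if present
CATALINA_OPTS="-server -Xmx1024m -Xms1024m -XX:+UseParallelGC -XX:+AggressiveOpts"
export CATALINA_OPTS
```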
I don’t have hard numbers on the improvement, but there is a noticeable difference in both speed and processor utilization.
Without attempting to rehash the nitty-gritty of JVM ergonomics, you’re basically telling Java to use the server VM, use a statically sized heap (the memory allocated for object storage), use parallel young-generation garbage collection (which divides collection work across processors), and turn on point-release performance optimizations.
For more info on setting up the JVM to be “server-class”, check out the Java Tuning White Paper. While this paper specifically refers to the Java 5 platform, these same options will work if you’ve deployed under Java 6.
I’m a bit more awake now so I will add a few more details for Day 2.
Tuesday was also the day of the System Administration Sharing Session. There was a spirited discussion about the Java WorkFlows client. Aspects of using the Java client are enough to chill the soul of a system administrator. I can’t find my notes on the session, but here is what I recall:
- The Java WorkFlows client uses the Sun Java runtime, which is updated frequently. Sometimes an upgrade breaks WorkFlows; apparently Sun recently issued an update with major changes and didn’t tell anyone. SirsiDynix is talking with Sun about finding a way to get a heads-up on major changes.
- Where the C client handled printing with a dump to LPT, the Java client requires that printing go through Windows print drivers. Consequently, printing is slower, frequently much slower, and the latest print drivers don’t always work better. Not good for circulation activities.
- Under MS Vista, Java updates can only be installed by Administrator. This doesn’t mean someone in the Administrators group; it must be the Administrator account. Yikes! This means every staff station in the library will have to have Java updated by someone from LIT.
- Under Vista, Microsoft removed the old way of handling help files. While the C client will run on Vista, help won’t be available. Yikes again! The yellow question mark is our friend. This means that new staff PCs delivered with Vista will require the Java client.
- The Java client doesn’t support the “house” and “binoculars” searches. There is a multi-line Item Search and Display wizard instead.
- The Java client runs slowly.
The last event of the day for me was the API Sharing Session and API After Hours. Being the API forum moderator, I got to run these two events. The Sharing Session is mostly administrative: we discuss enhancements, procedures, etc. I gave some tips to the newly minted APIers. I also made a plea that enhancements get posted to the forum earlier so that we have time for discussion prior to voting. We also briefly discussed the value of an API wiki, which everyone agreed was a fine idea if we can work out the security details. The System Administration wiki, which I co-manage, is proving successful, and we feel an API wiki would prove equally valuable. I will work with SirsiDynix to explore the possibility.
The API After Hours (aka API Mini-Summit) is more of a birds-of-a-feather session where the die-hards stick around and talk API. Charles S. demonstrated several custom Java reports written for his school system. Pretty nifty. Among other benefits, the reports give us good examples of coding particularly tricky parts of Java client reports.
Shawn C. reprised a presentation she gave earlier, but with more technical details that appealed to the APIers. Shawn figured out how to do a variety of My Account activities outside the OPAC, including social software/networking tools such as tagging, book reviews, shared book lists, and RSS feeds. She also demonstrated how she was able to create staff functions that give staff access to normally client-based activities. Oooh! Bright & shiny. Must have. Finally, Joel H. and David B. volunteered to have questions thrown at them in an attempt to stump the experts. There were not many questions (mostly because I neglected to tell anyone we were doing it), but they were not stumped.