Inside Skiviez: Catalog Listings and Search Queries

Last Tuesday, I rolled out a new release of the Skiviez Web site. It is the most significant technical change at Skiviez since 2007 and brings with it a lot of improvements in how it operates against our company’s 10-year-old database.

One of these new features is a fast, faceted product catalog. Faceting is the technical name for something you’ve seen in many e-commerce stores like Newegg, Zappos, and Lowe’s: a list of links that lets you quickly drill down to the content that you’d like to see.

Trying to do this kind of thing efficiently in a traditional relational database is not going to happen unless you want to start adding and removing indexes on the fly or go full-on Entity-Attribute-Value, which somewhat defeats the purpose of using a relational database: you lose efficient indexing, you lose type information, and you lose, well, the relations.

Instead, Skiviez now uses Solr–an open source search platform that is based on Lucene, a text search engine library–for managing catalog listing pages and search pages.

Solr from 10,000 feet

Solr is not relational. It doesn’t have tables. The schema is flat; all “documents” in the Solr store share the same schema. From a relational perspective, you could think of it as one table with many columns, except that not every row in the table uses all of the columns. (This isn’t how it internally works at all, but it’s a useful metaphor.)
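To make the metaphor concrete, here’s a sketch of what two documents might look like in Solr’s XML update format (the values are made up; the id and type conventions are ones I’ll describe below):

    <add>
        <doc>
            <field name="id">BRAND-127</field>
            <field name="type">BRAND</field>
            <field name="name">Some Brand</field>
            <!-- no product-only fields here; a brand simply omits them -->
        </doc>
        <doc>
            <field name="id">PRODUCT-1029</field>
            <field name="type">PRODUCT</field>
            <field name="name">Some Boxer Brief</field>
            <field name="salesVolume">42</field>
        </doc>
    </add>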

Some fields are “stored,” which means you can get back the data that you put in. Some fields are “indexed,” which means that you can query on them directly and use them for faceting. And some fields are “multivalued,” which means that they can hold more than one value at a time. For example, our <sizeInStock> field can hold S, M, and L values all at once for a particular product, and just L for another.
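In the schema itself, “multivalued” is just another attribute on the field definition. Our actual declaration may differ, but it looks something like this:

    <field
        name="sizeInStock"
        type="string"
        indexed="true"
        stored="true"
        multiValued="true" />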

Why would you ever want a value to not be “stored”? Why add data that you can’t get back? Well, in Solr, storing and indexing are distinct concepts, so you can have a value that is not stored, just indexed. A good example from the Skiviez Solr schema would be the salesVolume field, which indicates how many units of a particular item were sold in the past week:

		<field
			name="salesVolume"
			type="sint"
			indexed="true"
			stored="false" />

This is just an integer, a number of units sold. In my UI, I’m not going to ever actually display that number–that would reveal a little too much information about our business. But I still want our users to be able to sort catalog listings by “Top Selling” products. By creating an index, we enable sorting; by disallowing storage, we save some space.
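On the query side, that sorting is just a sort parameter on the request. A sketch (the select handler and parameter names are standard Solr; the rest is illustrative):

    http://solr.example.com/select?q=*:*&fq=type:PRODUCT&sort=salesVolume+desc&rows=20

Solr will happily sort on an indexed field; it just can’t hand back the raw numbers, because they were never stored.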

The Skiviez Solr schema defines a “type” field, some fields that are shared among all document types, and a few type-specific fields. This works well because the only intended use of Solr is catalog listings and searching; if I had to store more disparate kinds of data, I might create several Solr schemas.

<!--
	This field is used to indicate the type of data that is stored within
	the document. Since the Skiviez Solr instance represents a variety of
	different types, this field helps to indicate what kind of data you are
	looking at.
 
	Types that are currently understood by the Web site include:
 
		BLOGARTICLE
		BRAND
		LINE
		PRODUCT
		PRODUCTSUMMARY
		PROMOTION
		STYLE
		WEBPAGE
-->
<field
	name="type"
	type="string"
	indexed="true"
	stored="true"
	required="true" />

Why is this useful? Well, the search portion of the Skiviez Web site can use the <type> field and the fields that are used by all of the different document types to quickly search a wide variety of sources.

For those cross-type searches, we’re looking at our <type>, <urlKey>, <name>, and <image> fields. The Web site can use this to quickly sort information and build links to content.

The PRODUCT type uses a wealth of product-specific fields, however, and when we query to display catalog pages, we tell Solr to only return documents of type PRODUCT. I could have created a second Solr schema for this purpose, but why bother?
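The difference between the two kinds of queries boils down to a filter. Two sketches, not our exact queries:

    http://solr.example.com/select?q=boxers&fl=type,urlKey,name,image
    http://solr.example.com/select?q=*:*&fq=type:PRODUCT&facet=true&facet.field=sizeInStock

The first searches across every document type and asks for only the shared fields; the second restricts itself to PRODUCT documents and asks for facet counts on a product-specific field.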

Getting data into Solr

The traditional SQL database is still the “authoritative” data store, and the Solr indexes are read-only snapshots of that data. This means that the data coming from Solr is always slightly stale, so I had to ask myself:

  • How stale is too stale?
  • When do I value speed of querying over staleness?

Part of the new Skiviez Web site is a Windows service that I call the “Worker.” It uses Quartz.NET to execute C# IJob implementations periodically. You can think of them as traditional scheduled tasks in Windows; the only difference is that I am explicitly managing them in code, using the same object-oriented model of our domain in those jobs, and, as long as the service is installed, I don’t need to worry about configuring scheduled tasks.

Every three hours, one of those IJobs that gets executed is the RefreshSolrIndexesJob, and all that job does is ping an HttpWebRequest over to http://solr.example.com/dataimport?command=full-import, where solr.example.com is replaced with the FQDN of our internal Solr server. This is because I use Solr’s built-in DataImportHandler to actually suck in the data from the SQL database; the job just has to “touch” that URL periodically to make the sync work. Because the DataImportHandler commits the changes periodically, this is all effectively running in the background, transparent to the users of the Web site. And because the Skiviez product catalog is reasonably small (a few thousand items), we can blow away the whole Solr index and re-build it in fewer than two minutes.
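The job itself is nearly trivial. Here’s a sketch of what it can look like, assuming the Quartz.NET 1.x IJob signature:

    using System.Net;

    using Quartz;

    public class RefreshSolrIndexesJob : IJob
    {
        public void Execute(JobExecutionContext context)
        {
            // Touching this URL asks the DataImportHandler to rebuild the
            // index from the SQL database in the background.
            var request = (HttpWebRequest)WebRequest.Create(
                "http://solr.example.com/dataimport?command=full-import");
            using (var response = request.GetResponse())
            {
                // The response body doesn't matter; a successful response
                // means the import has been kicked off.
            }
        }
    }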

There’s also a function in our backend application that allows employees to trigger the index rebuilding immediately; this can happen when new product arrives in the warehouse and we want to get it up on the Web site right away.

The DataImportHandler is built into Solr, and configuring it is a little confusing because it uses some strange terminology. It just takes an XML configuration file, and whenever you ping its request handler, it performs synchronization tasks based on what has been specified in its configuration.

      <entity
        name="brands"
        dataSource="undiesDatabase"
        transformer="TemplateTransformer"
        query="
            SELECT
                b.ID AS BrandId,
                b.[Name] AS BrandName,
                b.Description AS BrandDescription,
                b.UrlKey AS BrandUrlKey,
                CASE b.Active
                  WHEN 'Y' THEN 'True'
                  ELSE 'False'
                END AS BrandIsActive
            FROM na.brands AS b
            ORDER BY b.[Name];">
        <field column="id" template="BRAND-${brands.BrandId}" />
        <field column="type" template="BRAND" />
        <field name="identity" column="BrandId" />
        <field name="name" column="BrandName" />
        <field name="description" column="BrandDescription" />
        <field name="isActive" column="BrandIsActive" />
        <field name="urlKey" column="BrandUrlKey" />
      </entity>
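(For context, that <entity> element doesn’t live alone; it sits inside a skeleton roughly like this, with the connection details elided:)

    <dataConfig>
        <dataSource name="undiesDatabase" ... />
        <document>
            <entity name="brands" ...>
                ...
            </entity>
            <!-- more entities for products, promotions, and so on -->
        </document>
    </dataConfig>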

I say the terminology is confusing because of how <field> elements behave. If only the @column attribute is specified, it is both the name of the Solr field and the name of the column in the SQL result set. But if the @name attribute is also specified, then @column is the name of the column in the result set and @name is the name of the Solr field. What trips you up is that some <field> elements don’t pull directly from the result set at all, instead relying on a “transformer.” You would expect to specify just a @name for those fields, but specifying just @column is what’s correct. In the above example, the TemplateTransformer is used on the id field to format Solr identities in the form of BRAND-127, where 127 is the SQL database’s primary key for the brand.
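Pulling two lines back out of the example makes the contrast easier to see:

    <!-- @name and @column together: the result set column BrandName feeds the Solr field "name" -->
    <field name="name" column="BrandName" />

    <!-- @column alone: "type" is the Solr field, and the TemplateTransformer supplies the value -->
    <field column="type" template="BRAND" />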

(And yes, attributes can have newlines in them. I’m not sure how I went through years of programming without realizing this, but once I realized that I could do it, it made reading those SQL queries much easier! This example is by far the shortest query for pulling Solr data from the SQL database; the one for products is over 200 lines long.)

Querying data: how stale is too stale?

This does mean that information in the Skiviez product catalog can be up to three hours stale. A user might click a link for “Medium In Stock (3)” on the catalog page (since this kind of faceted data is generated by querying Solr) but then see on the product detail page that no mediums are in stock (since on that page, the quantity information is one of the few things that is not cached but queried directly against the database). This is annoying, but generally rare in our particular scenario (we are a reasonably small business and not that high traffic), and it will be fixed up within three hours anyway when we rebuild the whole index from scratch, so I have accepted it as a reasonable trade-off.

(If I really wanted to solve this problem, there are a few approaches that I could take. I could use domain events to fire off partial updates to the Solr index whenever a Save/Update operation occurred on an Item or ItemGroup in the Skiviez domain model, or I could insert a record into a table named, say, dbo.IdentitiesOfStuffThatNeedsUpdatingInSolr, and have an IJob that reads that list and executes partial updates every minute. And even if I did these things, I’d still probably want to do a periodic “blow it all away and refresh” in case one of those partial updates failed in the background.)
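If I ever do it, a sketch of that second approach might look like the following; the queue table name is from above, but the column name, connection string, and everything else are hypothetical:

    using System.Collections.Generic;
    using System.Data.SqlClient;

    using Quartz;

    public class PartialSolrUpdateJob : IJob
    {
        public void Execute(JobExecutionContext context)
        {
            var identities = new List<int>();

            // Hypothetical queue table column; the connection string is elided.
            using (var connection = new SqlConnection("..."))
            {
                connection.Open();
                using (var command = new SqlCommand(
                    "SELECT ItemId FROM dbo.IdentitiesOfStuffThatNeedsUpdatingInSolr",
                    connection))
                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        identities.Add(reader.GetInt32(0));
                    }
                }
            }

            // For each identity: re-read the item from SQL, rebuild its Solr
            // document, and POST it to Solr's update handler; an <add> with an
            // existing id replaces the old document. Then clear the queue rows.
        }
    }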

As for querying this data from Solr, there are a few approaches that I could have taken. One is to hide Solr’s existence entirely behind the methods of a repository-like class: the Get*() methods would access Solr, and the Create/Update/Delete methods would access the database. I didn’t like this approach because my Solr schema is already shamelessly tailored to the UI that will be accessing that data, as it should be–I’ve already made the decision to use Solr to provide easy faceting, sorting, and fast display of information, so I might as well use it to its fullest extent. This means making it explicit in code as to when I mean to access Solr and when I mean to access the up-to-date, non-cached database object.

In my case, I ended up using NHibernate to do the CRUD access (e.g., loading an ItemGroup, futzing with its pricing rules, and then saving it back), forgoing the repository pattern because I don’t typically see its value when NHibernate and its mappings are already abstracting the database. (Sometimes abstracting out NHibernate is useful, however, as we’ll see below.)

When querying for data, I know pretty well if I’m using it for catalog-oriented purposes (in which I care about speed and querying features) or for displaying in a table on a back-end administrative application (I care about currency). For querying on the Web site, I have an interface called IListingQuery. It has a Search() method that accepts a ListingRequest where I define some parameters–selected facets, search terms, page number, number of items per page, etc.–and gives back a ListingResponse–remaining facets, number of results, the results on this page, etc. This interface is pretty boring stuff.
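In rough shape, it looks something like this; I’m abbreviating, so the member names are approximations rather than the exact classes:

    using System.Collections.Generic;

    public interface IListingQuery
    {
        ListingResponse Search(ListingRequest request);
    }

    public class ListingRequest
    {
        public string SearchTerms { get; set; }
        public IDictionary<string, string> SelectedFacets { get; set; }
        public int PageNumber { get; set; }
        public int ItemsPerPage { get; set; }
    }

    public class ListingResponse
    {
        // Facet value to remaining document count, simplified here.
        public IDictionary<string, int> RemainingFacets { get; set; }
        public int TotalResults { get; set; }
        public IList<ListingItem> Items { get; set; }
    }

    public class ListingItem
    {
        // A simple DTO: urlKey, name, image, price, and so on.
    }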

Where it gets interesting is that the implementation of IListingQuery that gets dependency-injected into my Web site’s ProductsController uses a list of IListingQueryStrategy implementations underneath. The default strategy, the SolrListingQueryStrategy, hits Solr directly via a plain old-fashioned HttpWebRequest and parses the XML in the HttpWebResponse (an approach that, in my humble opinion, is much easier than wrestling with some of the Solr client libraries).

If the Solr-based strategy throws an exception or vomits for some reason, then the DatabaseListingQueryStrategy runs next and hits the database directly–although it ignores some parameters of the ListingRequest, like faceting or advanced text searching, since doing that in the database is inefficient and is the whole reason I am using Solr in the first place. The idea is that usually Solr is answering my search requests quickly and in their full-featured glory, but if something blows up and Solr goes down, then the catalog pages of the site can still function in “reduced-functionality mode” by hitting the database directly with a limited feature set. (On the Skiviez site today, you can tell this has happened when the filters list on the left-hand side comes up empty and search parameters are ignored.)
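The fallback itself is nothing fancier than a loop over the strategies. A sketch, with the composite’s name being my own abbreviation:

    using System;
    using System.Collections.Generic;

    public interface IListingQueryStrategy
    {
        ListingResponse Search(ListingRequest request);
    }

    public class StrategyBasedListingQuery : IListingQuery
    {
        private readonly IEnumerable<IListingQueryStrategy> strategies;

        // Registered in order: Solr first, then the database fallback.
        public StrategyBasedListingQuery(IEnumerable<IListingQueryStrategy> strategies)
        {
            this.strategies = strategies;
        }

        public ListingResponse Search(ListingRequest request)
        {
            foreach (var strategy in this.strategies)
            {
                try
                {
                    return strategy.Search(request);
                }
                catch
                {
                    // Swallow and fall through to the next strategy; real code
                    // would log this so we notice that Solr has gone down.
                }
            }

            throw new InvalidOperationException(
                "No listing query strategy could satisfy the request.");
        }
    }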

(The explosion of classes surrounding this feature is mostly due to testability. For example, Solr works by having long, complicated GET requests sent to it, so I’ve pulled the URI-building functionality out into its own class; it serves a single purpose and can be tested as such. The actual SolrListingQueryStrategy implementation is very short, with most of the work delegated to other classes within that folder.)

Conclusions and delusions

What is important is that I have made explicit in code that this is a search–by using IListingQuery instead of NHibernate–so the database-based strategy can take some liberties in ignoring some of the search parameters without worrying about affecting some of its down-level callers too severely. The decision to perform a query against a possibly-stale data store versus the authoritative data store has been made explicit–if I want fast, possibly stale data with advanced search features, I use IListingQuery. If I want slow, up-to-date data with insert/update/delete capability, I use NHibernate’s named queries. And if I make a change in the SQL database, I know that the out-of-process Worker service will update Solr eventually, making things eventually consistent.

The end result? Fast catalog pages for our customers that gracefully fall back to the old behavior when something doesn’t work.