dw_dev | Data and query structure for faceted styles search

Entry tags:

styles

Data and query structure for faceted styles search

I have been researching this, and while I've found a lot of articles on how to set up faceted searching using existing search engines (mostly Solr, though a few with Sphinx), and a bunch of articles on frontend design for faceted search.... there is not a lot on optimizing data structure or queries that I can find. And I know basically nothing about code optimization so - tossing this out here!

Data Structure - we already sort of have a faceted data structure for styles in the forms of categories. There is a great big SQL table (don't ask me which one, I don't remember) but what it boils down to is something like this (assuming more categories than we actually have live):

Style ID - Category ID
Style 1 - Color: Red
Style 1 - Accessiblity: light on dark
Style 1 - Base: Practicality
Style 2 - Color: Blue
Style 2 - Base: Modish
Style 2 - Accessibility: Muted
Style 2 - Accessibility: Dark on Light

A lot of the tutorial/design stuff seems to expect a structure where facets are split, ie:

Style ID - Color - Base Layout - Accessibility
Style 1 - Red - Practicality - light on dark
Style 2 - Blue - Modish - Muted

...which then brings about the obvious problem that not all tags within a facet will be mutually exclusive. The second one seems like it might have a slight speed advantage in querying (because queries over a single facet don't need to look at all the data), but I'm not sure how to handle multiple tags on an item within one facet (aka polyheirarchy according to the internet), and I'm not sure if the increase in speed would merit having to rewrite the entire category backend, instead of just writing a new interface for it.

Query Structure - This is the point where I am running on naive newbie programmer logic! Thus, I can think of a simple solution, but I have no idea if the simple solution is the best solution. Basically, facets are a fancy interface for constructing long AND/OR queries. So - write function for searching categories by AND, write one for searching by OR, call in sequence as necessary. I have no idea if this makes sense in the real world.

Tragically the already existing options are 'use Solr', 'use something written on a Java server backend', or 'use a client-side Javascript library (which won't work with JS off and will probably not work on a 1300+ collection, but we didn't stress-test it to see)'

Flat | Top-Level Comments Only

Found a Stack Overflow question here that linked to this book. I think it might be what you're looking for?

Also, if there is the ability for Sphinx to do this, that might be something to consider, since we do use Sphinx for other things already. (The set up might be a bit difficult for you, though.)

The Stack Overflow link isn't working :( But I will see about getting a copy of that book, it looksn like it might be handy.

This was the biggest post I found on using Sphinx for faceted searches, though apparently Ravelry uses it too. It's possible we could solve the polyheirarchy problem by having the backend facets split down into mutually exclusive catagories (dark vs light, muted vs bright, dark on light vs light on dark) and then just lump them together on the front end as 'Accessibility'

Also wow I forgot I was going to make the tables all even and monospaced. GO ME.

I have no substantive contribution to make but I wanted to say you are awesome and this is amazing.

<333333

It will be more amazing when I get it to work (and damnit I will make that a 'when' and not an 'if')

Of course you will! <3 waves pompoms

Solr is actually pretty low overhead to set up if we decide we want to use it. It wouldn't conflict with anything we already have; it's basically a search engine index which sits beside your existing infrastructure. For example, in the digital library we use at work, our old front end queried the database directly, while our new front end queries a solr index which is optimized simply for returning search index results, and that provides the links into the actual data. I'm pretty sure it works very similarly to Sphinx, but there's no reason they can't sit next to each other.

one advantage of including solr is because it comes out of the Apache lucene project, it is now getting very widely used, and a lot of open source software is being written assuming it's the backend.

Whether we stick with Sphinx or add Solr, we probably should do these queries against an index,and nothing directly against the database. (Which seems to be what you are saying as well but I just wanted to say it explicitly.)

... I'm kind of apologetic that I started you down this path since it had clearly gotten complicated. :/ (By the way, I had wanted to show you an example of facets using checkboxes so that you could have multiple selectors. If you go to the University of Georgia library catalog and search for any term, you will see that the results page has checkbox-based facets.)

pft, well, I knew it was going to be complicated going in. So you've actually been helpful in giving me the language I need to research and figure out exactly what dimensions of 'complicated' I'm looking at.

I do sort of get the idea of running against an index rather than the actual database, though I will confess I am not fully clear on what the distinction is yet.

So -- this is an oversimplification, and I'm not expert -- a dedicated index is basically a special purpose database designed for nothing more than lightning-fast retrieval of search terms. It's not a general purpose db like mysql, but instead uses [hand wavy CS stuff I'll probably have to learn in grad school] to be zippy at text-based search. So you build your search index dynamically from your db and run searches against it instead of against your primary db, giving you faster response with no db load.

(Everything I'm about to say is about solr, but may well also apply to Sphinx, I dunno.) Solr is also easier for webby people to use rather than needing DBAs to build clever SQL queries. It has REST-like APIs, so you can build the search directly in the URL, eg http://dreamwidth.org/stylesearch?q=fluffy&fq=accessibility:ON&fq=color:blue&fq=inheritance:zesty

That being said we probably don't need it for performance reasons; our set of styles is small as database datasets go.

I feel Educated! Though thankfully I am fairly comfortable with direct SQL queries as a result of working what was a essentially database maintenence at a previous job.

I poked mark on IRC earlier because he is the Master of Scaling and may have a better sense of how hardcore we need to be about efficiency - that is a department I know very little about (other than efficiency = good, but that sometimes it means trade-offs)

(Long answer so I can ramble a bit...)

Since we're talking about searching on the order of thousands of items -- ~2000 now, even in an arbitrary future with 100,000 themes... let's do some thinking about the complexity of the problem.

Each theme will have tags. Probably on the order of 10. Each tag is going to be a unique ID of some sort -- since that's how we typically do tags. This means each theme is going to have on the order of 100 bytes of data -- some tags, some ids. Even with 100,000 themes -- that's 10MB of raw data. That's really not much and easily fits in RAM.

That said, I wouldn't want to put 10MB * webservers (6) * processes (20), since that will blow up to more than a gigabyte of RAM just to store this data structure. But we know that searching is probably not going to be a top 10 activity, so this is a perfect use of a Gearman worker -- that way we get to store the data structure once or twice. We can add more Gearman workers for load.

So, for implementing faceted search, I would do it like...

1) Create a table for storing the tags that we use. It would basically contain columns for (tagid, tagname, tagdescription) and any other metadata we want about tags.

2) Create a second table that contains mappings from (styleid, tagid). Each style can have many tags, and you want indices that go both ways (tag -> style, style -> tag).

Then, the Gearman worker would start up and do SELECT * FROM style_tags and select out all of the data. It would then build an in-memory hash that basically looks like:

%TAG_HASH = (
    tagid => [ style, style, style... ],
    tagid2 => [ style, style, style... ],
)

It would then listen for requests. When someone submits a request, the Gearman worker can then do some simple set logic -- intersection for AND, union for OR. Realistically you'll be doing this on sets of at most a few hundred items (we probably want to make sure any single tag is no on 80% of our styles, f.ex.).

...

And voila. That should do it. The Gearman worker will take a few seconds to load (10MB of data has to be fetched and sorted into memory), but that's relatively quick. You should also probably make it reload every 20 or 30 minutes, just to make sure that it stays fairly fresh. (If the tags update very often. If not, adjust the frequency.)

The problem is small enough that I would say simple-and-straightforward is the best approach. And, if we ever get into a future where we have milions of themes, then we can revisit this and adjust. (But realistically, I expect by then machines will have gotten fast enough that this approach always works.)

Thoughts?

This was more or less my original idea (sans Gearman loading anyway). I'll have to check the tables but I think this is more or less how the category tables are set up already, which reduces the backend stuff in S2.pm that needs changing.

Mostly noting this here for my remembering, though potentially more food for thought - I double-checked, and the s2categories is basically table 2 in your outline - s2 layer id, category keyword id, an 'active' column (because I think right now we can deactivate categories, and I can see that being useful in the future) and... a 'has_subcats' column which I maaaaaaay have just added on my own db when I was first looking at nested subcategories (which has kind of been replaced by 'tags').

Of course, now the tricky part is figuring out where the category keyword id -> name map is....

Flat | Top-Level Comments Only

Data and query structure for faceted styles search

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject