The Codist - Programmerthink

Building My Cheap, Scalable, High-Volume Query Site

Posted: 03/28/2007, Perm Link Readers: 3668


The project I am working on at home will require an architecture to support a high volume of ajax calls slicing and dicing data from a fairly large database. Since the data is only updated once a day, there are interesting options on how to build the system.

Web applications can be roughly divided into two types: mostly read and some write. The first type is mainly used to show the user some information based on some type of query, which could be explicit (like a query by example form) or more commonly based on a UI that gives them options. The second may do the same but allows updating of information by the user. It's rare for an application to not have some update features so what type it is may be hard to pin down. I view the difference as whether the user can update the information they are querying.

Amazon, for example, has a lot of write features (ordering, building lists, commenting) but they mostly don't affect the catalog of items (other than availability).

Digg may spend most of its time showing similar content but users are actually modifying the data as they interact with it.

The site I am trying to build will be more like Amazon, in that the majority of the information only changes rarely (like once per day) and that other interaction is peripheral. This makes it easier to use all sorts of caching strategies.

When I worked out the problem in 1998 with Consumers Digest Online we were using WebObjects, which would run separate processes (called an instance) in the same server (or multiple servers). Originally my digested search information (about 20MB mostly a big tree structure and associated indexes) was loaded into each instance, which somewhat limited how many we could run in a single server box. Later we modified the cache data (which was all static data) to run inside a shared memory space (HP/UX). This architecture then looked like this:

We were able to support as many as 2000 simultaneous users with this structure, even though the search engine was fuzzy with a lot of derived data searches. The average response time through the cache for any query was subsecond. There were something like 40,000 packages of features, 20,000 products, and an almost infinite combination of potential results. You couldn't cache individual results as each search was unlikely to ever appear again. The data changed once per week, and was processed for the caches after the new database was loaded. All display data came from the database directly.

During my year of working on projects for Sabre I wrote a white paper for them on how I would build a replacement for their monster reservation system (with its 8000-per-second query requirement) based on a similar but larger architecture. I don't think it went anywhere but eventually they built something with a more traditional database caching architecture. In their case the flights don't change too often but the reservations do so it is more of a type 2.

For my new application I will be managing a wide variety of slice-and-dice searches through a much larger space (iTunes for examples has 2M+ tracks) but at least the information isn't derived from the database data like in the CDOnline system (with its complex required and disallowed package combinations). The biggest issue is that in building an Ajax from end you need fast response even under heavy loads, since queries are 95% of what the application will do.

The architecture for this system looks something like this:

The point is to replicate the cache servers which have enough RAM to store the data required for searching all in memory. As the data in the database is updated only once per day, it can be optimized for the fasted possible searching when loaded into the cache server. Of course you could build something more traditional which would allow the database to manage the caching or use a caching framework (of which there are plenty both free and for-pay).

In my case I want to manage the searching myself as most of it involves deeply nested trees, which are hard to optimize in a relational database (Oracle has some benefits here but I've never used them). RAM is always faster than hard drives (at least until the new Flash drives get cheap enough) and since I know exactly what I will allow the users to do, I can build the cache precisely for those needs.

The pleasant thing about this architecture is that I can not only grow it as needed, but I can also distribute it over multiple data centers if necessary. This is basically what all of the big companies do. I expect to be able to manage as much as 100 transactions per second, all without any massive investment in servers (money I don't have anyway).

One other note is that I use Jetty as my appserver, which is really fast for Ajax calls with its continuation based architecture.

Anand Sharma 03/30/2007 00:05

I am wondering what software/service you're planning to use to provide the "cache server" capability?

nemlah 03/30/2007 08:07

We have developed an application with the exact same requirements, and while we don't have add a caching tier to our solution so far, if growth continues as planned we will have to add it soon enough. One thing I was always wondering about is how costly it is to create the caches (if for example we need to update the main database during the day, how fast will the caches be uptodate?). The application is written in rails and the obvious choice would be memcached, but I am looking forward to see your solution.

Regards,

Nemlah

Craig 03/30/2007 11:42

I would swap out parts of your existing data layer which are querying the database for results with a layer which queries Lucene for results.

Then, simply regenerate your Lucene indexes nightly. You shouldn't need that Cache layer since I believe Lucene has mechanisms for caching frequent queries.

codist 03/30/2007 11:52

Yeah lucene is a good idea if you have common queries, especially full text ones. In my case it won't work (as in the CDOnline version) as the queries are multidimensional and rarely repeated.

Chris Lu 03/30/2007 17:57

Lucene is good, even for exact match because it's just file access. Database is simply too slow. What you cache actually did is to move data to file access.

To try Lucene, you can use DBSight. It's super easy. You can create a production-level search in 3 minutes.

Please take a look at this

http://wiki.dbsight.com/index.php?title=CreateLuceneDatabaseSearchin3minutes

You can create a full-text database search service, return results as HTML/XML/JSON. It uses the Lucene directly in java, but can be easily used with Ruby, PHP, or any existing database web applicatoins.

You can easily index, re-index, incremental-index. It's also highly scalable and easily customizable.

wpbarr 04/12/2007 16:26

You have several good options for the distributed datastore:

  • Javaspaces

  • memcached, Tangosol

  • object-oriented databases

  • in-memory databases (Intersystems Cache)

You should be able to get sub-millisecond accesss times with any of them. With any of them, you get to stick with a pure object model and, you can use either keyed or associative queries.

kjgha 04/16/2007 03:59

klh

XCoder 04/18/2007 09:04

Since nobody is talking about hubernate's caching. How bad is hibernates two layer caching mechanism?

Alex Popescu 10/06/2007 06:46

Hi there!

I have built a couple of apps using the same approach. However, for those with very high requirements I have quickly find out the the DB node can become a bottleneck. So, without wanting to sound like an advise, I would probably try to partition the data or replicate the storage. Considering that in your case the DB updates are rare? then probably the simplest approach would be storage replication.

bests,

./alex

--

.w( the_mindstorm )p.