My little project of building a virtual affiliate store around a large number of product feeds is moving slowly, mainly because of other real work. But it has made some real progress.
I already have the product feeds (all FTP-based) loaded and indexed in various ways, including raw text search (inverted indexes based on stemming and lots of stopwords). I have a lot of set operations done as well. Everything runs in memory only (this is the read-only portion of the system). This memory-based query engine is designed to be distributed (accessed via REST), running on top of an embedded Jetty server and using XStream to marshal the data.
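To make the indexing idea concrete, here is a minimal sketch of an in-memory inverted index with stopword removal, stemming, and set-intersection queries. It is written in Python purely for illustration and is not the engine described above; the `stem` function is a crude stand-in for a real stemmer, and the stopword list, class name, and method names are all assumptions.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "with"}


def stem(word):
    """Crude suffix stripper standing in for a real stemmer (e.g. Porter)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stopwords, stem the rest."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(w) for w in words if w not in STOPWORDS]


class InvertedIndex:
    def __init__(self):
        # term -> set of product ids whose text contains that term
        self.postings = defaultdict(set)

    def add(self, product_id, text):
        for term in tokenize(text):
            self.postings[term].add(product_id)

    def search(self, query):
        """Return ids of products containing every query term (set intersection)."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result
```

Because each posting list is a plain set, the other set operations (union for OR, difference for NOT) fall out of the same structure.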
The biggest problems so far are (1) building a universal category tree and mapping all the merchant products into it (everyone has their own scheme) and (2) deciding what type of interface to provide.
The first issue is basically an information architecture problem. I looked at a lot of shopping sites and tried to discern what motivated their choices, then built one myself. Sadly, I can't yet see any way past manually mapping the store categorizations to mine and then dealing with unknown new ones as they appear. The real solution would be to analyze all the information on a product and build a system capable of mapping the data automatically. That will become necessary when I add larger merchants like Walmart (850K products) and iTunes (2M+).
The interface question is interesting, as I see three choices: (1) pure Ajax, (2) pure HTML, and (3) both.
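The manual approach can be sketched as a lookup table plus a tally of categories nobody has mapped yet. This is only an illustration in Python; the merchant names, category strings, record layout, and the `map_products` helper are all made up for the example.

```python
# Hypothetical hand-maintained table:
# (merchant, merchant's own category) -> path in the universal category tree.
CATEGORY_MAP = {
    ("acme-flowers", "Roses & Bouquets"): ("flowers", "roses"),
    ("megamart", "Garden > Cut Flowers > Roses"): ("flowers", "roses"),
}


def map_products(products, category_map):
    """Split products into mapped ones and a tally of unknown merchant categories.

    Each product is assumed to be a dict with "id", "merchant" and "category".
    """
    mapped = []    # (product id, universal category path)
    unmapped = {}  # (merchant, category) -> number of products waiting on it
    for product in products:
        key = (product["merchant"], product["category"])
        target = category_map.get(key)
        if target is None:
            unmapped[key] = unmapped.get(key, 0) + 1
        else:
            mapped.append((product["id"], target))
    return mapped, unmapped
```

Sorting the `unmapped` tally by count after each daily load would show which new merchant categories are worth mapping first.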
A pure ajax site has the advantage of being much faster to use, given a UI that emphasizes many different ways to slice-and-dice the products. The downside is that Google sees nothing much and you don't get the discovery from people searching for products via Google.
If I provide an interface that ultimately lists all the products from all merchants in a discoverable URL scheme (e.g. /products/flowers/roses/) then all the data will be found in search engines (barring being punished for duplicate content). The downside is a ton of processing as the engines devour all the pages (between Walmart and iTunes that's 300K pages or so). I can cache this information (perhaps as zipped HTML) but it's still a ton of bandwidth and time.
I would like to do both so that people can find the "store" via Google, but still have the benefit of alternate ways to find (and build RSS feeds from) stuff in the store. Performance of all this is very predictable and not too difficult to scale (most of the work happens in memory). If I choose to support reviews and comments then I have to build a more robust database architecture (something to preserve the information); currently the query engine is totally read-only and updated once per day (the data feeds update that way). Currently I can load 3M products per hour on my dual G5 development box.
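As a rough sketch of the URL scheme and the zipped-HTML cache, assuming category paths are lists of names and pages are rendered by some function supplied by the caller (the `slug`, `category_url`, and `PageCache` names are hypothetical, and Python is used only for illustration):

```python
import gzip
import re


def slug(name):
    """Turn a category name into a URL-safe path segment."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def category_url(path):
    """Build a discoverable URL such as /products/flowers/roses/."""
    return "/products/" + "/".join(slug(part) for part in path) + "/"


class PageCache:
    """Caches each rendered page as gzip-compressed bytes, keyed by URL."""

    def __init__(self, render):
        self._render = render  # callable: url -> HTML string (assumed)
        self._pages = {}

    def get_gzipped(self, url):
        # Render and compress on first request only; reuse the bytes afterwards.
        if url not in self._pages:
            html = self._render(url)
            self._pages[url] = gzip.compress(html.encode("utf-8"))
        return self._pages[url]

    def clear(self):
        """Drop everything, e.g. after the daily feed reload."""
        self._pages.clear()
```

Storing the compressed bytes means a crawler's repeat visit costs neither a render nor a compression pass, only the bandwidth.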
So this is an interesting project so far, but I need to eat and pay bills so I can't work as much as I would like.
There are many shopping sites but I see directions that haven't been explored yet. All my investment so far is my time, so once it goes live (whenever) it doesn't have to be huge to be successful.
More later.

Ryan Doherty 04/20/2007 17:48
You can have both AJAX and a discoverable URL scheme. Your homepage can have links to the discoverable URL scheme, but in JS you can attach event handlers to stop the browser from going there and dynamically load the content into whatever part of the page you want. Progressive enhancement/graceful degradation.
You should also be gzipping your HTML anyway before you send it over the wire. And you could set an expires header for 1 day or something like that for the full listings. Hopefully that will reduce your bandwidth.
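A minimal sketch of that gzip-plus-expiry idea, in Python for illustration only. The `listing_response` function is hypothetical, the header values are standard HTTP, and a real server would first check that the client's Accept-Encoding header allows gzip.

```python
import gzip
import time
from email.utils import formatdate

ONE_DAY = 24 * 60 * 60  # seconds


def listing_response(html, now=None, max_age=ONE_DAY):
    """Return (headers, body) for a gzipped listing page that expires in max_age seconds."""
    now = time.time() if now is None else now
    body = gzip.compress(html.encode("utf-8"))
    headers = {
        "Content-Type": "text/html; charset=utf-8",
        "Content-Encoding": "gzip",
        "Content-Length": str(len(body)),
        # Cache-Control is the modern form; Expires covers older clients.
        "Cache-Control": f"public, max-age={max_age}",
        "Expires": formatdate(now + max_age, usegmt=True),
        # Tell caches that the response varies with the client's Accept-Encoding.
        "Vary": "Accept-Encoding",
    }
    return headers, body
```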
Maybe you could do a user agent check to see if it's a search engine and feed them the cached versions of pages that are generated daily?
And cache, cache, cache as much as you can. You have a lot of stuff to cache, but I'm sure there's a way to cache the most heavily used info so you reduce the workload.
Good luck!
Stephen 04/23/2007 18:12
By "everything is in memory", do you mean the database too? And in any case, how big is memory? I remember when 10 megabytes was a big database. But 10^9 / (2*10^6) is still roughly 500 bytes per record. That's a lot. More than this comment. And that's only one gigabyte of RAM - the minimum RAM needed to run Vista. Face it, one gigabyte is a low end desktop. (Don't tell me Vista will run in 500 MB - that's fiction, or a lie, or something.)
codist 04/23/2007 20:15
There will be a database but only for a few constant items. Most of the data is loaded fresh every day into RAM and processed on the fly. RAM as in 2-4 GB. I'd love an OctoMac, 16GB and 8 cores but oh well.
Andries 05/13/2007 03:06
Our company tried the same thing. In the end the whole project was canned due to a single business reality: thousands of feeds does not mean that all products are equally useful to consumers. Even as a marketing company, we just could not find a way to market those products effectively. Our company also made one big technical blunder: instead of focusing all their time & resources on developing algorithms to filter out bogus products, they had us build this distributed search index. Sure, in the end it allowed us to search about 9 million products in under 1.5 seconds (worst case, including network latency & rendering in the browser), but it failed to resolve the fact that most queries brought back crap. Questions such as how to categorize products were put on the back burner, while we all had to pitch in to our then Chief Technical Officer's wet dream of gaining speed across volume.
We gained speed by loading our master index into memory, and carefully monitoring each node on the network to ensure the index size never exceeded the available RAM. Scaling the search network simply meant adding another node on the network, either a physical box, or if a computer has more than one processor, we added another virtual node.
In retrospect, using RAM for speed is a cop out. It ups the cost of scaling, thereby negating the purpose of building a distributed system in the first place. Our CTO also forced us to store the actual data in the indexes, which further increased the size of the index (we ended up using the indexes as a storage mechanism instead of a data retrieval mechanism). This limited how many products could be pushed to each node before we ran into the physical constraints of RAM.
In the end, I would recommend thinking hard and constantly about the architecture. Committing to a search goal such as under 1 second per search (as in our case) may sound exciting (and believe me it is), but in the end you would like users to visit your site and buy stuff. And a user is much more forgiving when he actually finds what he is looking for. Otherwise, even if you serve up a search over many gigabytes of data in 0.3 seconds, that same user will go back to Yahoo, Amazon, etc.
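The RAM constraint turns into simple arithmetic. Here is a back-of-the-envelope sketch in Python; the 9 million figure is the one mentioned above, while the record size and usable RAM per node in the usage note are assumed numbers, not measurements from that system.

```python
import math


def nodes_needed(records, avg_record_bytes, usable_ram_bytes):
    """How many nodes it takes to hold every record in RAM.

    avg_record_bytes should include the stored product data, since keeping
    it in the index is exactly what inflates the per-record cost.
    """
    per_node = usable_ram_bytes // avg_record_bytes  # whole records per node
    if per_node == 0:
        raise ValueError("a single record does not fit in a node's RAM")
    return math.ceil(records / per_node)
```

With an assumed 1 KiB per record and 2 GiB of usable RAM per node, `nodes_needed(9_000_000, 1024, 2 * 2**30)` comes to 5, and halving the record size by keeping the data outside the index would bring that down to 3.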
Rather spend time and effort in a good cataloging system, or use a scheme such as digg whereby users rate products, and feed that back into your catalog & search algorithm.