Cast: Vamsee Krishna, Nirtyanath Jaganathan, Leo Thumma, and Sam Stern
Our final project for CIS 455 (Internet and Web Systems) was to build a search engine. Not a search algorithm, not a frontend for a search engine, but an entire search engine from start to finish. That included a crawler, an indexer, a PageRank-er, and a web frontend. All of these components had to be distributed over a 10-machine, 50-core Amazon EC2 computing network. We also threw in some extra features like intelligent spellchecking, DuckDuckGo contextual integration, and eBay results to make it more fun.
Most of the distribution organization was performed using FreePastry to distribute work based on a partition of the URL namespace.
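To give a feel for what partitioning the URL namespace can look like, here is a minimal Python sketch. It is not our actual code and it does not use FreePastry. It only imitates the general idea behind Pastry-style routing, where a key is handled by the node whose ID is numerically closest to it. The worker names, the choice of SHA-1, and the decision to hash the hostname (so that all pages of one site land on the same machine) are assumptions made for illustration.

```python
import hashlib
from urllib.parse import urlsplit

ID_BITS = 160          # SHA-1 digest size; an assumption for this sketch
RING = 1 << ID_BITS    # IDs live on a ring of this size


def ring_id(text: str) -> int:
    """Map a string (node name or hostname) to a point on the ID ring."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest(), "big")


def ring_distance(a: int, b: int) -> int:
    """Shortest distance between two IDs, going either way around the ring."""
    d = abs(a - b)
    return min(d, RING - d)


def owner_of(url: str, nodes: list[str]) -> str:
    """Pick the node whose ID is closest to the hash of the URL's hostname."""
    host = urlsplit(url).hostname or ""
    key = ring_id(host)
    # Ties are broken by node name so the choice is deterministic.
    return min(nodes, key=lambda n: (ring_distance(ring_id(n), key), n))


# Hypothetical worker names standing in for the ten EC2 machines.
NODES = [f"worker-{i}" for i in range(10)]
```

Calling `owner_of("http://example.com/a", NODES)` and `owner_of("http://example.com/b", NODES)` returns the same worker, because only the hostname is hashed. Every machine can therefore decide locally where a newly discovered link should be sent.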
The name of our product? OutOfMemoryError. The product was named for the error that plagued our development from day one. Eventually, to eradicate the error, we removed every single data structure from memory and replaced it with our own combination of transactional Berkeley databases, S3 storage, and other persistent data stores. The result was an intensely robust system that could recover seamlessly from any fault (you could even pull the plug on the computer and lose nothing).
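As a rough illustration of moving a data structure out of memory, here is a small Python sketch of a crawl queue that lives entirely on disk. We used transactional Berkeley databases; this sketch swaps in SQLite from the Python standard library instead, and the class name, table name, and schema are all made up for the example.

```python
import sqlite3


class DurableQueue:
    """A FIFO queue of URLs stored in a SQLite file instead of in memory."""

    def __init__(self, path: str):
        self.db = sqlite3.connect(path)
        with self.db:  # commits on success, rolls back on an exception
            self.db.execute(
                "CREATE TABLE IF NOT EXISTS frontier ("
                "id INTEGER PRIMARY KEY AUTOINCREMENT, "
                "url TEXT UNIQUE NOT NULL)"
            )

    def push(self, url: str) -> None:
        """Add a URL; duplicates are silently ignored."""
        with self.db:
            self.db.execute(
                "INSERT OR IGNORE INTO frontier (url) VALUES (?)", (url,)
            )

    def pop(self):
        """Remove and return the oldest URL, or None if the queue is empty."""
        with self.db:
            row = self.db.execute(
                "SELECT id, url FROM frontier ORDER BY id LIMIT 1"
            ).fetchone()
            if row is None:
                return None
            self.db.execute("DELETE FROM frontier WHERE id = ?", (row[0],))
            return row[1]

    def __len__(self) -> int:
        return self.db.execute("SELECT COUNT(*) FROM frontier").fetchone()[0]

    def close(self) -> None:
        self.db.close()
```

Because every push and pop is committed to the file, a process that dies and restarts can reopen the same path and continue with the queue it left behind. A more careful version would mark a URL as in progress rather than deleting it on pop, so that a crash in the middle of fetching a page does not lose that URL.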
This is the coolest thing I have ever coded. Below are some pictures of the final product. I’ll also include a link to our documentation paper, where you can get a better understanding of the program architecture.
The home page:
Searching for Barack Obama (notice the DuckDuckGo integration on the right-hand side):
Searching for MacBook (with eBay results):
The results are pretty good, but they don't do our software justice because they were limited by the size of the corpus. Due to time constraints, we crawled only about half a million pages. That's not a small number, but the web as a whole is 50 billion pages or more, so we covered at most 1 in every 100,000 pages out there. In that context, the relevance of our results is pretty great.
Here’s the link to the documentation, where you can learn more about how we did this:
If you’d like to take a look at the source code, let me know in an email.