We Built a Search Engine – Presenting OutOfMemoryError

Cast: Vamsee Krishna, Nirtyanath Jaganathan, Leo Thumma, and Sam Stern

Our final project for CIS 455 (Internet and Web Systems) was to build a search engine.  Not a search algorithm, not a frontend for a search engine, but an entire search engine from start to finish.  That included a crawler, an indexer, a PageRank-er, and a web frontend.  All of these components had to be distributed over a 10-machine, 50-core Amazon EC2 computing network.  We also threw in some extra features like intelligent spellchecking, DuckDuckGo contextual integration, and eBay results to make it more fun.

Most of the distribution organization was performed using FreePastry to distribute work based on a partition of the URL namespace.
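
To make "a partition of the URL namespace" a little more concrete, here is a minimal sketch of the general idea. It is not the project's actual code and it does not use FreePastry's API: it simply hashes a URL's host name and takes the result modulo the number of machines, whereas a Pastry-style overlay routes each key to the node whose ID is numerically closest. The class name `UrlPartitioner`, the method `nodeFor`, and the choice to partition by host rather than by full URL are all assumptions made for illustration.

```java
import java.math.BigInteger;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper: maps a URL to one of N worker nodes by hashing its host.
final class UrlPartitioner {
    private final int nodeCount;

    UrlPartitioner(int nodeCount) {
        if (nodeCount <= 0) {
            throw new IllegalArgumentException("nodeCount must be positive");
        }
        this.nodeCount = nodeCount;
    }

    // Returns an index in [0, nodeCount). URLs on the same host always map
    // to the same index, so one node owns each host (an assumption of this sketch).
    int nodeFor(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(
                    host.toLowerCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8));
            // Treat the 160-bit digest as a non-negative integer, then reduce it.
            return new BigInteger(1, digest)
                    .mod(BigInteger.valueOf(nodeCount))
                    .intValue();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

With ten machines, `new UrlPartitioner(10).nodeFor(url)` gives the index of the machine that would own that URL.  Keeping every page of a host on one node makes per-host politeness limits easy to enforce in a crawler, at the cost of uneven load when a few hosts are very large.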

The name of our product?  OutOfMemoryError.  The product was named for the error that plagued our development from day one.  Eventually, to eradicate the error, we removed every single data structure from memory and replaced it with our own combination of transactional Berkeley databases, S3 storage, and other persistent data stores.   The result was an intensely robust system that could recover seamlessly from any fault (you could even pull the plug on the computer and lose nothing).

This is the coolest thing I have ever coded.  Below are some pictures of the final product.  I’ll also include a link to our documentation paper, where you can get a better understanding of the program architecture.

The home page:

Image

Searching for Barack Obama (notice the DuckDuckGo integration on the right-hand side):

Image

Searching for MacBook (with eBay results):

Image

The results are pretty good, but they do not do justice to our software because they were limited by the size of the corpus.  Due to time constraints we only crawled about half a million pages.  That is not a small amount, but the web as a whole is 50 billion pages or more, so we reached at most about 1 in every 100,000 pages out there (500,000 out of 50 billion).  In that context, the relevance of our results is pretty great.

Here’s the link to the documentation, where you can learn more about how we did this:

Final Report

If you’d like to take a look at the source code, let me know in an email.


5 thoughts on “We Built a Search Engine – Presenting OutOfMemoryError”

    • I was responsible for the crawler, the DuckDuckGo integration, the frontend servlet, and the spell checker. My teammates were responsible for what I’d consider the harder parts, the indexer and the PageRank implementation.

  1. By the way, hatboysam, you got the most important piece of the search engine: the right combination of constants for the search engine to give meaningful results. The PageRank implementation was a joke, and it is trivial. Give yourself some credit, would you?
