One of the projects I am working on is a simple search engine for my company's platform. It took me a while to get all of the pieces together, so I thought I would share some of my learning notes.

Keep in mind: these notes are meant to give you some familiarity with the concepts behind building a search engine, as a starting point for learning and exploration. They do not go into specific technical details.

So, a standard search engine normally consists of three main components:

  • We need a web crawler to gather raw content from multiple web pages (or from the web as a whole).

  • We need an indexer to structure the parsed content so that we can search it quickly later. Think of the index at the back of a book: we use keywords to find content in the book fast.

  • And finally, we need the actual search query: a place to enter text, which is then looked up in our stored index.
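To make the indexer and query pieces a bit more concrete, here is a minimal sketch of a tiny in-memory inverted index in Python. This is only an illustration of the idea, not how Nutch or Solr work internally. The `pages` dictionary stands in for crawler output, and its URLs, its text, and the function names are all made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page URLs containing it (an inverted index)."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the sorted URLs of pages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    # index.get() avoids creating empty entries in the defaultdict
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return sorted(matches)


# Hypothetical stand-in for what a crawler would gather.
pages = {
    "http://example.com/a": "Nutch is a web crawler",
    "http://example.com/b": "Solr is a search server",
    "http://example.com/c": "A crawler feeds the search index",
}
index = build_index(pages)
```

Calling `search(index, "search crawler")` returns only the page that contains both words, just like looking up two keywords in a book index and keeping the pages they share.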

After some research (initially I was working with the Sphinx search engine), I settled on a pair of really good open-source programs that grew out of the Apache Lucene search project.

  • Apache Nutch: A widely used and stable web crawler. It is written primarily in Java.

  • Apache Solr: A popular search server that does both indexing and querying and can be integrated with Nutch.
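Since Solr is a server, querying it comes down to sending an HTTP request to its select handler. The sketch below only builds such a request URL in Python, without contacting a server. It assumes a Solr instance on the default port 8983, a core named `mycore`, and a field named `content`; all three are placeholders for illustration, so adjust them to your own setup.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, rows=10):
    """Build a URL for Solr's /select handler that asks for JSON results."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Assumed values: a local Solr, a hypothetical core "mycore" and field "content".
url = build_select_url("http://localhost:8983", "mycore", "content:crawler", rows=5)
print(url)
# prints http://localhost:8983/solr/mycore/select?q=content%3Acrawler&rows=5&wt=json
```

Opening a URL like this in a browser against a running Solr is a quick way to check that documents were actually indexed.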

Now that we know the names of the tools we will be using, and why they are needed, we need to learn how to actually use them.

For this purpose I wrote three short tutorials. They introduce the search tools, run a few commands, and get you familiar with the software. At the end you will have a little search engine and a starting point for all sorts of interesting projects.

  1. Solr5 Quick Tutorial, Indexing Content.

  2. Nutch1 Quick Tutorial, Learning to Crawl.

  3. Nutch1-Solr5 Integration, Searching the Web.

Have fun searching, enjoy and share :)



