How can a web crawler of a search engine discover any website (even newly created ones), and what does it take to create a search engine from scratch?


It can’t, but there are quite a few signals it can use. Google, for example, has several sources it can use to discover new sites – Google Analytics, links from pages it already crawls (Reddit, for example), and so on.

It has also been speculated that they use host lookups performed through their public DNS servers, but Google has denied this (which you can trust as much as you want, but it would be a very stupid thing to lie about).

If you want to build your own search engine, you’re going to need a lot of resources to spider the web (i.e. connect to sites, download their content, and keep doing it “forever”), and then process that content into a format suitable for searching. A hobbyist project can use Nutch to build a simple spider and something based on Lucene for the search side – for example Solr, OpenSearch, Elasticsearch, etc.
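To give a rough idea of what the spidering part involves, here is a minimal sketch in Python (assuming the `requests` and `beautifulsoup4` libraries are installed; the seed URL is just a placeholder). A real crawler would also need robots.txt handling, deduplication, retry logic, and persistent storage.

```python
import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(seed_url, max_pages=50):
    """Breadth-first crawl starting from a seed URL.

    Returns a dict mapping each fetched URL to its raw HTML,
    which a later stage would parse and index.
    """
    seen = {seed_url}
    queue = deque([seed_url])
    pages = {}

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable hosts
        if resp.status_code != 200:
            continue

        pages[url] = resp.text

        # Extract outgoing links -- this is how new sites get discovered:
        # a page the crawler already knows about links to one it doesn't.
        soup = BeautifulSoup(resp.text, "html.parser")
        for tag in soup.find_all("a", href=True):
            link = urljoin(url, tag["href"])
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                queue.append(link)

        time.sleep(1)  # be polite: don't hammer servers

    return pages


if __name__ == "__main__":
    results = crawl("https://example.com")  # placeholder seed
    print(f"Fetched {len(results)} pages")
```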

If you want to do it from scratch, you should research the topic of information retrieval.

You can build a simple ranking search engine in a day or two, but the challenge gets interesting once you grow beyond 100 or 1,000 documents and still want to maintain relevance against people who try to game the system.
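As a rough illustration of the “simple ranking search engine” part, here is a toy TF-IDF ranker in pure Python (the documents and query are made-up examples). It works fine for a handful of documents, but says nothing about spam resistance or link-based signals, which is where the real difficulty lies.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index plus per-document term counts."""
    index = defaultdict(set)   # term -> set of doc ids containing it
    term_counts = {}           # doc id -> Counter of terms
    for doc_id, text in docs.items():
        counts = Counter(tokenize(text))
        term_counts[doc_id] = counts
        for term in counts:
            index[term].add(doc_id)
    return index, term_counts


def search(query, index, term_counts):
    """Score documents by summed TF-IDF of the query terms."""
    n_docs = len(term_counts)
    scores = defaultdict(float)
    for term in tokenize(query):
        doc_ids = index.get(term, set())
        if not doc_ids:
            continue
        idf = math.log(n_docs / len(doc_ids))  # rarer terms weigh more
        for doc_id in doc_ids:
            tf = term_counts[doc_id][term]
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)


if __name__ == "__main__":
    docs = {  # made-up example documents
        "a": "web crawlers discover new pages by following links",
        "b": "a search engine ranks documents by relevance",
        "c": "lucene is a library for building search engines",
    }
    index, term_counts = build_index(docs)
    print(search("search engine ranking", index, term_counts))
```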