+ 2

How to make a search engine?

I want to build a search engine for a website that hosts files (PDF, Word, etc.). Can you tell me where I can learn to make one? I don't want to use a custom search engine.

24th Jul 2017, 3:40 PM
Ali Tahir
4 Answers
+ 4
There are four core activities in a search engine: crawling, indexing, ranking and query serving.

For crawling, it does not make sense to write your own, since there are many good open source choices nowadays, such as Nutch, Scrapy and Heritrix. You might do even better by skipping crawling in the beginning and using Common Crawl's excellent crawl, which is updated regularly.

For indexing, do not rely on external tools. It is best to write your own indexer, because that is how you learn what search is all about. Keep in mind that indexing is not rocket science: an index is simply a collection of result lists for keywords and N-grams (phrases). The work is in packing this collection efficiently and in making both lookup and insertion of new entries fast, keeping in mind that the two trade off against each other. You will probably pay more attention to fast lookup, though newer applications, such as social ones, are more write-intensive.

You will definitely want to do your own ranking, starting with a simple implementation of PageRank, which you may want to modify and experiment with, e.g. with non-uniform exit probabilities for the links on a page. It is not that difficult to run PageRank efficiently even on big datasets; you will need to implement some fairly basic block partitioning. After that, start experimenting with different heuristics for page and link elements such as titles, headings, anchor texts, term frequencies, etc.

Finally, for query serving, keep it basic, e.g. simple document partitioning across a few machines, with some forethought about how to scale that later on. It is conceivable today to cram a simple index (say a few KB per page) of over 1 billion pages into a single machine with multiple TB drives and lots of RAM, 32 GB at the minimum. If you are more adventurous, throw in an SSD as a middle layer and optimize it for cached reads with infrequent writes.

It is possible for a single dedicated programmer to do all this in 6-12 months.
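
To make the "collection of lists of results for keywords" idea concrete, here is a minimal in-memory sketch in Python. It is not a production design: the names `tokenize`, `build_index` and `search` are made up for illustration, the sample documents are invented, and it assumes the text has already been extracted from the PDF or Word files.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return ids of documents that contain every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample data: document id -> already extracted text.
docs = {
    1: "Annual report in PDF format",
    2: "Meeting notes as a Word document",
    3: "PDF guide to writing a report",
}
index = build_index(docs)
print(search(index, "pdf report"))
```

A real indexer would also store term positions (for phrase queries) and compress the posting lists, which is where the lookup versus insertion trade-off mentioned above starts to matter.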
Have fun at it and let us know how it turned out!
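
For the ranking step mentioned above, a simple PageRank can be sketched with plain power iteration. This is only an illustration under some assumptions: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count and the tiny link graph are all chosen here for the example, and every outgoing link gets the same (uniform) probability.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank uniformly over its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page link graph.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Experimenting with non-uniform exit probabilities would mean replacing the uniform `share` with a per-link weight that still sums to the page's damped rank.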
24th Jul 2017, 3:51 PM
~Sudo Bash
+ 3
Try Google Custom Search: https://cse.google.com/cse/
24th Jul 2017, 3:42 PM
The Coding Sloth
+ 3
@pardeep, just copy and paste the link
24th Jul 2017, 3:53 PM
The Coding Sloth
+ 1
Read the last line: I don't want to use any custom search engines.
24th Jul 2017, 3:46 PM
Ali Tahir