+ 6

Which language is mostly used to make Web crawlers?

interested

6th Oct 2018, 4:56 PM
Endur Muunganirwa
9 Answers
+ 14
Here's a list of 50 open source web crawlers: http://bigdata-madesimple.com/top-50-open-source-web-crawlers-for-data-mining/ You might notice that Java has a strong presence in the list, though Scrapy (Python) is quite popular.
6th Oct 2018, 5:28 PM
Burey
+ 3
7th Oct 2018, 4:47 AM
Allan Cao
+ 1
I use a Linux program called wget from the command line. It can fetch individual files, whole directory structures, and much more. In practice, I use wget to check for webpage updates: I have a bash script that runs twice a month to check whether certain webpages have changed. Specifically, I check the nmap.org homepage and their download page. wget could easily be used for custom web crawling in combination with bash:
1. Download [ directory structure | webpage | file ]
2. Process the data to fit your needs
3. Save the processed code (in a MySQL database, perhaps)
4. Compare the code to the last time it was crawled
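The download, save, and compare steps of the wget workflow described above could be sketched in Python roughly like this. It is only an illustration under some assumptions, not the bash script itself: `urllib.request` stands in for wget, SQLite stands in for the MySQL database, only a SHA-256 digest of each page is stored, and the names `fetch`, `open_store`, `has_changed`, `report` and the `pages` table are made up for the example.

```python
import hashlib
import sqlite3
import urllib.request


def fetch(url: str) -> bytes:
    """Download a resource (stand-in for fetching it with wget)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def fingerprint(content: bytes) -> str:
    """Reduce the page to a digest so only the digest needs storing."""
    return hashlib.sha256(content).hexdigest()


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the hypothetical table that remembers each crawl."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages "
        "(url TEXT PRIMARY KEY, digest TEXT NOT NULL)"
    )
    return conn


def has_changed(conn: sqlite3.Connection, url: str, content: bytes) -> bool:
    """Compare against the last crawl, then record the new digest.

    The first crawl of a URL counts as changed, since nothing is stored yet.
    """
    new_digest = fingerprint(content)
    row = conn.execute(
        "SELECT digest FROM pages WHERE url = ?", (url,)
    ).fetchone()
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, digest) VALUES (?, ?)",
        (url, new_digest),
    )
    conn.commit()
    return row is None or row[0] != new_digest


def report(conn: sqlite3.Connection, urls: list) -> None:
    """Fetch each URL and print whether it differs from the last crawl."""
    for url in urls:
        state = "changed" if has_changed(conn, url, fetch(url)) else "unchanged"
        print(url, state)
```

Calling `report(open_store("crawl.db"), ["https://nmap.org/"])` from a scheduled job would roughly mirror the twice-a-month check; the file name `crawl.db` is just a placeholder. Pages with dynamic content (timestamps, ads) would register as changed on every run, so a real script would probably strip those parts before hashing.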
7th Oct 2018, 3:43 PM
SadPuppies
7th Oct 2018, 4:03 PM
Pandiselvi
0
Here is a web crawl of Sololearn data using JavaScript: https://code.sololearn.com/WnOKdgzZQCeO/?ref=app
8th Oct 2018, 3:26 AM
Calviղ
0
C++
9th Oct 2018, 4:34 AM
ADARSH RATHOD
- 1
What are web crawlers?
8th Oct 2018, 3:21 AM
Daniel Cooper
8th Oct 2018, 11:21 AM
Erion Gogu
- 1
To follow up on my first reply and add more detail on the process: wget can download each webpage along with all of its elements, either as separate files or combined into one. The wget man page has a wealth of information about the possibilities and can explain much more than I could hope to while keeping your attention. Processing the data in this example would likely use a Linux shell such as Bash or Dash. This is useful, for example, if you want to use a page's meta tag attributes to display alongside the results of a user's query. Google, for example, shows the page's meta description beneath the page title and link. Another use for processing data would be to detect plagiarism. [Detecting plagiarism is beyond the scope of this topic.] Saving to a database is relatively self-explanatory: after all, you'll need a way to query the data later. After comparing the code to the last time the same resource was crawled, you'll have to process the information again to determine how to handle the differences.
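As a rough illustration of the processing and comparing steps mentioned above, here is a small Python sketch using only the standard library. It assumes the page has already been downloaded to a string, and the names `MetaDescriptionParser`, `meta_description` and `changed_lines` are invented for the example. It shows one possible approach, not what a shell-based setup would look like.

```python
import difflib
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Remember the content of the first <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.description is not None:
            return
        attributes = dict(attrs)
        # Attribute names arrive lowercased; the value's case can vary.
        if (attributes.get("name") or "").lower() == "description":
            self.description = attributes.get("content")


def meta_description(html: str):
    """Return the page's meta description, or None if it has none."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    return parser.description


def changed_lines(old: str, new: str) -> list:
    """List removed ("- ") and added ("+ ") lines between two crawls."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

The extracted description is the kind of value you would save next to the URL for later queries, and `changed_lines` gives a starting point for deciding how to handle the differences between two crawls of the same resource.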
9th Oct 2018, 5:46 AM
SadPuppies