+ 6

Which language is most commonly used to make web crawlers?

interested

6th Oct 2018, 4:56 PM

Endur Muunganirwa

9 Answers

+ 14

Here's a list of 50 open-source web crawlers: http://bigdata-madesimple.com/top-50-open-source-web-crawlers-for-data-mining/ You might notice that Java has a strong presence in the list, though Scrapy (Python) is quite popular.

6th Oct 2018, 5:28 PM

Burey

+ 3

Don't forget about BS4 (Beautiful Soup 4) in Python: https://pypi.org/project/beautifulsoup4/

7th Oct 2018, 4:47 AM

Allan Cao

+ 1

I use a Linux program called wget from the command line. It lets me fetch any file, mirror directory structures, and much more. In practice, I use wget to check for webpage updates: I have a bash script that runs twice a month to check whether certain webpages have changed. Specifically, I check the nmap.org homepage and their download page. wget could easily be used for custom web crawling in combination with bash:
1. Download [ directory structure | webpage | file ]
2. Process the data to fit your needs
3. Save the processed code (in a MySQL database, perhaps)
4. Compare the code to the last time it was crawled
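The download-and-compare steps above can be sketched roughly as follows. This is only an illustration in Python using the standard library, not the wget and bash script described in this post; the function names (`fetch`, `has_changed`, `check_pages`) and the JSON state file are assumptions made for the example, and a real setup might store the data in a database instead.

```python
import hashlib
import json
import urllib.request
from pathlib import Path


def fetch(url: str) -> bytes:
    """Download the raw bytes of a page (hypothetical stand-in for the wget step)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def has_changed(url: str, content: bytes, state_file: Path) -> bool:
    """Compare the content's SHA-256 digest with the one saved on the last run.

    The new digest is always written back to state_file, so the next run
    compares against this one. A URL seen for the first time counts as changed.
    """
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    new_digest = hashlib.sha256(content).hexdigest()
    changed = state.get(url) != new_digest
    state[url] = new_digest
    state_file.write_text(json.dumps(state, indent=2))
    return changed


def check_pages(urls, state_file: Path) -> list:
    """Return the URLs whose content differs from the previous run."""
    return [url for url in urls if has_changed(url, fetch(url), state_file)]
```

Calling something like `check_pages(["https://nmap.org/"], Path("state.json"))` from a scheduled job would report which pages changed since the last run. Comparing digests of the raw bytes is the simplest possible check; pages that embed timestamps or other dynamic content will look changed every time, so the "process data" step would need to strip those parts first.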

7th Oct 2018, 3:43 PM

SadPuppies

Python

7th Oct 2018, 4:03 PM

Pandiselvi

Here is a web crawl of Sololearn data using JavaScript: https://code.sololearn.com/WnOKdgzZQCeO/?ref=app

8th Oct 2018, 3:26 AM

Calviղ

C++

9th Oct 2018, 4:34 AM

ADARSH RATHOD

- 1

What are web crawlers?

8th Oct 2018, 3:21 AM

Daniel Cooper

- 1

Python

8th Oct 2018, 11:21 AM

Erion Gogu

- 1

To follow up on my first reply and add more detail on the process: wget can download each webpage along with all of its elements, either as separate files or combined into one. The wget man page has a wealth of information about the possibilities and explains much more than I could hope to while keeping your attention. Processing the data in this example case would likely use a Linux shell such as Bash or Dash. This is useful, for example, if you want to use a page's meta tag attributes to display alongside the results of a user's query. Google, for example, shows the page's meta description beneath the page title and link. Another use for processing data would be to detect plagiarism. [Detecting plagiarism is beyond the scope of this topic.] Saving to a database is relatively self-explanatory: after all, you'll need a way to query the data later. After comparing the code to the last time the same resource was crawled, you'll have to process the information again to determine how to handle the differences.
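As a rough illustration of the meta description use case mentioned above, here is a minimal sketch in Python (not the shell-based processing this post describes) that pulls the description out of a downloaded page using the standard library's `html.parser`. The class and function names are hypothetical, chosen only for this example.

```python
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Remember the content of the first <meta name="description"> tag seen."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # Only <meta> tags matter, and only the first description is kept.
        if tag != "meta" or self.description is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "description":
            self.description = attributes.get("content")


def extract_description(html: str):
    """Return the page's meta description, or None if it has none."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    return parser.description
```

The returned string is what you would save alongside the page's URL and title, so that a later query can show it the way a search result snippet does.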

9th Oct 2018, 5:46 AM

SadPuppies