+ 6
Which language is mostly used to make Web crawlers?
9 answers
+ 14
Here's a list of 50 open source web crawlers
http://bigdata-madesimple.com/top-50-open-source-web-crawlers-for-data-mining/
You might notice that Java has a strong presence in the list,
though Scrapy (Python) is quite popular.
+ 3
Don't forget about BS4 (Beautiful Soup 4) in Python.
https://pypi.org/project/beautifulsoup4/
+ 1
I use a Linux program called wget from the command line. It lets me fetch any file, as well as whole directory structures, and much more.
In practice, I use wget to check for webpage updates. I have a bash script that runs twice a month to check whether certain webpages have changed.
Specifically, I check the nmap.org homepage and its download page.
wget could easily be used for custom web crawling in combination with bash:
Download [ directory structure | webpage | file ]
Process the data to fit your needs
Save the processed code (in a MySQL database, perhaps)
Compare the code to the last time it was crawled
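The four steps above can be sketched in Python as well. This is only a rough illustration, not the bash script described in this answer: it swaps wget for the standard library's `urllib.request`, and the function names and the `store` dict (a stand-in for the database) are assumptions made up for the example.

```python
import hashlib
import urllib.request


def download(url: str) -> bytes:
    """Step 1: fetch the raw page, roughly what `wget -O - URL` would return."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def process(raw: bytes) -> str:
    """Step 2: process the data; this sketch only collapses runs of whitespace."""
    return " ".join(raw.decode("utf-8", errors="replace").split())


def fingerprint(text: str) -> str:
    """Reduce the processed page to a short hash that is cheap to store and compare."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def has_changed(url: str, raw: bytes, store: dict) -> bool:
    """Steps 3 and 4: save the new fingerprint and compare it with the previous one.

    `store` is a hypothetical stand-in for the database (a plain dict here).
    Returns False on the first crawl, because there is nothing to compare against.
    """
    new = fingerprint(process(raw))
    old = store.get(url)
    store[url] = new
    return old is not None and old != new
```

A scheduled job could then call `has_changed(url, download(url), store)` for each page it watches and report the ones that return `True`.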
0
Here is a web crawl of Sololearn data using JavaScript
https://code.sololearn.com/WnOKdgzZQCeO/?ref=app
0
C++
- 1
What are web crawlers?
- 1
- 1
To follow up on my first reply and add more detail on the process: wget can download each webpage along with all of its elements, either as separate files or combined into one. The wget man page has a wealth of information about the possibilities and explains much more than I could hope to while keeping your attention.
Processing the data in this example would likely use a Linux shell such as Bash or Dash. This is useful, for example, if you want to use a page's meta tag attributes to display alongside a user's query. Google, for example, shows the page's meta description beneath the page title and link.
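To make the meta description idea concrete, here is a minimal sketch. It uses Python's standard `html.parser` rather than a shell, which is a substitution on my part, and the class and function names are invented for the example.

```python
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Remembers the content of the first <meta name="description"> tag it sees."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this method.
        if tag != "meta" or self.description is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "description":
            self.description = attributes.get("content")


def meta_description(html: str):
    """Return the page's meta description, or None if the page has none."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    return parser.description
```

The returned string is the kind of snippet a search page could show beneath the title and link of a result.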
Another use for processing data would be to detect plagiarism. [Detecting plagiarism is beyond the scope of this topic.]
Saving to a database is relatively self-explanatory: After all, you'll need a way to query the data later.
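As a rough sketch of the saving step, the snippet below uses Python's built-in `sqlite3` instead of MySQL so that it runs without a server. The table layout and function names are assumptions for illustration only.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the crawl database; ':memory:' keeps everything in RAM."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT NOT NULL, crawled_at TEXT NOT NULL, body TEXT NOT NULL)"
    )
    return conn


def save_page(conn: sqlite3.Connection, url: str, crawled_at: str, body: str) -> None:
    """Store one crawl result; crawled_at is an ISO 8601 timestamp string."""
    conn.execute(
        "INSERT INTO pages (url, crawled_at, body) VALUES (?, ?, ?)",
        (url, crawled_at, body),
    )
    conn.commit()


def latest_body(conn: sqlite3.Connection, url: str):
    """Query the most recent stored copy of a page, or None if it was never crawled."""
    row = conn.execute(
        "SELECT body FROM pages WHERE url = ? ORDER BY crawled_at DESC LIMIT 1",
        (url,),
    ).fetchone()
    return row[0] if row else None
```

Storing timestamps as ISO 8601 text means a plain `ORDER BY` sorts them chronologically, which keeps the "latest copy" query simple.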
After comparing the code to the last time the same resource was crawled, you'll have to process the information again to determine how to handle the differences.
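One possible way to see what actually differs between two crawls is a line-by-line diff. The sketch below uses Python's standard `difflib`; the function name is made up for the example, and what you do with the result is up to you.

```python
import difflib


def changed_lines(old: str, new: str) -> list:
    """Return only the removed ('- ') and added ('+ ') lines between two versions."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    # ndiff also emits unchanged lines ('  ') and hint lines ('? '); drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

An empty list means nothing changed; otherwise the lines can be logged, mailed, or fed into whatever handling you decide on.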