+ 1

How to create a web scraper in Python?

Just figuring it out.

31st Dec 2016, 7:20 AM
David Toscano
9 answers
+ 4
import urllib.request
import bs4

def scrape():
    base_url = 'https://python.org'
    request = urllib.request.urlopen(base_url)
    html = request.read()
    print(html)  # raw source code
    bs_obj = bs4.BeautifulSoup(html, 'html.parser')
    # printing all links
    for i in bs_obj.find_all('a'):
        print(i['href'])
    # looking for specific traits of the content:
    # printing the text of paragraphs with the class 'info', without tags
    for i in bs_obj.find_all('p', {'class': 'info'}):
        print(i.text)

scrape()
2nd Jan 2017, 10:13 PM
Given
Given - avatar
+ 3
My simple example of a web scraper:

import urllib.request
import bs4

def scrape():
    base_url = 'https://python.org'
    request = urllib.request.urlopen(base_url)
    html = request.read()
    print(html)  # raw source code
    bs_obj = bs4.BeautifulSoup(html, 'html.parser')
    # printing all links
    for i in bs_obj.find_all('a'):
        print(i['href'])

scrape()
2nd Jan 2017, 9:59 PM
Given
Given - avatar
+ 1
Following this post. Very interesting, and I would like to understand how a web scraper works.
31st Dec 2016, 7:29 AM
Shiv Kumar
Shiv Kumar - avatar
+ 1
Thanks for the answer. For something more specific, such as finding particular words inside one page, an entire site, or a group of sites, do you have any suggestions?
2nd Jan 2017, 10:05 PM
David Toscano
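One way to approach the question above, sketched with only the standard library (the sample HTML, URLs, and search word are placeholder assumptions, not from the original posts): parse out the visible text of a page and check whether a given word appears, then repeat per URL for a group of sites.

```python
import urllib.request
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects the visible text of an HTML document, tag-free."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def contains_word(html, word):
    """Report whether `word` occurs in the visible text of `html` (case-insensitive)."""
    parser = TextCollector()
    parser.feed(html)
    text = ' '.join(parser.chunks)
    return word.lower() in text.lower()

def page_contains(url, word):
    """Download one page and search its text; call in a loop over a list of URLs for a group of sites."""
    with urllib.request.urlopen(url) as response:
        html = response.read().decode('utf-8', errors='replace')
    return contains_word(html, word)

# quick local check on an inline sample, no network needed
sample = '<html><body><p>Scraping with <b>Python</b> is fun.</p></body></html>'
print(contains_word(sample, 'python'))  # True
```

For many pages you would simply loop: `for url in urls: print(url, page_contains(url, 'keyword'))`.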
+ 1
David Toscano Web scraper in Python: if you have basic knowledge of Python, you know pip. With pip you install new, non-built-in modules. You need at least the following modules: urllib3 and bs4. Better still, ask for a certificate, so you need certifi too (pip install certifi as well).

We import what we need:

import urllib3
import certifi
from bs4 import BeautifulSoup

What's going on? Python now has to do what your browser usually does: read HTML documents. This is also called parsing.

We create an object:

http = urllib3.PoolManager(cert_reqs="CERT_REQUIRED", ca_certs=certifi.where())

Now we give Python the site to browse:

site = http.request("GET", "https://www.sololearn.com")
soup = BeautifulSoup(site.data, "html.parser")

We told Python to parse the HTML document if the server has a valid certificate. Now you can do different things, e.g. look for a specific HTML element:

result = soup.find_all("h3", class_="text")  # only an example; go to the website and look up the actual element

OK, now to show results.
3rd Sep 2018, 5:15 AM
Sven_m
Sven_m - avatar
+ 1
To show results row by row, you can use a list. There are many solutions, I guess; this is mine at the moment:

list_results = []  # empty list

def show_results(list_results, result):
    list_results.append(result)
    for i in list_results:
        print(i)

show_results(list_results, result)

Or you look for a specific word. After giving Python the site to parse, we give it another job: search the first 50 characters for the string "DOCTYPE HTML" (this needs import re):

check_site = str(site.data[0:50])  # we must convert to the right data type
pattern = r"<!DOCTYPE HTML"
if re.search(pattern, check_site):
    print("Site is HTML 5")
else:
    pattern2 = r"<!doctype html"
    if re.search(pattern2, check_site):
        print("Doctype HTML is labeled")
    else:
        print("Doctype HTML isn't labeled")

Hope this helps a bit :)
3rd Sep 2018, 5:30 AM
Sven_m
Sven_m - avatar
+ 1
Correction to line 7 above: the right version is "for i in x:", otherwise it raises an error :)
3rd Sep 2018, 6:11 AM
Sven_m
Sven_m - avatar
0
There are a lot of examples online, but I need an answer that is more practical than theoretical.
31st Dec 2016, 7:45 AM
David Toscano
0
Use lxml; it is much faster than bs4. For large projects, use Scrapy with its powerful asynchronous crawling. Stack Overflow has tons of examples on this.
29th Apr 2017, 4:48 AM
Artem Anatol
Artem Anatol - avatar
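To illustrate the lxml suggestion above, a minimal sketch (it parses an inline HTML string so it runs without a network connection; the sample markup is a placeholder, and lxml must be installed via pip install lxml):

```python
from lxml import html

page = """
<html><body>
  <a href="https://python.org">Python</a>
  <p class="info">lxml parses this quickly.</p>
</body></html>
"""

tree = html.fromstring(page)

# all link targets, same idea as bs_obj.find_all('a') in the examples above
print(tree.xpath('//a/@href'))  # ['https://python.org']

# text of paragraphs with class 'info', via an XPath expression
print(tree.xpath("//p[@class='info']/text()"))  # ['lxml parses this quickly.']
```

For a live page you would fetch the HTML first (e.g. with urllib.request) and pass it to html.fromstring the same way.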