+ 1

How do I create a web scraper in Python?

Just figuring it out.

31st Dec 2016, 7:20 AM

David Toscano

9 Answers

+ 4

import urllib.request
import bs4

def scrape():
    base_url = 'https://python.org'
    response = urllib.request.urlopen(base_url)
    html = response.read()
    print(html)  # raw source code
    bs_obj = bs4.BeautifulSoup(html, 'html.parser')
    # printing all links (only anchors that actually have an href)
    for i in bs_obj.find_all('a', href=True):
        print(i['href'])
    # looking for specific traits of the content:
    # printing paragraphs with the class "info"
    for i in bs_obj.find_all('p', {'class': 'info'}):
        print(i.text)  # paragraph text without tags

scrape()

2nd Jan 2017, 10:13 PM

Given

+ 3

My simple example of a web scraper:

import urllib.request
import bs4

def scrape():
    base_url = 'https://python.org'
    response = urllib.request.urlopen(base_url)
    html = response.read()
    print(html)  # raw source code
    bs_obj = bs4.BeautifulSoup(html, 'html.parser')
    # printing all links (only anchors that actually have an href)
    for i in bs_obj.find_all('a', href=True):
        print(i['href'])

scrape()

2nd Jan 2017, 9:59 PM

Given

+ 1

Following this post. Very interesting, and I would like to understand how a web scraper works.

31st Dec 2016, 7:29 AM

Shiv Kumar

+ 1

Thanks for the answer. Do you have any suggestions for more specific options, such as finding specific words inside one page, across an entire site, or across a group of sites?

2nd Jan 2017, 10:05 PM

David Toscano

+ 1

David Toscano Web scraper in Python

If you have basic knowledge of Python, you know pip. With pip you install modules that are not built in. You need at least the following modules:

urllib3
bs4

It is better to also check the server's certificate. For that you need certifi, so install certifi with pip too.

We import what we need:

import urllib3
import certifi
from bs4 import BeautifulSoup

What is going on? Python now has to do what your browser usually does: it should browse HTML documents. Reading the structure of such a document is also called parsing.

We create an object:

http = urllib3.PoolManager(cert_reqs="CERT_REQUIRED", ca_certs=certifi.where())

Now we give Python the site to browse:

site = http.request("GET", "https://www.sololearn.com")
soup = BeautifulSoup(site.data, "html.parser")

We told Python to fetch the HTML document only if the server presents a valid certificate, and then to parse it. Now you can do different things. For example, you can look for a specific HTML element:

result = soup.find_all("h3", class_="text")  # only an example; go to the website and look for the HTML element you need

OK. To show the results

3rd Sep 2018, 5:15 AM

Sven_m

+ 1

row by row, you can use a list. There are many solutions, I guess; this is mine at the moment.

list_results = []  # empty list

def show_results(x, y):
    x.extend(y)  # add every found element to the list
    for i in x:
        print(i)

show_results(list_results, result)

(Sorry about the function; I would rather not write it again. We don't love typing ^^)

Or you look for a specific word. After parsing the site, we give Python another job: we search the first 50 characters for the string "DOCTYPE HTML":

import re

# site is the response from the request above
check_site = str(site.data[0:50])  # we must convert to the right data type
pattern = r"<!DOCTYPE HTML"
if re.search(pattern, check_site):
    print("Site is HTML 5")
else:
    pattern2 = r"<!doctype html"
    if re.search(pattern2, check_site):
        print("Doctype HTML is labeled")
    else:
        print("Doctype HTML isnt labeled")

Hope this helps a bit :)
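
For the earlier question about finding specific words in a page, a minimal sketch could look like this. It swaps in the standard library's html.parser instead of bs4 so it runs without extra installs, and it works on a made-up sample page (SAMPLE_HTML) rather than a live download. TextExtractor and count_words are hypothetical helper names, not part of any library.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def count_words(html, words):
    """Return {word: number of whole-word, case-insensitive matches}."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return {
        w: len(re.findall(r"\b" + re.escape(w) + r"\b", text, re.IGNORECASE))
        for w in words
    }


# Hypothetical sample page, used here instead of a live download.
SAMPLE_HTML = """
<html><head><title>Python news</title>
<script>var python = "ignored";</script></head>
<body><h3 class="text">Learn Python</h3>
<p class="info">Python makes scraping easy. Scraping needs care.</p>
</body></html>
"""

counts = count_words(SAMPLE_HTML, ["python", "scraping", "java"])
print(counts)
```

To search a whole site or a group of sites, the same function could be called once per downloaded page, for example on the decoded site.data from the request shown earlier.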

3rd Sep 2018, 5:30 AM

Sven_m

+ 1

Line 7 is wrong. The correct version is: for i in x: otherwise it raises an error :)

3rd Sep 2018, 6:11 AM

Sven_m

There are a lot of examples online, but I need an answer that is more practical than theoretical.

31st Dec 2016, 7:45 AM

David Toscano

Use lxml; it is much faster than bs4. For large projects use Scrapy, which can handle many requests concurrently (it is built on the asynchronous Twisted framework rather than on threads). Stack Overflow has tons of examples on this.
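
A minimal sketch of the lxml approach, assuming lxml is installed (pip install lxml). It uses a made-up sample page (SAMPLE_HTML) instead of a live download, and uses XPath queries to do the same job as the bs4 loops in the other answers: list all links and print paragraphs with the class "info".

```python
from lxml import html as lxml_html

# Hypothetical sample page, used here instead of a live download.
SAMPLE_HTML = """
<html><body>
<a href="/about">About</a>
<a href="https://docs.python.org">Docs</a>
<a>no href</a>
<p class="info">First paragraph.</p>
<p>Other paragraph.</p>
</body></html>
"""

# Parse the HTML string into an element tree.
tree = lxml_html.fromstring(SAMPLE_HTML)

# XPath: the href attribute of every <a> that has one.
links = tree.xpath("//a/@href")

# XPath: every <p> whose class attribute is exactly "info".
info_texts = [p.text_content() for p in tree.xpath('//p[@class="info"]')]

print(links)
print(info_texts)
```

With a real page, the string passed to fromstring would be the downloaded HTML instead of SAMPLE_HTML.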

29th Apr 2017, 4:48 AM

Artem Anatol