+ 1
How do I create a web scraper in Python?
Just figuring it out.
9 Answers
+ 4
from urllib.request import urlopen
import bs4

def scrape():
    base_url = 'https://python.org'
    response = urlopen(base_url)
    html = response.read()
    print(html)  # raw source code (bytes)
    bs_obj = bs4.BeautifulSoup(html, 'html.parser')
    # print all links (skip <a> tags that have no href)
    for a in bs_obj.find_all('a', href=True):
        print(a['href'])
    # look for specific traits of the content:
    # print paragraphs with the class "info"
    for p in bs_obj.find_all('p', {'class': 'info'}):
        print(p.text)  # paragraph text without tags

scrape()
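To try the same idea without a network connection or extra packages, here is a minimal sketch that uses only the standard library's html.parser. The sample page and the class name LinkAndInfoParser are made up for illustration; this is not how bs4 works internally, just the same link and paragraph extraction done by hand.

```python
from html.parser import HTMLParser

# Hard-coded sample page (an assumption, so the sketch runs offline).
SAMPLE_HTML = """
<html><body>
  <a href="/about">About</a>
  <a name="top">No href here</a>
  <p class="info">First info paragraph.</p>
  <p>Plain paragraph.</p>
  <p class="info extra">Second <b>info</b> paragraph.</p>
</body></html>
"""


class LinkAndInfoParser(HTMLParser):
    """Collects href values of <a> tags and the text of <p class="info"> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.info_paragraphs = []
        self._in_info = False  # True while inside a <p> with class "info"

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "p" and "info" in (attrs.get("class") or "").split():
            self._in_info = True
            self.info_paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_info = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still collected.
        if self._in_info:
            self.info_paragraphs[-1] += data


parser = LinkAndInfoParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.links)
print(parser.info_paragraphs)
```

For real pages bs4 is usually more convenient, because it tolerates broken markup and offers find_all; the sketch only shows what happens underneath.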
+ 3
My simple example of a web scraper:
from urllib.request import urlopen
import bs4

def scrape():
    base_url = 'https://python.org'
    response = urlopen(base_url)
    html = response.read()
    print(html)  # raw source code (bytes)
    bs_obj = bs4.BeautifulSoup(html, 'html.parser')
    # print all links (skip <a> tags that have no href)
    for a in bs_obj.find_all('a', href=True):
        print(a['href'])

scrape()
+ 1
Following this post. Very interesting; I would like to understand how a web scraper works.
+ 1
Thanks for the answer. Do you have any suggestions for more specific tasks, such as finding specific words on one page, across an entire site, or across a group of sites?
+ 1
David Toscano
Web scraper in Python
If you have basic knowledge of Python, you know pip.
With pip, you install modules that are not built in.
You need at least the following modules:
urllib3
bs4
It is better to also build in certificate verification,
so you need certifi.
Install certifi with pip as well.
We import what we need:
import urllib3
import certifi
from bs4 import BeautifulSoup
What's going on?
Python now has to do what your browser usually does: fetch and read HTML documents.
Reading the structure of a document is called parsing.
We create an object:
http = urllib3.PoolManager(cert_reqs="CERT_REQUIRED", ca_certs=certifi.where())
Now we give Python the site to browse:
site = http.request("GET", "https://www.sololearn.com")
soup = BeautifulSoup(site.data, "html.parser")
We told Python to parse the HTML document, provided the server presents a valid certificate.
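If you would rather not install urllib3 and certifi, the fetch step can also be sketched with the standard library alone. This swaps in urllib.request, whose default SSL context verifies certificates against the system store. The helper names normalize_url and fetch_html and the User-Agent string are made up for this example.

```python
import ssl
from urllib.parse import urlsplit
from urllib.request import Request, urlopen


def normalize_url(url, default_scheme="https"):
    """Prepend a scheme when it is missing, e.g. for 'www.sololearn.com'."""
    if urlsplit(url).scheme in ("http", "https"):
        return url
    return f"{default_scheme}://{url.lstrip('/')}"


def fetch_html(url, timeout=10):
    """Download a page as text; the default SSL context verifies certificates."""
    context = ssl.create_default_context()
    request = Request(
        normalize_url(url),
        headers={"User-Agent": "example-scraper/0.1"},  # hypothetical name
    )
    with urlopen(request, timeout=timeout, context=context) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage (needs network access, so it is left commented out):
# html = fetch_html("www.sololearn.com")
# print(html[:200])
print(normalize_url("www.sololearn.com"))
```

The string returned by fetch_html can be passed to BeautifulSoup(html, "html.parser") in the same way as site.data.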
Now you can do different things.
For example, you can look for a specific HTML element:
result = soup.find_all("h3", class_="text")
# Only an example.
Go to the website and look for the HTML element you need.
OK, to show the results
+ 1
row by row, you can use a list.
There are many solutions, I guess; this is mine at the moment.
list_results = []  # empty list
def show_results(x, y):
    x.extend(y)  # add every found element to the list
    for i in x:
        print(i)
show_results(list_results, result)
# Sorry about the function; I would rather not write it out again. We don't love typing ^^
Or you look for a specific word.
After parsing the site, we give Python another job: we search the first 50 characters for the string "DOCTYPE HTML".
import re
check_site = site.data[0:50].decode("utf-8", errors="replace")
# We must convert the bytes into the right data type (str).
pattern = r"<!DOCTYPE HTML"
if re.search(pattern, check_site):
    print("Site is HTML 5")
else:
    pattern2 = r"<!doctype html"
    if re.search(pattern2, check_site):
        print("Doctype HTML is labeled")
    else:
        print("Doctype HTML isn't labeled")
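The two patterns above differ only in letter case, so one case-insensitive pattern can cover both. Here is a small sketch of that, plus a simple word counter for the "specific word" case. The function names and the sample bytes are made up for illustration, and the counter looks at the raw markup, so words inside tags or attributes are counted too.

```python
import re

# One pattern for "<!DOCTYPE HTML", "<!doctype html" and mixed case.
DOCTYPE_RE = re.compile(r"<!DOCTYPE\s+html", re.IGNORECASE)


def has_html_doctype(data, window=50):
    """Return True if an HTML doctype appears in the first `window` bytes."""
    head = data[:window].decode("utf-8", errors="replace")
    return DOCTYPE_RE.search(head) is not None


def count_word(html, word):
    """Count case-insensitive whole-word matches in the raw markup."""
    return len(re.findall(rf"\b{re.escape(word)}\b", html, re.IGNORECASE))


# Made-up sample standing in for site.data.
sample = b"<!doctype html><html><body><p>Python scraper. python rocks.</p></body></html>"
print(has_html_doctype(sample))
print(count_word(sample.decode("utf-8"), "python"))
```

To count only visible text, run count_word on soup.get_text() instead of the raw HTML.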
Hope this helps a bit :)
+ 1
Correction to row 7:
it is wrong; the right version is:
for i in x:
Otherwise it raises an error :)
0
There are a lot of examples online, but I need an answer that is more practical than theoretical.
0
Use lxml; it is much faster than bs4. For large projects, use Scrapy with its powerful asynchronous crawling. Stack Overflow has tons of examples on this.