+ 1
How to create a web scraper in Python?
Just figuring it out
9 Answers
+ 4
import urllib.request
import bs4

def scrape():
    base_url = 'https://python.org'
    response = urllib.request.urlopen(base_url)
    html = response.read()
    print(html)  # raw source code
    bs_obj = bs4.BeautifulSoup(html, 'html.parser')
    # printing all links
    for link in bs_obj.find_all('a'):
        print(link.get('href'))
    # looking for specific traits of the content
    # printing paragraphs with the class of info
    for p in bs_obj.find_all('p', {'class': 'info'}):
        print(p.text)  # printing a specific set of paragraphs without tags
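To try the link-printing step without touching the network, it can be run on an HTML string. This is only a rough sketch: it uses the standard library's html.parser instead of bs4, and the names LinkCollector and extract_links as well as the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag (hypothetical helper, not part of bs4)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(html):
    """Return all href values found in the given HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# made-up sample page, standing in for the downloaded source
sample = '<p class="info">See <a href="/about">about</a> and <a href="https://python.org">Python</a>.</p>'
for link in extract_links(sample):
    print(link)
```

Once this works on a string, the string can be swapped for the decoded bytes returned by the download step.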
+ 3
My simple example of a web scraper:
import urllib.request
import bs4

def scrape():
    base_url = 'https://python.org'
    response = urllib.request.urlopen(base_url)
    html = response.read()
    print(html)  # raw source code
    bs_obj = bs4.BeautifulSoup(html, 'html.parser')
    # printing all links
    for link in bs_obj.find_all('a'):
        print(link.get('href'))
+ 1
Following this post. Very interesting, and I would like to understand how a web scraper works.
+ 1
Thanks for the answer. For more specific options, such as finding specific words inside one page, an entire site, or a group of sites, do you have any suggestions?
+ 1
David Toscano
Web-Scraper in Python
If you have basic knowledge of Python, you know pip.
With pip, you install modules that are not built in.
You need at least the following modules: urllib3 and bs4.
It is better to also build in a check for a certificate.
So you need certifi,
so pip install certifi too.
We import what we need:
import urllib3
import certifi
from bs4 import BeautifulSoup
What's going on?
Python now has to do what your browser usually does: it has to fetch HTML documents and read them.
Reading the document's structure is also called parsing.
We create an object:
http = urllib3.PoolManager(cert_reqs="CERT_REQUIRED", ca_certs=certifi.where())
Now we give Python the site to browse:
site = http.request("GET", "https://www.sololearn.com")
soup = BeautifulSoup(site.data, "html.parser")
We told Python to fetch the HTML document only if the server presents a valid certificate, and then to parse it.
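For comparison, the same "only talk to servers with a valid certificate" rule can be sketched with the standard library alone. Note that this swaps urllib3 and certifi for ssl and urllib.request, so it relies on the system's CA store rather than certifi's bundle. The function names make_verifying_context and fetch_verified are hypothetical.

```python
import ssl
import urllib.request


def make_verifying_context():
    # create_default_context() enables certificate and hostname checks by default
    return ssl.create_default_context()


def fetch_verified(url, timeout=10):
    """Return the raw bytes of url; an invalid certificate raises an error instead."""
    context = make_verifying_context()
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return response.read()


# inspect the settings without making any request
context = make_verifying_context()
print(context.verify_mode == ssl.CERT_REQUIRED)
print(context.check_hostname)
```

The bytes returned by fetch_verified could be handed to BeautifulSoup in the same way as site.data.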
Now you can do different things.
For example, you can look for a specific HTML element.
result = soup.find_all("h3", class_="text")
# only an example
Go to the website and look for the HTML element you want.
OK. To show the results
+ 1
row by row, you can use a list.
There are many solutions, I guess; this is mine at the moment.
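list_results = []  # empty list
def show_results(list_results, x):
    for i in x:
        # the loop body is missing from the post; assumed here: store and print each element's text
        list_results.append(i.text)
        print(i.text)
show_results(list_results, result)
# Sorry, I would rather not write the function out again. We don't love typing ^^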
Or you look for a specific word.
After parsing the site, we give Python another job: we search the first 50 characters for the string "DOCTYPE HTML".
import re

check_site = site.data[0:50].decode("utf-8", errors="replace")
# we must convert to the right data type (bytes to str)
pattern = r"<!DOCTYPE HTML"
pattern2 = r"<!doctype html"
if re.search(pattern, check_site):
    print("Site is HTML 5")
elif re.search(pattern2, check_site):
    print("Doctype HTML is labeled")
else:
    print("Doctype HTML isn't labeled")
Hope this helps a bit :)
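For the earlier question about finding specific words in a page, here is a rough sketch. It assumes the page's HTML is already in a string (for example, decoded from site.data). The helper names TextExtractor and count_words are hypothetical, and it uses the standard library's html.parser rather than bs4.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Gathers visible text, skipping <script> and <style> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # keep text only when we are not inside a script or style element
        if self._skip_depth == 0:
            self.chunks.append(data)


def count_words(html, words):
    """Return {word: occurrences} for whole-word, case-insensitive matches."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(extractor.chunks)
    counts = {}
    for word in words:
        pattern = r"\b" + re.escape(word) + r"\b"
        counts[word] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


# made-up sample page
page = "<html><body><h3 class='text'>Python scraper</h3><p>Learn python here.</p><script>var python = 1;</script></body></html>"
print(count_words(page, ["python", "scraper", "java"]))
```

For an entire site or a group of sites, the same function could be called on each fetched page in a loop.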
+ 1
Row 7 is wrong. The right line is:
for i in x:
Otherwise it raises an error :)
There are a lot of examples online, but I need an answer that is more practical than theoretical.
Use lxml; it is much faster than bs4. For large projects use Scrapy, which handles many requests concurrently through its asynchronous engine. Stack Overflow has tons of examples on this.