
Editing a Web Scraper to Stop

Hello, I found this code online, but it takes a really long time to run on even one website because it seems to check all of the site's pages. I was wondering if someone could help me change it so that it stops once it has collected X emails from a website and then moves on to the next website (e.g., collect 5 emails from TheMoscowTimes, then collect 5 emails from IMadeThisUp).

#Code Starts#

from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
import re

# a queue of urls to be crawled
new_urls = deque(['http://www.themoscowtimes.com/contact_us/index.php'])

# a set of urls that we have already crawled
processed_urls = set()

# a set of crawled emails
emails = set()

# process urls one by one until we exhaust the queue
while len(new_urls):
    # move next url from the queue to the set of processed urls
    url = new_urls.popleft()
    processed_urls.add(url)

    # extract base url to resolve relative links
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/') + 1] if '/' in parts.path else url

    # get url's content
    print("Processing %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # ignore pages with errors
        continue

    # extract all email addresses and add them into the resulting set
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
    emails.update(new_emails)

    # create a beautiful soup for the html document
    soup = BeautifulSoup(response.text, "html.parser")

    # find and process all the anchors in the document
    for anchor in soup.find_all("a"):
        # extract link url from the anchor
        link = anchor.attrs["href"] if "href" in anchor.attrs else ''
        # resolve relative links
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link
        # add the new url to the queue if it was neither enqueued nor processed yet
        if link not in new_urls and link not in processed_urls:
            new_urls.append(link)

27th Jan 2018, 2:54 PM
Adrian Tippit
2 Answers
Maybe a nested loop?

for target in targetlist:    # targetlist: the list of sites to crawl
    emailsfound = 0          # reset the counter for each new target
    while emailsfound < 5:
        # crawl here, incrementing emailsfound for each email collected
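
Expanding on that suggestion, below is a minimal sketch of one way to apply the nested-loop idea to the code from the question. It wraps the original crawl in a per-site helper that stops once it has collected max_emails addresses, then moves on to the next seed URL. The helper name crawl_site, the 10-second timeout, and the second seed URL are my own assumptions for illustration, not part of the original code.

from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
import re

# regex for matching email addresses (same pattern as the original code)
EMAIL_RE = re.compile(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", re.I)

def crawl_site(start_url, max_emails=5):
    # crawl one site, stopping as soon as max_emails addresses are found
    new_urls = deque([start_url])
    processed_urls = set()
    emails = set()
    while new_urls and len(emails) < max_emails:  # outer stop condition
        url = new_urls.popleft()
        processed_urls.add(url)
        parts = urlsplit(url)
        base_url = "{0.scheme}://{0.netloc}".format(parts)
        path = url[:url.rfind('/') + 1] if '/' in parts.path else url
        print("Processing %s" % url)
        try:
            # the timeout is an assumption so one slow page cannot stall the crawl
            response = requests.get(url, timeout=10)
        except requests.exceptions.RequestException:
            continue  # skip pages with errors
        emails.update(EMAIL_RE.findall(response.text))
        if len(emails) >= max_emails:
            break  # quota reached: stop before queueing more pages
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a"):
            link = anchor.attrs.get("href", "")
            # resolve relative links the same way the original code does
            if link.startswith('/'):
                link = base_url + link
            elif not link.startswith('http'):
                link = path + link
            if link not in new_urls and link not in processed_urls:
                new_urls.append(link)
    return emails

# collect up to 5 emails per site, then move on to the next one
# (the second URL is a placeholder, not a real target)
for start in ['http://www.themoscowtimes.com/contact_us/index.php',
              'http://www.example.com/contact']:
    print(start, "->", crawl_site(start, max_emails=5))

Checking the quota both in the while condition and right after the regex pass means the crawler stops fetching as soon as the fifth address turns up, instead of finishing whatever pages were already queued.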
4th Feb 2018, 3:17 PM
Markus Kaleton
Thank you, Markus, I'll try to give this a shot. Sincerely, Adrian
6th Feb 2018, 5:01 AM
Adrian Tippit