How to overcome distil_r_captcha when web scraping?
Hello, I am trying to scrape a website with Scrapy (code below), but I keep getting redirected to a captcha page. I get a [200] response, but as soon as I run the spider the crawl is stopped. Is my IP blacklisted?

This is the output I get:

    DEBUG: Redirecting (meta refresh) to <GET https://www.funda.nl/distil_r_captcha.html?requestId=ce4a7cb1-1070-43e5-9548-4fd02e71345d&httpReferrer=%2Fkoop%2Famsterdam%2F50000-2000000%2F%3Fzoek%3D1077XM> from <GET https://www.funda.nl/koop/amsterdam/50000-2000000/?zoek=1077XM>
    2019-11-03 02:28:55 [scrapy.core.engine] DEBUG: Crawled (405) <GET https://www.funda.nl/distil_r_captcha.html?requestId=ce4a7cb1-1070-43e5-9548-4fd02e71345d&httpReferrer=%2Fkoop%2Famsterdam%2F50000-2000000%2F%3Fzoek%3D1077XM> (referer: None)
    2019-11-03 02:28:55 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <405 https://www.funda.nl/distil_r_captcha.html?requestId=ce4a7cb1-1070-43e5-9548-4fd02e71345d&httpReferrer=%2Fkoop%2Famsterdam%2F50000-2000000%2F%3Fzoek%3D1077XM>: HTTP status code is not handled or not allowed

And this is my code:

    # -*- coding: utf-8 -*-
    from time import sleep  # needed for the sleep(5) call in parse()

    import scrapy
    from scrapy.crawler import CrawlerProcess
    import requests


    class HousesearchspiderSpider(scrapy.Spider):
        name = "housesearchspider"
        download_delay = 10.0  # wait 10 seconds between requests
        start_urls = [
            'https://www.funda.nl/koop/amsterdam/50000-2000000/?zoek=1077XM',
        ]

        def parse(self, response):
            # Extract one item per search result on the page
            for detail in response.css('div.search-result-content'):
                yield {
                    'price': detail.css('div.search-result-info search-result-info-price ::text').get(),
                    'size': detail.css('ul.search-result-kenmerken ::text').get(),
                    'postcode': detail.css('small.search-result-subtitle ::text').get(),
                    'street': detail.css('h2.search-result-title ::text').get(),
                }
            # Follow the pagination link, if there is one
            next_page = response.css('li.next a::attr(href)').get()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                sleep(5)
                yiel
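
For reading the log above: the first line shows the search page being redirected (via meta refresh) to `distil_r_captcha.html`, and the original path is carried along in the `httpReferrer` query parameter. The captcha page then returns 405, which Scrapy ignores by default, so `parse()` never runs. That points to the site's bot protection flagging the request. It does not by itself prove a permanent IP blacklist. Below is a minimal sketch, using only the standard library, of how such a redirect could be detected and the originally requested URL recovered. The helper names `is_captcha_url` and `original_url` and the `CAPTCHA_MARKER` constant are made up for illustration. The marker string is simply taken from the URL in the log and may differ on other sites.

```python
from urllib.parse import parse_qs, urljoin, urlsplit

# Assumption: the marker is copied from the redirect URL seen in the log.
CAPTCHA_MARKER = "distil_r_captcha"


def is_captcha_url(url):
    """Return True if the URL path looks like the captcha page from the log."""
    return CAPTCHA_MARKER in urlsplit(url).path


def original_url(captcha_url):
    """Rebuild the originally requested URL from the httpReferrer parameter.

    Returns None if the parameter is missing.
    """
    query = parse_qs(urlsplit(captcha_url).query)  # also percent-decodes values
    referrer = query.get("httpReferrer")
    if not referrer:
        return None
    # The referrer is a site-relative path, so resolve it against the captcha URL.
    return urljoin(captcha_url, referrer[0])
```

In a spider, one possible use (again an assumption, not something the question's code does) would be to set `handle_httpstatus_list = [405]` on the spider class so the callback actually receives the captcha response, and then check `is_captcha_url(response.url)` to log the block or stop the crawl instead of silently getting no items.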