How to overcome distil_r_captcha when web scraping?

Hello, I am trying to scrape a website with Scrapy (code below) but I keep getting redirected to a captcha page. I get a [200] response, but when I run the scraping I get stopped. Is my IP blacklisted?

This is the output I get:

    DEBUG: Redirecting (meta refresh) to <GET https://www.funda.nl/distil_r_captcha.html?requestId=ce4a7cb1-1070-43e5-9548-4fd02e71345d&httpReferrer=%2Fkoop%2Famsterdam%2F50000-2000000%2F%3Fzoek%3D1077XM> from <GET https://www.funda.nl/koop/amsterdam/50000-2000000/?zoek=1077XM>
    2019-11-03 02:28:55 [scrapy.core.engine] DEBUG: Crawled (405) <GET https://www.funda.nl/distil_r_captcha.html?requestId=ce4a7cb1-1070-43e5-9548-4fd02e71345d&httpReferrer=%2Fkoop%2Famsterdam%2F50000-2000000%2F%3Fzoek%3D1077XM> (referer: None)
    2019-11-03 02:28:55 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <405 https://www.funda.nl/distil_r_captcha.html?requestId=ce4a7cb1-1070-43e5-9548-4fd02e71345d&httpReferrer=%2Fkoop%2Famsterdam%2F50000-2000000%2F%3Fzoek%3D1077XM>: HTTP status code is not handled or not allowed

And this is my code:

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.crawler import CrawlerProcess
    import requests


    class HousesearchspiderSpider(scrapy.Spider):
        name = "housesearchspider"
        download_delay = 10.0
        start_urls = [
            'https://www.funda.nl/koop/amsterdam/50000-2000000/?zoek=1077XM',
        ]

        def parse(self, response):
            for detail in response.css('div.search-result-content'):
                yield {'price': detail.css('div.search-result-info search-result-info-price ::text').get(),
                       'size': detail.css('ul.search-result-kenmerken ::text').get(),
                       'postcode': detail.css('small.search-result-subtitle ::text').get(),
                       'street': detail.css('h2.search-result-title ::text').get(),
                       }
            next_page = response.css('li.next a::attr(href)').get()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                sleep(5)
                yiel

python webscraping proxy

3rd Nov 2019, 1:29 AM

Brunna Villar

2 Answers

+ 2

Well, you are using a bot to scrape a website and their bot detection triggered, so everything is working as intended, I'd say. Your IP is probably not blacklisted, but nobody other than Google knows for sure how reCAPTCHA works. They track everything you do on the website, and then some algorithm figures out whether a human would use a browser like that. I've gotten around this before by scraping with headless Chrome and filling out the captchas by hand when they come up, before the program takes over. But that's obviously not ideal. There's no point in fighting reCAPTCHA, I think. If the website doesn't have an API where you can get the data in a nice way, then there's not much you can do. I hear some bots can solve reCAPTCHAs, but I don't know much about that. You can also hire people in India to solve captchas for you (no joke), but I'm not sure I'd want to support a company that makes people solve captchas all day long.
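For anyone who wants to try the "solve it by hand, then let the program take over" approach, here is a minimal sketch. It assumes Selenium with Chrome is installed, and it opens a visible window rather than a headless one, since a person has to be able to see the captcha to solve it. The names `is_captcha_url`, `wait_for_human` and `fetch_page` are made up for illustration; the captcha path is the one from the Scrapy log in the question. Whether the site lets the session continue after the captcha is solved is not something this sketch can guarantee.

```python
import time
from urllib.parse import urlparse

# Path of the captcha page, taken from the Scrapy log in the question.
CAPTCHA_PATH = "/distil_r_captcha.html"


def is_captcha_url(url: str) -> bool:
    """Return True if the URL points at the captcha page."""
    return urlparse(url).path == CAPTCHA_PATH


def wait_for_human(driver, poll_seconds: float = 2.0, timeout: float = 300.0) -> bool:
    """Block while the browser sits on the captcha page.

    `driver` is anything exposing a `current_url` attribute, for example a
    Selenium WebDriver. Returns True once the browser has left the captcha
    page, or False if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while is_captcha_url(driver.current_url):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
    return True


def fetch_page(url: str) -> str:
    """Open `url` in a visible Chrome window and return the page HTML."""
    # Imported lazily so the helpers above work without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # visible window so a person can solve the captcha
    try:
        driver.get(url)
        if not wait_for_human(driver):
            raise TimeoutError("captcha was not solved in time")
        return driver.page_source
    finally:
        driver.quit()
```

Calling `fetch_page('https://www.funda.nl/koop/amsterdam/50000-2000000/?zoek=1077XM')` would open the browser, pause while the captcha page is showing, and return the HTML once you have been sent on to another page. The returned HTML could then be parsed with the same CSS selectors as in the spider.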

3rd Nov 2019, 3:06 AM

Schindlabua

I came across this thread from a Google search, and I am seeing the same 'distil_r_captcha.html' issue. As luck would have it, I am also trying to crawl funda.nl (the things we do to beat the AMS housing market, am I right?). I haven't found a workaround either, and wanted to see whether you have made any progress since you posted your question. Funda probably blocks bots on its /koop/ URLs: its robots.txt disallows bots on several paths, with /koop/ being one of them. So I'm starting to think that it may be impossible to use scraping bots on Funda.
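If you want to check a URL against robots.txt rules yourself, Python's standard library can do it. This is only a sketch: `SAMPLE_ROBOTS_TXT` is a made-up excerpt that mirrors the /koop/ rule described above, not a copy of Funda's actual file, and `build_parser` and `is_allowed` are hypothetical helper names. The user agent is just the spider name from the question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical excerpt, NOT copied from funda.nl. It only mirrors the claim
# above that /koop/ is disallowed for bots.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /koop/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str,
               user_agent: str = "housesearchspider") -> bool:
    """Return True if the rules allow `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
search_url = "https://www.funda.nl/koop/amsterdam/50000-2000000/?zoek=1077XM"
print(is_allowed(parser, search_url))               # False
print(is_allowed(parser, "https://www.funda.nl/"))  # True
```

To test against the live file instead, call `set_url()` with the site's robots.txt address and then `read()` on a `RobotFileParser`. Scrapy can also apply these rules for you through its `ROBOTSTXT_OBEY` setting. Note that robots.txt only states what the site asks of bots; the captcha redirect itself comes from the bot detection, which is a separate mechanism.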

10th Nov 2019, 7:37 AM

Jangboo Lee