+ 2

How to overcome HTTP error 405 in Python web scraping?

python3

9th Dec 2017, 6:52 PM

susmitha

18 answers

+ 7

Maybe try setting the User-Agent header so the request looks like it comes from a web browser. Can you post the code here?

10th Dec 2017, 8:13 AM

Kuba Siekierzyński

+ 5

Error 405 means "Method Not Allowed". You are probably using an HTTP method, while scraping, that is forbidden or disabled on the target server.
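To see which method the server objects to, it can help to look at the Allow header that servers are expected to send with a 405 response. The following is only a minimal sketch using the standard library; the names parse_allow_header and allowed_methods are made up for illustration, and the default of trying POST is an assumption, not something taken from the original code.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def parse_allow_header(value):
    """Split an Allow header such as 'GET, HEAD' into a list of method names."""
    if not value:
        return []
    return [m.strip().upper() for m in value.split(",") if m.strip()]


def allowed_methods(url, method="POST"):
    """Send one request; on a 405, return the methods listed in Allow.

    Returns None when the request did not fail with 405.
    """
    request = Request(url, method=method)
    try:
        with urlopen(request, timeout=10):
            return None
    except HTTPError as err:
        if err.code == 405:
            # The server names the methods it does accept for this URL.
            return parse_allow_header(err.headers.get("Allow"))
        raise
```

If allowed_methods returns something like ['GET', 'HEAD'], switching the scraper to one of those methods is the first thing to try.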

9th Dec 2017, 7:14 PM

Kuba Siekierzyński

+ 5

Perhaps some kind of authorization mechanism kicks in, like a captcha. Could you please post the code? It will be easier to see if something can be done about it.

10th Dec 2017, 1:02 PM

Kuba Siekierzyński

+ 4

Well, evidently your request was recognized as coming from a robot. Perhaps you would do better using urllib.request.Request rather than requests.get. I prepared a function like this:

    from urllib.request import Request, urlopen

    def scrape(link):
        # Send a browser-like User-Agent so the request is less likely to be flagged as a bot.
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"}
        try:
            # Decode the bytes; calling str() on them would return a "b'...'" literal.
            return urlopen(Request(link, headers=headers)).read().decode("utf-8", errors="replace")
        except Exception as e:
            print(str(e))

Check it out and see whether you still have the same problem.

10th Dec 2017, 2:41 PM

Kuba Siekierzyński

+ 4

Yes, but the 200 status is returned by the captcha page itself, which is why you can access it without a problem. Most likely the server recognizes your IP and serves you this page instead of the one you actually requested. I don't think there is anything you can do apart from changing the IP you access the page from, or using a proper web scraper like Scrapy, which handles such cases quite well.
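Since the status code alone cannot tell the two pages apart, one option is to inspect the HTML itself. The sketch below looks for the two markers visible in the output pasted elsewhere in this thread: the distil_ident_block div and a meta refresh pointing at a captcha URL. The names BlockPageDetector and looks_like_block_page are hypothetical, the standard library parser is used here instead of bs4 only to keep the example dependency-free, and other sites may use different markers.

```python
from html.parser import HTMLParser


class BlockPageDetector(HTMLParser):
    """Flags markers seen in the interstitial page quoted in this thread."""

    def __init__(self):
        super().__init__()
        self.blocked = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Marker 1: the placeholder div id from the pasted output.
        if attributes.get("id") == "distil_ident_block":
            self.blocked = True
        # Marker 2: a meta refresh that redirects to a captcha URL.
        http_equiv = (attributes.get("http-equiv") or "").lower()
        content = (attributes.get("content") or "").lower()
        if tag == "meta" and http_equiv == "refresh" and "captcha" in content:
            self.blocked = True


def looks_like_block_page(html):
    """Return True if the HTML looks like the captcha interstitial."""
    detector = BlockPageDetector()
    detector.feed(html)
    return detector.blocked
```

A scraper could call looks_like_block_page(req.text) right after the request and stop early instead of parsing an empty page.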

10th Dec 2017, 3:13 PM

Kuba Siekierzyński

+ 3

Hmm... maybe try using a proper User-Agent string, like:

    Mozilla/5.0 (Linux; U; Android 4.0.4; en-gb; GT-I9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30

Or see the module specially prepared for this: https://pypi.python.org/pypi/user-agents

10th Dec 2017, 1:36 PM

Kuba Siekierzyński

+ 3

Yes, Scrapy works fine with Python 3. You can combine it with bs4 for parsing: https://pypi.python.org/pypi/Scrapy

10th Dec 2017, 3:51 PM

Kuba Siekierzyński

+ 2

Okay then, if you keep getting the same captcha page every time, it means your IP has been temporarily blocked by the website. It happened to me too, while trying to scrape some Google News pages. You can't really do anything about it in the code: either use a proxy to change your IP, or wait a while until they unblock you. By the way, if you try to access this website with a regular browser (now!), does it work fine?
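The "wait a while" option can be sketched as a retry loop with growing pauses. This is only an illustration: fetch_until_unblocked is a made-up name, the delay values are arbitrary assumptions, and fetch and is_blocked are callables you would supply yourself (for example a function wrapping your request, and a check for the captcha page).

```python
import time


def fetch_until_unblocked(fetch, is_blocked, delays=(60, 300, 900), sleep=time.sleep):
    """Call fetch() until is_blocked(result) is False or the delays run out.

    Returns the first unblocked result, or None if every attempt was blocked.
    """
    result = fetch()
    for delay in delays:
        if not is_blocked(result):
            return result
        sleep(delay)  # back off before asking the server again
        result = fetch()
    return None if is_blocked(result) else result
```

Passing sleep as a parameter keeps the function easy to try out without actually waiting; in real use the default time.sleep applies.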

10th Dec 2017, 3:03 PM

Kuba Siekierzyński

+ 1

Then how do I write code to get that website's information?

10th Dec 2017, 3:11 AM

susmitha

By setting a user agent I can access that website, and the response status code is 200. But when I try to parse the page, it shows very little content and ends with distil_ident_block, and I can't get the information I want. How can I get it? Can you help with that?

10th Dec 2017, 12:22 PM

susmitha

    import bs4
    import requests
    from bs4 import BeautifulSoup as soup

    url = "http://www.realtor.ca"
    headers = {"user_agent": "mozilla"}
    req = requests.get(url, headers=headers)
    page = soup(req.text, "html.parser")

When I try to parse it like this:

    page.contents

it shows only a small amount of data, but when I look at the page's source code it shows a large amount of code.
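One detail worth checking in the snippet above: the HTTP header is named User-Agent, with a hyphen. A key spelled user_agent is sent as a separate, unrelated header, so the library's default User-Agent would still go out. The sketch below illustrates the difference with urllib's Request object, chosen only because it lets the headers be inspected without sending anything; sets_user_agent is a hypothetical helper and example.com is a placeholder URL.

```python
from urllib.request import Request

BROWSER_UA = "Mozilla/5.0 (Windows NT 6.1; rv:57.0) Gecko/20100101 Firefox/57.0"


def sets_user_agent(headers):
    """Return True if the given header dict actually sets User-Agent."""
    # Placeholder URL; building a Request does not contact the server.
    request = Request("http://example.com/", headers=headers)
    # urllib stores header names capitalized, e.g. "User-agent".
    return request.has_header("User-agent")


print(sets_user_agent({"user_agent": BROWSER_UA}))  # False: underscore is a different header
print(sets_user_agent({"User-Agent": BROWSER_UA}))  # True: hyphen is the real header
```

Header names are case-insensitive, so "user-agent" works as well; only the underscore is the problem.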

10th Dec 2017, 1:09 PM

susmitha

This is my code:

    import requests
    import bs4
    from urllib.request import urlopen as ureq
    from bs4 import BeautifulSoup as soup

    my_url = ("https://www.realtor.ca/Residential/Map.aspx#CultureId=1&ApplicationId=1&RecordsPerPage=9&MaximumResults=9&PropertySearchTypeId=1&TransactionTypeId=2&StoreyRange=0-0&BedRange=0-0&BathRange=0-0&LongitudeMin=-75.74395179748535&LongitudeMax=-75.66129684448242&LatitudeMin=45.403331468019175&LatitudeMax=45.43664692675851&SortOrder=A&SortBy=1&viewState=m&Longitude=-75.7026243209839&Latitude=45.4199916546208&ZoomLevel=14&PropertyTypeGroupID=1")
    headers = {"user-agent": "Mozilla/5.0 (Windows NT 6.1; rv:57.0) Gecko/20100101 Firefox/57.0"}
    req = requests.get(my_url, headers=headers)
    page_soup = soup(req.text, "html.parser")

When I run it, it shows this:

    >>> req
    <Response [200]>
    >>> page_soup.contents
    ['html', '\n', <html>
    <head>
    <meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
    <meta content="max-age=0" http-equiv="cache-control">
    <meta content="no-cache" http-equiv="cache-control"/>
    <meta content="0" http-equiv="expires"/>
    <meta content="Tue, 01 Jan 1980 1:00:00 GMT" http-equiv="expires"/>
    <meta content="no-cache" http-equiv="pragma"/>
    <meta content="10; url=/distil_r_captcha.html?Ref=/Residential/Map.aspx&distil_RID=0F54A1A0-DDB6-11E7-A7E7-8244BDED3E10&distil_TID=20171210142606" http-equiv="refresh"/>
    <script type="text/javascript">
    (function(window){
        try {
            if (typeof sessionStorage !== 'undefined'){
                sessionStorage.setItem('distil_referrer', document.referrer);
            }
        } catch (e){}
    })(window);
    </script>
    <script defer="" src="/cndnrlsttdstl.js" type="text/javascript"></script><style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#xwdaqqdutzds{display:none!important}</style></meta></head>
    <body>
    <div id="distil_ident_block"> </div>
    </body>
    </html>, '\n']
    >>>

10th Dec 2017, 2:27 PM

susmitha

It still shows the same output as above. Same problem.

10th Dec 2017, 2:52 PM

susmitha

Please help me solve it. I want the list of properties from that website. When I try it in WebHarvy it gives a list of 9, but I want the full list, and I want to get it with code. How does WebHarvy work? Do you know the code behind WebHarvy?

10th Dec 2017, 2:57 PM

susmitha

I had already accessed that website, but when I try to parse it, the HTML above is what shows up.

10th Dec 2017, 3:08 PM

susmitha

Yes, I use a normal browser, and I checked that the response status code is 200.

10th Dec 2017, 3:09 PM

susmitha

Does Scrapy solve this issue?

10th Dec 2017, 3:18 PM

susmitha

Does Scrapy work with Python 3?