+ 2

How can I overcome HTTP error 405 when web scraping in Python?

9th Dec 2017, 6:52 PM
susmitha
18 Answers
+ 7
Maybe try sending a User-Agent header so that the request looks like it comes from a web browser. Can you post the code here?
10th Dec 2017, 8:13 AM
Kuba Siekierzyński
+ 5
Error 405 means "Method Not Allowed". You are probably using an HTTP method, while scraping, that is forbidden or disabled on the target server.
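To make this concrete: a 405 response normally comes with an Allow header listing the methods the server accepts for that URL. The sketch below uses only the standard library to parse such a header value and check whether a method is permitted. The function names and the sample header value are made up for illustration.

```python
def parse_allow_header(value):
    """Split an Allow header value such as 'GET, HEAD' into a set of method names."""
    return {part.strip().upper() for part in value.split(",") if part.strip()}


def is_method_allowed(method, allow_value):
    """Return True if `method` is listed in the Allow header value."""
    return method.upper() in parse_allow_header(allow_value)


# Hypothetical header value, as a server might send it with a 405 response.
sample_allow = "GET, HEAD"
print(is_method_allowed("POST", sample_allow))  # False
print(is_method_allowed("get", sample_allow))   # True
```

With urllib, the headers of a failed request should be reachable through the headers attribute of the raised HTTPError, so the Allow value can be read from there and passed to the helper.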
9th Dec 2017, 7:14 PM
Kuba Siekierzyński
+ 5
Perhaps some kind of authorization mechanism kicks in, like a captcha. Could you please post the code? It will be easier to see if something can be done about it...
10th Dec 2017, 1:02 PM
Kuba Siekierzyński
+ 4
Well, evidently your request was recognized as coming from a robot. Perhaps you could do better using urllib.request.Request rather than requests.get. I prepared a function like this:

    from urllib.request import Request, urlopen

    def scrape(link):
        try:
            headers = {}
            headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
            return str(urlopen(Request(link, headers = headers)).read())
        except Exception as e:
            print(str(e))

Check it out and see if you have the same problem or not.
10th Dec 2017, 2:41 PM
Kuba Siekierzyński
+ 4
Yes, but the 200 status is already returned by the captcha page, which is why you can access it without any problem. Most likely the server recognizes your IP and serves you this page instead of the one you wanted to access in the first place. I don't think there is anything you can do apart from changing the IP you access the page from, or using a regular web scraper like Scrapy, which handles such cases quite well.
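Since the status code alone does not tell the two pages apart, one option is to inspect the HTML itself. The sketch below uses only the standard library and looks for the two markers visible in the response posted in this thread: a meta refresh pointing at a captcha URL and the empty distil_ident_block div. The class and function names are made up for illustration, and other sites will use different markers.

```python
from html.parser import HTMLParser


class BlockPageDetector(HTMLParser):
    """Flags pages that look like the challenge page shown in this thread."""

    def __init__(self):
        super().__init__()
        self.blocked = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # A <meta http-equiv="refresh"> that redirects to a captcha URL.
        if tag == "meta" and (attrs.get("http-equiv") or "").lower() == "refresh":
            if "captcha" in (attrs.get("content") or "").lower():
                self.blocked = True
        # The placeholder <div id="distil_ident_block">.
        if tag == "div" and attrs.get("id") == "distil_ident_block":
            self.blocked = True


def looks_blocked(html):
    """Return True if the HTML resembles a bot-challenge page rather than real content."""
    detector = BlockPageDetector()
    detector.feed(html)
    return detector.blocked


sample = '<html><body><div id="distil_ident_block"> </div></body></html>'
print(looks_blocked(sample))                          # True
print(looks_blocked("<html><body>ok</body></html>"))  # False
```

A scraper could call looks_blocked on the response text before parsing and back off or switch IP when it returns True.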
10th Dec 2017, 3:13 PM
Kuba Siekierzyński
+ 3
Hmm... maybe try using a proper User-Agent string, like: Mozilla/5.0 (Linux; U; Android 4.0.4; en-gb; GT-I9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30. Or see the module specially prepared for this: https://pypi.python.org/pypi/user-agents
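One detail worth checking when setting this header: its name is spelled with a hyphen, User-Agent. A key such as user_agent is sent as a different, unrelated header, so the server would still see the library's default agent. A minimal sketch with the standard library follows; the helper name build_request is made up for illustration.

```python
from urllib.request import Request

USER_AGENT = (
    "Mozilla/5.0 (Linux; U; Android 4.0.4; en-gb; GT-I9300 Build/IMM76D) "
    "AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30"
)


def build_request(url, user_agent=USER_AGENT):
    """Create a Request carrying a browser-like User-Agent header."""
    # The header name uses a hyphen: "User-Agent", not "user_agent".
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("http://www.realtor.ca")
# urllib stores header names capitalized, so the lookup key is "User-agent".
print(req.get_header("User-agent"))
```

The request object can then be passed to urlopen as usual; nothing is sent over the network until that call.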
10th Dec 2017, 1:36 PM
Kuba Siekierzyński
+ 3
Yes, Scrapy works fine with Python 3. You can combine it with bs4 for parsing: https://pypi.python.org/pypi/Scrapy
10th Dec 2017, 3:51 PM
Kuba Siekierzyński
+ 2
Okay then, if you keep getting the same captcha page every time, it means your IP has been temporarily blocked by the website. It happened to me too, while trying to scrape some Google News pages. You can't really do anything about it in the code: either use a proxy to change your IP, or wait a while until they unblock you. By the way, if you try to access this website with a regular browser (right now), does it work fine?
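For the proxy route, here is a minimal sketch using the standard library's ProxyHandler. The proxy address is a placeholder, not a working proxy, and the helper name make_proxy_opener is made up for illustration.

```python
from urllib.request import ProxyHandler, build_opener

# Hypothetical proxy address; replace it with a proxy you are allowed to use.
PROXY_URL = "http://127.0.0.1:8080"
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; rv:57.0) Gecko/20100101 Firefox/57.0"


def make_proxy_opener(proxy_url=PROXY_URL, user_agent=USER_AGENT):
    """Build an opener that sends HTTP and HTTPS requests through `proxy_url`."""
    handler = ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = build_opener(handler)
    # Replace the default "Python-urllib/x.y" User-Agent.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


opener = make_proxy_opener()
# opener.open("https://www.realtor.ca") would now go through the proxy.
# It is left commented out because the proxy above is only a placeholder.
```

Whether this helps depends on the proxy's own IP not being blocked as well.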
10th Dec 2017, 3:03 PM
Kuba Siekierzyński
+ 1
Then how do I write code to get that website's information?
10th Dec 2017, 3:11 AM
susmitha
0
By using a user agent I can access that website, and the response status code is 200. But when I try to parse it, it shows very little content and ends with distil_ident_block, and I can't get the information I want. How can I get it? Can you help with that?
10th Dec 2017, 12:22 PM
susmitha
0
    import bs4
    import requests
    from bs4 import BeautifulSoup as soup

    url="http://www.realtor.ca"
    headers={"user_agent":"mozilla"}
    req=requests.get(url, headers=headers)
    page=soup(req.text, "html.parser")

When I try to inspect the parsed page with page.contents, it shows only a small amount of data. When I view the page source in the browser, it shows a large amount of code.
10th Dec 2017, 1:09 PM
susmitha
0
This is my code:

    import requests
    import bs4
    from urllib.request import urlopen as ureq
    from bs4 import BeautifulSoup as soup

    my_url=("https://www.realtor.ca/Residential/Map.aspx#CultureId=1&ApplicationId=1&RecordsPerPage=9&MaximumResults=9&PropertySearchTypeId=1&TransactionTypeId=2&StoreyRange=0-0&BedRange=0-0&BathRange=0-0&LongitudeMin=-75.74395179748535&LongitudeMax=-75.66129684448242&LatitudeMin=45.403331468019175&LatitudeMax=45.43664692675851&SortOrder=A&SortBy=1&viewState=m&Longitude=-75.7026243209839&Latitude=45.4199916546208&ZoomLevel=14&PropertyTypeGroupID=1")
    headers={"user-agent":"Mozilla/5.0 (Windows NT 6.1; rv:57.0) Gecko/20100101 Firefox/57.0"}
    req=requests.get(my_url,headers=headers)
    page_soup=soup(req.text,"html.parser")

When I run it, it shows this:

    >>> req
    <Response [200]>
    >>> page_soup.contents
    ['html', '\n', <html>
    <head>
    <meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
    <meta content="max-age=0" http-equiv="cache-control">
    <meta content="no-cache" http-equiv="cache-control"/>
    <meta content="0" http-equiv="expires"/>
    <meta content="Tue, 01 Jan 1980 1:00:00 GMT" http-equiv="expires"/>
    <meta content="no-cache" http-equiv="pragma"/>
    <meta content="10; url=/distil_r_captcha.html?Ref=/Residential/Map.aspx&amp;distil_RID=0F54A1A0-DDB6-11E7-A7E7-8244BDED3E10&amp;distil_TID=20171210142606" http-equiv="refresh"/>
    <script type="text/javascript"> (function(window){ try { if (typeof sessionStorage !== 'undefined'){ sessionStorage.setItem('distil_referrer', document.referrer); } } catch (e){} })(window); </script>
    <script defer="" src="/cndnrlsttdstl.js" type="text/javascript"></script><style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#xwdaqqdutzds{display:none!important}</style></meta></head>
    <body>
    <div id="distil_ident_block"> </div>
    </body>
    </html>, '\n']
    >>>
10th Dec 2017, 2:27 PM
susmitha
0
It still shows the same output as above. Same problem.
10th Dec 2017, 2:52 PM
susmitha
0
Please help me solve it. I want the list of properties from that website. When I try it in WebHarvy it gives a list of 9, but I want the full list, and I want to get it with code. How does WebHarvy work? Do you know the code behind WebHarvy?
10th Dec 2017, 2:57 PM
susmitha
0
I can already access that website, but when I try to parse the response, I get the HTML shown above.
10th Dec 2017, 3:08 PM
susmitha
0
Yes, I used a normal browser and checked that the response status code is 200.
10th Dec 2017, 3:09 PM
susmitha
0
Does Scrapy solve this issue?
10th Dec 2017, 3:18 PM
susmitha
0
Does Scrapy work with Python 3?
10th Dec 2017, 3:19 PM
susmitha