+ 2
How to overcome HTTP error 405 in Python web scraping?
18 answers
+ 7
Maybe try setting the User-Agent header to fake a web browser. Can you post the code here?
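For example, a minimal sketch with requests; the User-Agent string below is just one example of a browser-like value, not anything special:

```python
import requests  # third-party; pip install requests

# Any current browser User-Agent string will do here; this is one example.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/41.0.2228.0 Safari/537.36"
    )
}

def fetch(url):
    """Fetch a page while identifying as a regular browser."""
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()  # raises on 4xx/5xx, including 405
    return resp.text
```

Then call `fetch("http://example.com")` and see whether the 405 goes away.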
+ 5
Error 405 means "Method Not Allowed". You are probably using an HTTP method, while scraping, that is forbidden or disabled on the target server.
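To see which methods the server actually permits, you can inspect the Allow header; a sketch with the standard library (the server is not guaranteed to send this header, so treat an empty result as "unknown"):

```python
import urllib.request
import urllib.error

def parse_allow(header_value):
    """Split an Allow header like 'GET, HEAD, OPTIONS' into a list."""
    return [m.strip() for m in header_value.split(",") if m.strip()]

def allowed_methods(url):
    """Ask the server which methods it permits via OPTIONS;
    a 405 error response usually carries an Allow header too."""
    req = urllib.request.Request(url, method="OPTIONS")
    try:
        with urllib.request.urlopen(req) as resp:
            return parse_allow(resp.headers.get("Allow", ""))
    except urllib.error.HTTPError as e:
        return parse_allow(e.headers.get("Allow", ""))
```

If GET is missing from the list, that explains the 405 for a plain `requests.get`.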
+ 5
Perhaps some kind of authorization mechanism kicks in, like a captcha or something. Could you please post the code? It will be easier to see if something can be done about it...
+ 4
Well, evidently your request was recognized as coming from a robot. Perhaps you would do better using urllib.request.Request rather than requests.get. I prepared a function like this:
from urllib.request import Request, urlopen

def scrape(link):
    try:
        # Pretend to be a desktop Chrome browser
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
        }
        return str(urlopen(Request(link, headers=headers)).read())
    except Exception as e:
        print(str(e))
Check it out and see if you have the same problem or not.
+ 4
Yes, but the 200 status is returned by the captcha page itself, which is why you can access it with no problem.
Most likely the server recognizes your IP and serves you this page instead of the one you wanted in the first place.
I don't think there is anything you can do apart from changing the IP you access the page from, or using a full scraping framework like Scrapy, which handles such cases quite well.
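For the changing-IP route, a minimal sketch with requests; the proxy address below is a placeholder, not a real endpoint, so substitute one you actually control:

```python
import requests  # third-party; pip install requests

# Placeholder address (TEST-NET range); substitute a proxy you control.
PROXIES = {
    "http": "http://203.0.113.10:8080",
    "https": "http://203.0.113.10:8080",
}

def fetch_via_proxy(url, proxies=PROXIES):
    """Route the request through a proxy so the target server
    sees the proxy's IP instead of yours."""
    resp = requests.get(url, proxies=proxies, timeout=10)
    return resp.status_code, resp.text
```

With a working proxy you should get the real page instead of the captcha one.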
+ 3
Hmm... maybe try using a proper User-Agent string, like:
Mozilla/5.0 (Linux; U; Android 4.0.4; en-gb; GT-I9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
Or see the module specially prepared for this:
https://pypi.python.org/pypi/user-agents
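If you don't want to pull in an extra module, a minimal sketch of rotating through a small hand-picked pool of User-Agent strings; the pool below is just a few examples, extend it as needed:

```python
import random
from urllib.request import Request, urlopen

# Small pool of real browser User-Agent strings to rotate between.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
    "Mozilla/5.0 (Linux; U; Android 4.0.4; en-gb; GT-I9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30",
    "Mozilla/5.0 (Windows NT 6.1; rv:57.0) Gecko/20100101 Firefox/57.0",
]

def random_headers():
    """Pick a fresh User-Agent for each request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def fetch(link):
    req = Request(link, headers=random_headers())
    return urlopen(req).read().decode("utf-8", errors="replace")
```

Rotating the string per request makes you look slightly less like a single scripted client.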
+ 3
Yes, Scrapy works with Python 3 just fine. You can combine it with bs4 for parsing:
https://pypi.python.org/pypi/Scrapy
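Scrapy gives you the raw response body in response.text; a minimal sketch of handing that to bs4, using an inline HTML snippet to stand in for a real response (the class names here are made up for illustration):

```python
from bs4 import BeautifulSoup  # third-party; pip install beautifulsoup4

# Inline HTML standing in for the body of a Scrapy response (response.text).
html = """
<html><body>
  <div class="listing"><span class="price">$499,000</span></div>
  <div class="listing"><span class="price">$625,000</span></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
prices = [span.get_text() for span in soup.select("span.price")]
print(prices)  # -> ['$499,000', '$625,000']
```

Inside a Scrapy spider you would do the same thing in the `parse` callback with `BeautifulSoup(response.text, "html.parser")`.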
+ 2
Okay then, if you keep getting the same captcha page every time, it means your IP has been temporarily blocked by the website. It happened to me too, while trying to scrape some Google News pages. You can't really do anything about it in code: either use a proxy to change your IP or wait a while until they unblock you again.
By the way, if you try to access this website with a regular browser (now!), does it work fine?
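As a rough heuristic, you can at least detect the block page in code before trying to parse it; the marker strings below are assumptions based on the Distil block pages typically served, so adjust them to what you actually see:

```python
def looks_blocked(html_text):
    """Heuristic check for a Distil Networks block/captcha page.
    The markers are assumed from typical Distil pages, not an official API."""
    markers = ("distil_ident_block", "distil_r_captcha")
    return any(m in html_text for m in markers)
```

If this returns True, parsing further is pointless; back off, switch IP, or retry later.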
+ 1
Then how do I write code to get that website's information?
0
By using a user agent I got access to that website; the response status code is 200. But when I try to parse it, only a little content shows and it ends with distil_ident_block, so I can't get the information I want. How do I get it? Can you help with that?
0
import requests
from bs4 import BeautifulSoup as soup

url = "http://www.realtor.ca"
headers = {"User-Agent": "Mozilla"}
req = requests.get(url, headers=headers)
page = soup(req.text, "html.parser")
When I try to parse like this:
page.contents
it shows only a small amount of data, but when I view the page source in a browser it shows a large amount of code.
0
This is my code:
import requests
from bs4 import BeautifulSoup as soup

my_url = ("https://www.realtor.ca/Residential/Map.aspx#CultureId=1&ApplicationId=1&RecordsPerPage=9&MaximumResults=9&PropertySearchTypeId=1&TransactionTypeId=2&StoreyRange=0-0&BedRange=0-0&BathRange=0-0&LongitudeMin=-75.74395179748535&LongitudeMax=-75.66129684448242&LatitudeMin=45.403331468019175&LatitudeMax=45.43664692675851&SortOrder=A&SortBy=1&viewState=m&Longitude=-75.7026243209839&Latitude=45.4199916546208&ZoomLevel=14&PropertyTypeGroupID=1")
headers = {"user-agent": "Mozilla/5.0 (Windows NT 6.1; rv:57.0) Gecko/20100101 Firefox/57.0"}
req = requests.get(my_url, headers=headers)
page_soup = soup(req.text, "html.parser")
and when I run it, it shows this:
>> req
<Response [200]>
>>> page_soup.contents
['html', '\n', <html>
<head>
<meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
<meta content="max-age=0" http-equiv="cache-control">
<meta content="no-cache" http-equiv="cache-control"/>
<meta content="0" http-equiv="expires"/>
<meta content="Tue, 01 Jan 1980 1:00:00 GMT" http-equiv="expires"/>
<meta content="no-cache" http-equiv="pragma"/>
<meta content="10; url=/distil_r_captcha.html?Ref=/Residential/Map.aspx&distil_RID=0F54A1A0-DDB6-11E7-A7E7-8244BDED3E10&distil_TID=20171210142606" http-equiv="refresh"/>
<script type="text/javascript">
(function(window){
try {
if (typeof sessionStorage !== 'undefined'){
sessionStorage.setItem('distil_referrer', document.referrer);
}
} catch (e){}
})(window);
</script>
<script defer="" src="/cndnrlsttdstl.js" type="text/javascript"></script><style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#xwdaqqdutzds{display:none!important}</style></meta></head>
<body>
<div id="distil_ident_block">&nbsp;</div>
</body>
</html>, '\n']
>>>
0
It still shows the same thing as above... same problem.
0
Please help solve it... I want the list of properties from that website.
When I try it in WebHarvy it gives a list of 9, but I want the full list, and I want it in code.
How does WebHarvy work? Do you know the code behind WebHarvy?
0
I had already got access to that website, but when I try to parse it, the HTML shown above is all I get.
0
Yes, I use a normal browser, and I checked the response status code: it is 200.
0
Does Scrapy solve this issue?
0
Does Scrapy work in Python 3?