How can I get the URLs (a sitemap) from a website whose pages are served as JSON?
Could you point me toward writing a script for that? If possible, please share a script that fetches all the URLs from the JSON pages, or otherwise help me fetch all the intermediate links.
1 Answer
In Python, you should start from:
import json, urllib.parse, urllib.request
absolute_url = '(enter the proper json link here)'
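params = {} # query parameters for the endpoint, if it needs any
url = absolute_url + ('?' + urllib.parse.urlencode(params) if params else '') # urlencode() needs the parameters to encode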
Fetch the data:
data = urllib.request.urlopen(url).read().decode() # read() returns bytes; decode() turns them into a str (UTF-8 by default)
Then parse the JSON content:
content = json.loads(data)
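The steps so far can be wrapped into two small functions. This is only a sketch: the names build_url and fetch_json and the 10-second timeout are my own choices for illustration, and it assumes the endpoint returns UTF-8 encoded JSON.

```python
import json
import urllib.parse
import urllib.request


def build_url(base_url, params=None):
    """Append URL-encoded query parameters to base_url (hypothetical helper)."""
    if not params:
        return base_url
    return base_url + "?" + urllib.parse.urlencode(params)


def fetch_json(base_url, params=None, timeout=10):
    """Download a JSON document and return it as Python objects (hypothetical helper)."""
    url = build_url(base_url, params)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # read() returns bytes; UTF-8 is assumed here, the usual encoding for JSON
        text = response.read().decode("utf-8")
    return json.loads(text)
```

You would call it as content = fetch_json(absolute_url), passing a dict of query parameters as the second argument only if the endpoint needs them.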
The rest depends on the content and shape of the JSON tree. You traverse it with square brackets []: a string key for an object, an integer index for an array.
For example, if the absolute links are held in a tree like:
LINKS:
Name: Name1 - webpage name
Link: Link1 - link to be retrieved
Description: Description1 - some other data
Name: Name2
Link: Link2
Description: Description2
...
you can access the first link with:
link = content["LINKS"][0]["Link"] # assumes LINKS is a list of entries; [0] picks the first one
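Of course, if you don't know how many links there are, loop over the entries instead of indexing them one by one. A for loop over content["LINKS"] handles any number of entries, so you rarely need a while True loop that waits for an error. Do catch KeyError for entries that have no "Link" field. I recommend first fetching all the content and printing it to see how it's structured.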
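Here is a sketch of both approaches: looping over a known structure, and walking a tree whose shape you don't know. The sample data mirrors the hypothetical LINKS tree above and assumes LINKS is a list of objects. The example.com addresses are placeholders, collect_urls is a name I made up, and "looks like a URL" is taken to mean a string starting with http:// or https://.

```python
import json

# Sample data shaped like the LINKS example (assumed: LINKS is a list of objects)
sample = json.loads("""
{
  "LINKS": [
    {"Name": "Name1", "Link": "http://example.com/1", "Description": "Description1"},
    {"Name": "Name2", "Link": "http://example.com/2", "Description": "Description2"}
  ]
}
""")

# Known structure: a loop handles any number of entries
links = [entry["Link"] for entry in sample["LINKS"]]


def collect_urls(node):
    """Recursively gather every string that looks like an absolute URL."""
    found = []
    if isinstance(node, dict):
        for value in node.values():
            found.extend(collect_urls(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_urls(item))
    elif isinstance(node, str) and node.startswith(("http://", "https://")):
        found.append(node)
    return found


# Pretty-print the content first to inspect its structure
print(json.dumps(sample, indent=2))
print(links)
print(collect_urls(sample))
```

The recursive version is handy when the links sit at different depths, but it only finds absolute URLs. Relative links would need to be joined to the site's base address with urllib.parse.urljoin.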