0
Please, how do I write a lyric-scraping web app?
I'm a newbie in the Python programming language, and I was asked to work on a series of mini projects to learn more about the language. I ran into an error on this one, and I don't know why I can't get it right. This is my code: https://code.sololearn.com/cn5gxsRWPfCY/?ref=app
7 Answers
+ 2
For reasons unknown, None is assigned to "result" in line 21, but it does return 4 elements when run in IDLE.
I hadn't used html.content before; I used html.text, and my code grabs part of the lyric in IDLE / the command line.
https://code.sololearn.com/cx9wbGTXzuk3/?ref=app
On a closer look, the lyric is divided into 3 parts, which is also shown in my code.
And I found that the class name is always "kUgSbL" when I open the page.
If we dig deeper into the HTML, the lyric is inside <div id="lyrics-root">, and the sections that contain the lyric carry a data-lyrics-container="true" attribute, at the respective indices 2, 5, and 8 in this example.
Maybe you can write a little function to scan the whole HTML and locate where the lyric is stored, and then pass this info to the grabber function.
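A minimal sketch of such a scanning function, assuming the lyric sections really are the <div> elements marked data-lyrics-container="true" as described above. It uses only the standard library's html.parser, so it runs without extra installs; the names LyricsContainerParser and extract_lyrics are made up for this example. With BeautifulSoup, soup.find_all("div", attrs={"data-lyrics-container": "true"}) should select the same elements.

```python
from html.parser import HTMLParser


class LyricsContainerParser(HTMLParser):
    """Collect the text inside every <div data-lyrics-container="true">."""

    def __init__(self):
        super().__init__()
        self.sections = []  # one string per lyric container found
        self._depth = 0     # <div> nesting depth inside the current container
        self._buffer = []   # text pieces of the container being read

    def handle_starttag(self, tag, attrs):
        if self._depth:
            if tag == "div":
                self._depth += 1           # nested <div> inside a container
            elif tag == "br":
                self._buffer.append("\n")  # keep the line breaks of the lyric
        elif tag == "div" and dict(attrs).get("data-lyrics-container") == "true":
            self._depth = 1                # entering a lyric container
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth and tag == "div":
            self._depth -= 1
            if self._depth == 0:           # the container has just closed
                self.sections.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_lyrics(html_text):
    """Return the text of all lyric containers, joined by newlines."""
    parser = LyricsContainerParser()
    parser.feed(html_text)
    parser.close()
    return "\n".join(parser.sections)
```

Because it looks for the attribute rather than a class name or a fixed index, it should keep working if the class names or the positions of the sections change.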
+ 3
Are you sure that the class names you use, for example "kUgSbL", are consistently the same every time you open the page? There could be some intentional randomization, precisely to make it more difficult for robots to process the page. Maybe you should look for a structural approach to find your intended tags.
+ 3
Hello Taripretei Z. Segu-Baghabo
First of all: I am definitely not an expert when it comes to web scraping.
But as far as I can see, your request has been rejected.
print(html.status_code) #output 403, but must be 200
You probably know the code 404 for page not found. 403 = forbidden, which means that you have not been granted access.
I would assume that you are classified as a bot when you try to read this page with Python.
This may already help you:
https://stackoverflow.com/questions/70718774/how-to-resolve-the-human-or-bot-when-scraping-web-using-beautifulsoup
And you probably also need to handle cookies.
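Here is a rough sketch of sending browser-like headers and checking the status code before parsing. It uses the standard library's urllib so it runs without extra installs; with requests, the equivalent is requests.get(url, headers=headers). The header values and the names BROWSER_HEADERS, build_request and fetch are my own assumptions for illustration, and there is no guarantee that any particular site will accept them.

```python
import urllib.error
import urllib.request

# Assumed example values: they only imitate what a desktop browser sends.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url, headers=None):
    """Create a request object that carries browser-like headers."""
    return urllib.request.Request(url, headers=headers or BROWSER_HEADERS)


def fetch(url, timeout=10):
    """Return (status_code, body). The body is empty if the server refused."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # 403 (forbidden), 404 (not found) and similar codes end up here.
        return err.code, ""
```

Calling fetch(url) gives back the status first, so you can stop early if it is not 200 instead of parsing an error page.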
+ 3
Taripretei Z. Segu-Baghabo
The page you are trying to access has a verification script. Your code cannot get past this point. As Denise Roßberg and Tibor Santa suggested, this is a bot blocker.
When web scraping, it is probably a good idea to inspect the result of your request.
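import requests
from bs4 import BeautifulSoup

url = "https://genius.com/Dax-gods-eyes-lyrics"
html = requests.get(url)
print("hello")
# Check the status first: 200 means OK, 403 means the request was refused.
print(html.status_code)
# Dump the parsed page to see what the server actually sent back.
s = BeautifulSoup(html.content, "html.parser")
print(s.prettify())
print("done")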
+ 3
Wong Hei Ming
Good try with a fake browser ID. That seems to get through the first layer of the protection script.
But the lyrics seem to be encrypted/minified and loaded through JavaScript...
It's a dynamic page, so it is particularly hard to scrape. Running JavaScript within Python is not a trivial thing...
Headless browser scraping seems to be the most commonly suggested method.
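A rough, untested-against-this-site sketch of what headless browser scraping could look like with Selenium. It assumes Selenium 4 and Chrome are installed, and that the lyric sections carry the data-lyrics-container="true" attribute mentioned earlier in this thread. The names fetch_rendered_html and LYRICS_SELECTOR are made up for this example, and the site may still block an automated browser.

```python
# CSS selector for the lyric sections (an assumption based on this thread).
LYRICS_SELECTOR = '[data-lyrics-container="true"]'


def fetch_rendered_html(url, timeout=15):
    """Load the page in headless Chrome and return the HTML after JavaScript ran."""
    # Imported here so this file still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until at least one lyric section exists in the rendered page.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, LYRICS_SELECTOR))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

The returned HTML can then be passed to BeautifulSoup in the same way as html.content from requests.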
+ 2
Taripretei Z. Segu-Baghabo
You could probably use browser automation to get into protected sites like these.
Here is a link I Googled on how to use it:
https://www.zenrows.com/blog/selenium-python-web-scraping