0

Please, how do I write a lyric-scraping web app?

I'm a newbie in the Python programming language and was asked to work on a series of mini projects to learn more about the language, but I ran into an error on this one and I don't know why I can't get it right. This is my code: https://code.sololearn.com/cn5gxsRWPfCY/?ref=app

27th Nov 2023, 10:33 AM
Taripretei Z. Segu-Baghabo
7 Answers
+ 2
For reasons unknown, None is assigned to "result" in line 21, although it does return 4 elements when run in IDLE. I had not used html.content before; I used html.text, and my code grabs part of the lyric in IDLE / on the command line: https://code.sololearn.com/cx9wbGTXzuk3/?ref=app On closer inspection the lyric is divided into 3 parts, which is also shown in my code. I also found that the class name is always "kUgSbL" when I open the page. If we dig deeper into the HTML, the lyric is inside <div id="lyrics-root">, and the sections that contain the lyric carry a data-lyrics-container="true" attribute, at indices 2, 5, and 8 respectively in this example. Maybe you can write a little function that scans the whole HTML, locates where the lyric is stored, and then passes this information to the grabber function.
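The "little function" idea above could be sketched roughly as follows. This is only a minimal sketch under a few assumptions: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs, it assumes the lyric sections are still marked with data-lyrics-container="true", and the names LyricsContainerParser, extract_lyrics and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class LyricsContainerParser(HTMLParser):
    """Collect the text of every element marked data-lyrics-container="true"."""

    def __init__(self):
        super().__init__()
        self.sections = []  # one string per lyric container found
        self._tag = None    # tag name of the container being read
        self._depth = 0     # > 0 while we are inside a lyric container
        self._chunks = []   # text pieces of the container being read

    def handle_starttag(self, tag, attrs):
        if self._depth:
            if tag == self._tag:
                self._depth += 1           # nested element of the same type
            elif tag == "br":
                self._chunks.append("\n")  # keep line breaks between lyric lines
        elif dict(attrs).get("data-lyrics-container") == "true":
            self._tag, self._depth, self._chunks = tag, 1, []

    def handle_endtag(self, tag):
        if self._depth and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:           # the container has just been closed
                self.sections.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_lyrics(html_text):
    """Return the text of each lyric container, in page order."""
    parser = LyricsContainerParser()
    parser.feed(html_text)
    parser.close()
    return parser.sections


# Made-up sample that imitates the structure described above.
sample = (
    '<div id="lyrics-root">'
    '<div class="abc123" data-lyrics-container="true">Line one<br>Line two</div>'
    '<div class="ad">Advert</div>'
    '<div class="xyz789" data-lyrics-container="true">Line <i>three</i></div>'
    '</div>'
)
print(extract_lyrics(sample))
```

Because it matches on the attribute rather than on a class name such as "kUgSbL", the sketch should not care whether class names change between visits. It can only find lyrics that are actually present in the HTML it is given.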
28th Nov 2023, 2:46 AM
Wong Hei Ming
+ 3
Are you sure that the class names you use, for example "kUgSbL", are the same every time you open the page? There could be some intentional randomization, precisely to make it more difficult for robots to process the page. Maybe you should look for a structural approach to find the tags you want.
27th Nov 2023, 6:12 PM
Tibor Santa
+ 3
Hello Taripretei Z. Segu-Baghabo. First of all, I am definitely not an expert when it comes to web scraping, but as far as I can see, your request has been rejected:
    print(html.status_code)  # output: 403, but it must be 200
You probably know the code 404 for "page not found". 403 means "forbidden", i.e. you have not been granted access. I would assume that you are classified as a bot when you try to read this page with Python. This may already help you: https://stackoverflow.com/questions/70718774/how-to-resolve-the-human-or-bot-when-scraping-web-using-beautifulsoup
You probably also need to handle cookies.
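As a rough illustration of checking the status code and sending a browser-like header, here is a minimal sketch. It uses the standard library's urllib instead of requests so that it runs without extra installs (with requests, the equivalent would be requests.get(url, headers=...)). The User-Agent string and the names BROWSER_HEADERS, build_request, explain_status and fetch_html are assumptions chosen for illustration, and there is no guarantee that a header alone gets past a given site's bot check.

```python
import urllib.error
import urllib.request

# Assumed browser-like header; any realistic User-Agent string could be used.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/119.0 Safari/537.36"
    )
}

# The status codes mentioned in this thread.
STATUS_MEANINGS = {
    200: "OK",
    403: "Forbidden: the server refused the request",
    404: "Not Found",
}


def build_request(url):
    """Create a request object that carries the browser-like header."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


def explain_status(code):
    """Translate a status code into a short human-readable message."""
    return STATUS_MEANINGS.get(code, "unexpected status")


def fetch_html(url, timeout=10):
    """Return (status_code, html_text); html_text is None on an HTTP error."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.status, response.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as error:
        # 4xx and 5xx answers are raised as HTTPError by urllib.
        return error.code, None


# Example (performs a real network request, so it is left commented out):
# status, html = fetch_html("https://genius.com/Dax-gods-eyes-lyrics")
# print(status, explain_status(status))
```

Checking the status before parsing makes it obvious whether the parser is looking at the real page or at an error page.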
27th Nov 2023, 6:36 PM
Denise Roßberg
+ 3
Taripretei Z. Segu-Baghabo The page you are trying to access has a verification script, and your code cannot get past it. As Denise Roßberg and Tibor Santa suggested, this is a bot blocker. When web scraping, it is probably a good idea to inspect the result of your request:
    import requests
    from bs4 import BeautifulSoup

    url = "https://genius.com/Dax-gods-eyes-lyrics"
    html = requests.get(url)
    print("hello")
    # print(html.content)
    s = BeautifulSoup(html.content, "html.parser")
    print(s.prettify())  # show what the server actually sent back
    print("done")
27th Nov 2023, 10:33 PM
Bob_Li
+ 3
Wong Hei Ming Good try with a fake browser ID. That seems to get through the first layer of the protection script. But the lyrics seem to be encrypted/minified and loaded through JavaScript. It's a dynamic page, so it is particularly hard to scrape data from it. Running JavaScript within Python is not a trivial thing; headless browser scraping seems to be the most commonly suggested method.
28th Nov 2023, 3:52 AM
Bob_Li
+ 2
Taripretei Z. Segu-Baghabo You could probably use browser automation to get into protected sites like these. Here is a link I Googled up on how to use it: https://www.zenrows.com/blog/selenium-python-web-scraping
27th Nov 2023, 11:26 PM
Bob_Li