+ 1

I'm learning a bit of web scraping with Python. How, why, and when should I use regex and/or bs4 to parse HTML pages?

web scraping with Python

24th Oct 2017, 4:43 AM

Donald Chinhuru

6 Answers

+ 12

BeautifulSoup4 (bs4) lets you break a page down into its building elements, so it works with the HTML from the structure side: use it when you need, say, the sixth hyperlink in the second row, third column of a table preceded by two <div> tags. RegEx (the re module) parses the text itself, looking for character-level patterns. It can find even very complex character combinations in the HTML file, treating it much like a plain text file. For most parsing tasks you will probably need both :)
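Here is a rough sketch of how the two can work together. The HTML snippet, the URLs, and the email address are all made up for illustration, and it assumes bs4 is installed (pip install beautifulsoup4). bs4 walks the structure to reach a specific cell and link, then re picks patterns out of the text it finds there.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page fragment, invented for this example.
HTML = """
<div><div>
<table>
  <tr><td>Row 1</td><td><a href="/a">A</a></td></tr>
  <tr>
    <td>Contact: alice@example.com</td>
    <td><a href="/docs/page-17.html">Docs</a> <a href="/blog/post-42.html">Blog</a></td>
  </tr>
</table>
</div></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Structure side (bs4): second row, second cell, second hyperlink.
rows = soup.find("table").find_all("tr")
cells = rows[1].find_all("td")
link = cells[1].find_all("a")[1]
href = link["href"]

# Text side (re): pull the numeric id out of the href string.
match = re.search(r"post-(\d+)\.html$", href)
post_id = int(match.group(1)) if match else None

# Both together: bs4 narrows down to one cell, re finds email-like strings in its text.
# The pattern is deliberately simple and will not cover every valid address.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", cells[0].get_text())

print(href, post_id, emails)
```

The general idea is to let bs4 locate the right element first and only then apply a regex to that small piece of text, rather than running a regex over the whole raw HTML.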

24th Oct 2017, 5:41 AM

Kuba Siekierzyński