+ 1

How many of you can help me with Python for a data science project in which I am using libraries like Pandas, Matplotlib, and NumPy?

Guys, as we all know, industry is currently facing inflation and many people in India have lost their jobs. At the same time, the field of data science is emerging. With the global increase in data, companies need many data analysis engineers, and the best languages for the job are R and Python. I need your help with a problem. I need to import an .html file into a Jupyter Notebook. Then I need to search for and sort certain strings (some words) and arrange the contents of the .html file back into a DataFrame.

html python python3 pandas project numpy matplotlib dataScience machine-learning sci-learn

3rd Sep 2017, 4:37 AM

Sahil Pole

5 Answers

+ 10

What you need is a good HTML parser. I often use BeautifulSoup (the bs4 module). It parses a page into a tree-like structure that you can navigate tag by tag: up and down, next and previous, child/parent/sibling, and by tag names, attributes, and contents. It is relatively easy to handle and powerful enough to extract anything from the .html file. Once you have extracted what you want, you can sort it easily and convert it to a pandas DataFrame. A DataFrame can hold practically any data type, so you should have no problem.
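As a rough illustration of that workflow, here is a minimal sketch. It assumes the .html file contains a simple table; the sample markup, the column names `name` and `city`, and the row values are all made up for the example. It uses the standard-library `html.parser` backend so no extra parser is needed.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the contents of the .html file.
html = """
<html><body>
  <table>
    <tr><th>name</th><th>city</th></tr>
    <tr><td>Ravi</td><td>Pune</td></tr>
    <tr><td>Anita</td><td>Delhi</td></tr>
    <tr><td>Meera</td><td>Agra</td></tr>
  </table>
</body></html>
"""

# Parse the markup into a navigable tree.
soup = BeautifulSoup(html, "html.parser")

# The first <tr> holds the header cells, the rest hold the data cells.
rows = soup.find_all("tr")
header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
records = [
    [td.get_text(strip=True) for td in row.find_all("td")]
    for row in rows[1:]
]

# Build the DataFrame from the extracted values and sort it by one column.
df = pd.DataFrame(records, columns=header)
df_sorted = df.sort_values("name").reset_index(drop=True)
print(df_sorted)
```

For a real file, you would read the markup with `open("xyz.html", encoding="utf-8").read()` instead of the inline string, and adjust the tag lookups to match the actual page structure.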

3rd Sep 2017, 4:56 AM

Kuba Siekierzyński

+ 4

Aha, so you only want to read it without parsing? Then the pandas built-in should be enough, I guess. I use bs4 to decompose the HTML tag tree, extract the data, and save it for my own needs, not to produce another HTML file. If you are just sorting, then sure, read_html will be enough.

3rd Sep 2017, 10:28 AM

Kuba Siekierzyński

+ 3

@Kuba Is it HTML in Python, or is it pure HTML and nowhere near Python?

3rd Sep 2017, 5:00 AM

👑 Prometheus 🇸🇬

@Kuba_Sierkierzynski Thanks, friend. But with pandas we can import an .html file using pandas.read_html("xyz.html"). The problem is that it is imported as a list, and then I have trouble converting it back into a pandas DataFrame. I have used regex functions to search and sort the strings.
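For what it is worth, `pandas.read_html` returns a list because a page can contain several `<table>` elements, and each item in that list is already a DataFrame, so indexing the list is usually all the conversion that is needed. The sketch below assumes a parser backend such as lxml is installed; the sample table, the column names, and the regex pattern are hypothetical.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample markup; with a real file, pass the path instead,
# e.g. pd.read_html("xyz.html").
html = """
<table>
  <tr><th>name</th><th>city</th></tr>
  <tr><td>Ravi</td><td>Pune</td></tr>
  <tr><td>Anita</td><td>Delhi</td></tr>
  <tr><td>Meera</td><td>Agra</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(StringIO(html))
df = tables[0]  # pick the table you need; it is already a DataFrame

# Keep only the rows whose "city" matches a regex, then sort by "name".
mask = df["city"].str.contains(r"^(?:Pune|Agra)$", regex=True)
result = df[mask].sort_values("name").reset_index(drop=True)
print(result)
```

If the page holds several tables that share the same columns, `pd.concat(tables, ignore_index=True)` is one way to combine them into a single DataFrame.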

3rd Sep 2017, 4:59 AM

Sahil Pole

@vengat It is using Python. We can import various files such as Excel sheets, web pages, .html files, etc. Python provides many libraries and functions to do so. So it is all Python and no HTML.

3rd Sep 2017, 10:09 AM

Sahil Pole