Regular Expression in Python

Question

I started a text-mining project and extracted the text from a PDF file, and of course the page headers are included in the text. I have used regular expressions before, but only for simple things. I know I can search with a regex, so I would like to delete strings like this from the text: "...

\x0c1Lorem Ipsum

...". The only part that changes is the page number: "\x0c2, \x0c12...". There are also 2 blank lines left between the headlines and the text, so I can't delete from

to

. So how can I delete only these parts from the text? Do I have to use the '*' character?

Accepted Answer

You can use the re.sub function to replace the unwanted pattern with something else (e.g. an empty string).
There are also flags you can pass for multiline text matching, such as re.DOTALL (makes '.' match newlines too) and re.MULTILINE (makes '^' and '$' match at every line boundary).
https://www.thegeekstuff.com/2014/07/advanced-python-regex/
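
Without the real text this is only a rough sketch. It assumes each header looks like the one in the question: a form feed (\x0c), the page number, the header text, then the blank lines. The sample string, the HEADER name and the strip_headers helper are made up for illustration, so adjust the pattern to your actual data.

```python
import re

# Hypothetical sample: every page starts with a form feed (\x0c), the page
# number and the header text, followed by two blank lines.
text = (
    "\x0c1Lorem Ipsum\n\n\n"
    "First page body.\n"
    "\x0c2Lorem Ipsum\n\n\n"
    "Second page body.\n"
    "\x0c12Lorem Ipsum\n\n\n"
    "Twelfth page body.\n"
)

# \x0c     the form feed that starts each page
# \d+      the page number (1, 2, 12, ...)
# [^\n]*   the rest of the header line
# \n*      the newline ending the header plus any blank lines after it
HEADER = re.compile(r"\x0c\d+[^\n]*\n*")


def strip_headers(s: str) -> str:
    """Replace every header (and its trailing blank lines) with nothing."""
    return HEADER.sub("", s)


cleaned = strip_headers(text)
print(cleaned)

# With re.MULTILINE, ^ and $ match at each line, so the header lines can
# also be listed one by one (handy for checking the pattern first).
headers = re.findall(r"^\x0c\d+.*$", text, flags=re.MULTILINE)
print(headers)
```

Here '*' means "zero or more of the previous item", which is why \n* also swallows the blank lines after the header, however many there are. If a header could ever start with something other than digits, the \d+ part would need to change.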

If you post your sample text inside '''multiline strings''' in a code block, it would be easier to give more concrete advice. Your description is not clear enough.

Answer

Thank you very much! :) It works for me.
