+ 2

Regular Expressions in Python

I started a text-mining project and extracted the text from a PDF file, and of course the page headers are included in the text. I have used regular expressions before, but only for simple things. I know I can search with a regex, so I would like to delete strings like this from the text: "...\n\n\x0c1Lorem Ipsum\n\n...". The only part that changes is the page number: "\x0c2", "\x0c12", and so on. However, the same two newlines also appear between headings and body text, so I can't simply delete everything from \n\n to \n\n. How can I delete only these header parts from the text? Do I have to use the '*' character?

23rd Jan 2020, 6:43 PM
Lógó Péter
2 Answers
+ 3
You can use the re.sub function to replace the unwanted pattern with something else (e.g. an empty string). There are also additional flags you can apply for multiline text matching, such as re.DOTALL and re.MULTILINE: https://www.thegeekstuff.com/2014/07/advanced-python-regex/ If you post your sample text inside '''multiline strings''' in a code block, it will be easier to give more concrete advice; your description is not clear enough.
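A minimal sketch of the re.sub approach, under some assumptions that the thread does not confirm: each header sits on a single line, starts with the form-feed character \x0c followed by the page number, and is followed by exactly two newlines. The sample text and the names HEADER and strip_headers are made up for illustration.

```python
import re

# Hypothetical sample: body text interrupted by page headers such as "\x0c1Lorem Ipsum".
sample = (
    "end of page one.\n\n"
    "\x0c1Lorem Ipsum\n\n"
    "Start of page two.\n\n"
    "\x0c12Lorem Ipsum\n\n"
    "Start of page three."
)

# \x0c   : the form-feed character that starts each header
# \d+    : the page number (1, 2, 12, ...)
# [^\n]* : the rest of the header line, up to but not including the newline
# \n\n   : the two newlines that follow the header
HEADER = re.compile(r"\x0c\d+[^\n]*\n\n")


def strip_headers(text):
    """Return text with every matched header replaced by an empty string."""
    return HEADER.sub("", text)


cleaned = strip_headers(sample)
print(repr(cleaned))
```

Because the pattern is anchored on \x0c rather than on \n\n, the two newlines that separate ordinary paragraphs are left alone. The '*' quantifier is only used to skip the rest of the header line. If the header title is always the same, replacing [^\n]* with the literal title would make the match stricter.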
23rd Jan 2020, 8:14 PM
Tibor Santa
+ 1
Thank you very much! :) It works for me now.
24th Jan 2020, 3:05 PM
Lógó Péter