+ 2
Regular Expressions in Python
I started a text-mining project and opened a PDF file, and of course the page headers are included in the extracted text. I have used regular expressions before, but only for simple things. I know I can search with a regex, so I would like to delete strings like this from the text: "...\n\n\x0c1Lorem Ipsum\n\n...". The only part that changes is the page number: "\x0c2", "\x0c12", and so on. But two blank lines also appear between headlines and body text, so I can't simply delete everything from \n\n to \n\n. How can I delete only these header strings from the text? Do I have to use the '*' character?
2 Answers
+ 3
You can use the re.sub function to replace the unwanted pattern with something else (e.g. an empty string).
There are also optional flags you can pass for matching in multiline text, such as re.DOTALL and re.MULTILINE.
https://www.thegeekstuff.com/2014/07/advanced-python-regex/
If you post your sample text as a '''multiline string''' in a code block, it would be easier to give more concrete advice. Your description is not clear enough.
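In the meantime, here is a minimal sketch of the re.sub approach. It assumes every header has the shape shown in the question: a form feed (\x0c), a page number, the header text, and then two line breaks. The sample string and the names `text`, `header`, and `cleaned` are made up for illustration, so the pattern will probably need adjusting for the real file.

```python
import re

# Hypothetical sample text: each page header is "\x0c<page number><title>\n\n"
text = (
    "End of page one.\n\n"
    "\x0c2Lorem Ipsum\n\n"
    "Start of page two.\n\n"
    "\x0c12Lorem Ipsum\n\n"
    "Start of page twelve."
)

# \x0c    form feed that starts each header
# \d+     page number (one or more digits)
# [^\n]*  the rest of the header line (any characters except a line break)
# \n\n    the two line breaks that follow the header
header = re.compile(r"\x0c\d+[^\n]*\n\n")

# Replace every header with an empty string
cleaned = header.sub("", text)
print(repr(cleaned))
```

Anchoring the pattern on \x0c means the ordinary \n\n breaks between headlines and text are left alone. The '*' is only needed here because the header text after the page number can have any length.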
+ 1
Thank you very much! :) It works for me.