+ 1
Regex findall
I need help making this simple piece of code work. The code contains 3 images, but the regex keeps finding only 2. I know something is wrong with the pattern. I have tried ^ and $, but that doesn't work. Help. Thanks. https://code.sololearn.com/cd22U2AyX7ix/?ref=app
5 Answers
+ 4
You must make your '.*' non-greedy by appending '?' to it in your regex:
pattern = re.compile(r'<img.*?/>')
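To see the difference, here is a minimal sketch comparing the greedy and non-greedy patterns. The sample line and its URLs are made up for illustration and are not taken from the original code.

```python
import re

# Hypothetical sample line with two img tags (illustrative URLs only).
line = '<p><img alt="" src="a.com" /><img alt="" src="b.com" /></p>'

# Greedy: '.*' runs to the end of the line, then backtracks to the LAST '/>'.
greedy = re.findall(r'<img.*/>', line)

# Non-greedy: '.*?' stops at the FIRST '/>' it can reach.
lazy = re.findall(r'<img.*?/>', line)

print(len(greedy))  # 1 (both tags swallowed into a single match)
print(len(lazy))    # 2 (one match per tag)
```

The greedy version returns both tags glued together as one match, which is why the count comes out too low.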
+ 2
To also match img tags that span multiple lines, you should replace '.' with '[\s\S]', because '.' does not match the newline character by default:
pattern = re.compile(r'<img[\s\S]*?/>')
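A small sketch of the multi-line case, assuming a tag whose attributes are split across two lines (the sample text is invented for illustration). It also shows the standard `re.DOTALL` flag, which makes '.' match newlines and gives the same result as '[\s\S]' here.

```python
import re

# Hypothetical sample: the first img tag is broken across two lines.
text = '<p><img alt=""\n src="a.com" /></p>\n<p><img alt="" src="b.com" /></p>'

# '.' stops at the newline, so the first tag is missed.
dot_only = re.findall(r'<img.*?/>', text)

# '[\s\S]' matches any character, including newline.
any_char = re.findall(r'<img[\s\S]*?/>', text)

# Alternative: keep '.' but pass re.DOTALL so it matches newlines too.
dotall = re.findall(r'<img.*?/>', text, flags=re.DOTALL)

print(len(dot_only))       # 1
print(len(any_char))       # 2
print(any_char == dotall)  # True
```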
+ 1
import re
text = """
<p><img alt="" src="someurl.com" /></p>
<p><img alt="" src="someotherurl.com" /><img alt="" src="anotherurl.com" /></p>
"""
pattern = re.compile(r'<img.*?\n?/>')
all_of_em = re.findall(pattern, text)
print(all_of_em)
print(len(all_of_em))
"""
What causes the unwanted output is that the last match starts on the 2nd line and ends on the 3rd line, so you have to take the newline character into account in your pattern. The question mark plays two roles here. After '.*' it makes the match non-greedy, so it takes as few characters as possible (any character except newline, which is what the dot represents). After '\n' it means "with or without the previous character", so the newline before '/>' is optional.
I hope you got the point.
"""
+ 1
Of course not iTech
Thanks guys.
+ 1
I realise this thread is finished and you have your answer, but I just thought I'd add another solution, because it's quite useful to know.
pattern = re.compile(r'<img[^>]*>')
...works because [^>] will match any character that isn't the tag-closing ">" character. Using this, you can't match more than one img tag in one go.
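A quick sketch of this approach, using invented sample text. Since a negated class such as [^>] also matches newlines, it handles tags split across lines without any extra flags.

```python
import re

# Hypothetical sample: two img tags, the second one split across two lines.
text = '<p><img alt="" src="a.com" /><img alt=""\n src="b.com" /></p>'

# '[^>]*' consumes everything up to, but not including, the next '>'.
tags = re.findall(r'<img[^>]*>', text)

print(len(tags))  # 2
print(tags[1] == '<img alt=""\n src="b.com" />')  # True
```

One caveat: if an attribute value itself contains a ">" character, this pattern would cut the match short at that point.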