I am fairly new to Python and to regular expressions, and I am trying to extract information from an HTML file.
Assume the following is a line in the HTML file (since HTML doesn't "see" whitespace, the example is all on one line):
<td (some possible parameters)> EXTRACT_THIS </td> <td (some possible parameters)> ALSO_EXTRACT_THIS </td>
In my current code:
import re

with open(myInput, 'r') as inputFile:
    for line in inputFile:
        line = line.strip()
        if line != '':
            m = re.findall('<td.*>(.*?)</td>', line)
            if m:
                # strip() again
                print(m)
This will only print:
['ALSO_EXTRACT_THIS']
instead of my desired
['EXTRACT_THIS', 'ALSO_EXTRACT_THIS']
Is there something I am doing wrong? I've looked into it, and it seems that this is a way to extract multiple substrings with repeating delimiters.
This is because .* is greedy: the <td.*> part will match the longest string it can, which is <td (some possible parameters)> EXTRACT_THIS </td> <td (some possible parameters)>. That leaves only the last cell for the capture group.
You should use a non-greedy quantifier for the <td> part too:
'<td.*?>(.*?)</td>'
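To see the difference side by side, here is a minimal sketch. The sample line and its class attributes are made up for illustration (the question only says "some possible parameters"). The third pattern, using [^>]* instead of .*?, is an alternative I am adding rather than part of the answer above. It assumes no attribute value contains a literal > character.

```python
import re

# Hypothetical sample line; the attributes stand in for "some possible parameters".
line = ('<td class="a"> EXTRACT_THIS </td> '
        '<td class="b"> ALSO_EXTRACT_THIS </td>')

# Greedy: '<td.*>' runs to the last '>' that still lets the rest of the pattern match.
greedy = re.findall(r'<td.*>(.*?)</td>', line)

# Non-greedy: '<td.*?>' stops at the first '>' that lets the rest match.
lazy = re.findall(r'<td.*?>(.*?)</td>', line)

# Alternative (assumption: no '>' inside attribute values): never cross a '>'.
negated = re.findall(r'<td[^>]*>(.*?)</td>', line)

# The captured text still has surrounding spaces, so strip each cell.
cells = [cell.strip() for cell in lazy]

print(greedy)  # [' ALSO_EXTRACT_THIS ']
print(cells)   # ['EXTRACT_THIS', 'ALSO_EXTRACT_THIS']
```

If the real file has cells that span several lines or nested tags, a line-by-line regex like this will miss them, and an HTML parser is probably the safer choice.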