Regex try and match until hitting end tag in python

I'm looking for a bit of help with a regex in Python, and Google is failing me. Basically I'm searching some HTML for a certain type of table: any table that includes a background attribute (i.e. BGCOLOR). Some tables have it and some do not. Could someone help me write a regex that finds the start of a table and then searches for BGCOLOR, but stops and moves on if it hits the end of the table first?

Here's a very simplified example that will serve the purpose:

`<TABLE>
<B>Item 1.</B>
</TABLE>

<TABLE>
BGCOLOR
</TABLE>

<TABLE>
<B>Item 2.</B>
</TABLE>`

So we have three tables, but I'm only interested in finding the middle one, which contains 'BGCOLOR'. The problem with my regex at the moment is that it finds the opening table tag, then looks for 'BGCOLOR', and doesn't care if it passes a table end tag along the way:

tables = re.findall('\<table.*?BGCOLOR=".*?".*?\<\/table\>', text, re.I|re.S)

So it would find the first two tables instead of just the second table. Let me know if anyone knows how to handle this situation.
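The overmatch can be reproduced with a small sketch. The sample markup and the `#CCCCCC` value below are made up for illustration, and the unnecessary backslashes from the pattern above are dropped; otherwise the regex is unchanged.

```python
import re

# Hypothetical sample markup; the BGCOLOR value is invented for this demo.
text = """<TABLE>
<B>Item 1.</B>
</TABLE>

<TABLE BGCOLOR="#CCCCCC">
<B>Shaded</B>
</TABLE>

<TABLE>
<B>Item 2.</B>
</TABLE>"""

pattern = r'<table.*?BGCOLOR=".*?".*?</table>'
matches = re.findall(pattern, text, re.I | re.S)

# The lazy .*? still crosses the first </TABLE> because nothing forbids it,
# so the single match starts at table 1 and ends at the close of table 2.
print(len(matches))
print(matches[0].count("</TABLE>"))
```

The lazy quantifier only means "as few characters as possible until BGCOLOR shows up"; it has no notion of a table boundary.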

Thanks, Michael

Don't use a regular expression to parse HTML. Use lxml or BeautifulSoup.

Don't use regular expressions to parse HTML -- use an HTML parser, such as BeautifulSoup.

Specifically, your situation is basically one of having to deal with "nested parentheses" (where an open "parens" is an opening <table> tag and the corresponding closed parens is the matching </table>) -- exactly the kind of parsing task that regular expressions can't perform well. Lots of the work in parsing HTML is exactly connected with this "matched parentheses" issue, which makes regular expressions a perfectly horrible choice for the purpose.
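To make the "matched parentheses" point concrete, here is a minimal sketch that tracks open tables on a stack. It uses Python 3's standard-library `html.parser` rather than BeautifulSoup or lxml, purely so it runs without extra installs; the class name `BgcolorTableFinder` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class BgcolorTableFinder(HTMLParser):
    """Hypothetical helper: records, per table, whether a BGCOLOR attribute appears in it."""

    def __init__(self):
        super().__init__()
        self.open_tables = []  # stack of flags, one per currently open <table>
        self.results = []      # flags for finished tables, in closing order

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.open_tables.append(False)
        # HTMLParser lowercases tag and attribute names before calling us.
        if self.open_tables and any(name == "bgcolor" for name, _ in attrs):
            self.open_tables[-1] = True  # credit the innermost open table only

    def handle_endtag(self, tag):
        if tag == "table" and self.open_tables:
            self.results.append(self.open_tables.pop())


html = """<TABLE><TR><TD>Item 1.</TD></TR></TABLE>
<TABLE><TR BGCOLOR="#CCCCCC"><TD>Shaded</TD></TR></TABLE>
<TABLE><TR><TD><TABLE BGCOLOR="#EEEEEE"><TR><TD>Inner</TD></TR></TABLE></TD></TR></TABLE>"""

finder = BgcolorTableFinder()
finder.feed(html)
finder.close()
print(finder.results)
```

The stack is what a single regular expression lacks: each `</TABLE>` is paired with the most recently opened table, so the nested table in the last row is flagged while its enclosing table is not.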

You mention in a comment to another answer that you've had unspecified problems with BS -- I suspect you were trying the latest, 3.1 release (which has gone downhill) instead of the right one; try 3.0.8 instead, as BS's own docs recommend, and you could be better off.

If you've made some kind of pact with Evil never to use the right tool for the job, your task might not be totally impossible if you don't need to deal with nesting (just matching), i.e., there is never a table inside another table. In this case you can identify one table with r'<\s*TABLE(.*?)<\s*/\s*TABLE' (with suitable flags such as re.DOTALL and re.I); loop over all such matches with the finditer method of regular expressions; and in the loop's body check whether BGCOLOR (in a case-insensitive sense) happens to be inside the body of the current match. It's still going to be more fragile, and more work, than using an HTML parser, but while definitely an inferior choice, it need not be a desperate situation.
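A minimal sketch of that loop, assuming tables are never nested. The function name `tables_with_bgcolor` is hypothetical, and the sample is the simplified markup from the question.

```python
import re

# One table at a time: opening tag, lazily matched body, closing tag.
TABLE_RE = re.compile(r'<\s*TABLE(.*?)<\s*/\s*TABLE', re.DOTALL | re.I)


def tables_with_bgcolor(text):
    """Return the matched text of each (non-nested) table that mentions BGCOLOR."""
    found = []
    for match in TABLE_RE.finditer(text):
        # group(1) is everything between "<TABLE" and the closing tag.
        if "bgcolor" in match.group(1).lower():
            found.append(match.group(0))
    return found


sample = """<TABLE>
<B>Item 1.</B>
</TABLE>

<TABLE>
BGCOLOR
</TABLE>

<TABLE>
<B>Item 2.</B>
</TABLE>"""

hits = tables_with_bgcolor(sample)
print(len(hits))
print(hits[0])
```

Because each match ends at the first closing tag, the BGCOLOR check never sees text from a neighbouring table. Note that the pattern stops just before the final `>` of the closing tag, so the matched text ends in `</TABLE`.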

If you do have nested tables to contend with, then it is a desperate situation.

If your task is just this simple, here's a way: split on </TABLE>, then iterate over the pieces and look for the pattern you want.

myhtml="""
<TABLE>
<B>Item 1.</B>
</TABLE>

some text1
some text2
some text3

<TABLE>
blah
BGCOLOR
blah
</TABLE>

some text
<TABLE>
<B>Item 2.</B>
</TABLE>
"""

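# Each chunk ends where a table closes, so a chunk holds at most one table.
for tab in myhtml.split("</TABLE>"):
    if "<TABLE>" in tab and "BGCOLOR" in tab:
        # Keep only what follows the opening tag, i.e. the table body.
        print(''.join(tab.split("<TABLE>")[1:]))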

output

$ ./python.py

blah
BGCOLOR
blah

Here's the code that ended up working for me. It finds the correct table and adds more tagging around it so that it is identified from the group with open and close tags of 'realTable'.

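import re
from BeautifulSoup import BeautifulSoup, NavigableString, Tag  # BeautifulSoup 3 API

soup = BeautifulSoup(''.join(text))
for p in soup.findAll('table'):
    pattern = '.*BGCOLOR.*'
    if re.match(pattern, str(p), re.S | re.I):
        # Swap the table for a new <realTable> tag, then put the table's markup inside it.
        tags = Tag(soup, "realTable")
        p.replaceWith(tags)
        table_text = NavigableString(str(p))
        tags.insert(0, table_text)
print(soup)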

prints this out:

<table><b>Item 1.</b></table>
<realTable><table>blah BGCOLOR blah</table></realTable>
<table><b>Item 2.</b></table>
