How to match strings, special characters, and HTML tags with a regex?

I'm trying to match HTML text mixed with some ordinary strings. I've already done most of the job, but I'm stuck on a string that sits inside the HTML markup.

The text I'm trying to find looks like this:

>(\n(optional))</td>\n<td style="text-align:right">Text i want</td>\n

The main problem is the optional part: it consists of \\n, parentheses, and a string, and all of it is optional.

What I've done so far is:

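import re

# First pass: grab the whole "</td>\n<td ...>text</td>\n" chunk.
reg_num = r'></td>\\n<td style="text-align:right">.*?</td>\\n'
# Second pass: keep only what sits between '">' and '</'.
reg_num1 = r'(?<="\>).*?(?=\</)'
pattern = re.compile(reg_name)  # reg_name is not defined in this snippet
pattern1 = re.compile(reg_num)
pattern2 = re.compile(reg_num1)
pup = re.findall(pattern1, str(html_text))
new_pup = re.findall(pattern2, str(pup))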

What I did above is first find the surrounding text and then extract the text I want from it. This code works fine for all results that don't contain the optional text.

What should I add to get the matches when the optional text is present too?

Is there a better way to find the text in one step, without splitting it into two searches?

You should not use a regex to parse HTML; use a tool such as XPath queries or CSS/jQuery selectors instead.

A package that allows you to parse HTML is BeautifulSoup. For example:

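from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(str(html_text), 'html.parser')
for td_tag in soup.find_all('td', {'style': 'text-align:right'}):
    print(td_tag.text)  # or do something else with the text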

Here you parse the HTML into a soup object, and then you iterate over all <td> tags that have a style attribute that is exactly "text-align:right". For each of these td_tag elements, you print the .text (you can of course do something else with it).
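To make the exact-match behaviour concrete, here is a minimal self-contained sketch. The sample_html string is a hypothetical input modelled on the question, not the asker's real page: one row has the optional "(...)" text in the preceding cell, one row does not, and a third cell carries a different style string.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input modelled on the question (not the real page).
sample_html = """
<table>
  <tr><td>Name
(optional)</td>
<td style="text-align:right">Text I want</td></tr>
  <tr><td>Other</td>
<td style="text-align:right">More text</td></tr>
  <tr><td style="text-align:right; color:red">Skipped</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# A string value for "style" is compared against the whole attribute value,
# so the third cell (longer style string) is not returned.
texts = [td.get_text(strip=True)
         for td in soup.find_all("td", {"style": "text-align:right"})]
print(texts)  # ['Text I want', 'More text']
```

Whether the neighbouring cell contains the optional text makes no difference here, because the lookup only inspects the target <td> itself.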

If, for instance, you want to build a list of all these texts, you can use a list comprehension:

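from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(str(html_text), 'html.parser')
all_texts = [td_tag.text for td_tag in soup.find_all('td', {'style': 'text-align:right'})]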

As you can see, here you state what you want to extract; there is no need to write complex regexes that can easily fail or may even be impossible to construct. The code also makes it easy to read what you aim to extract.
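The CSS selectors mentioned at the start of this answer are also available in BeautifulSoup through select(). The sketch below uses a made-up html string (an assumption for illustration) to contrast an exact attribute match with a substring match, which may help if the real page sometimes adds other declarations to the style attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical input: two right-aligned cells, one with an extra declaration.
html = (
    '<table><tr>'
    '<td>Label</td>'
    '<td style="text-align:right">42</td>'
    '<td style="color:red; text-align:right">7</td>'
    '</tr></table>'
)
soup = BeautifulSoup(html, "html.parser")

# [style="..."] matches the attribute value exactly.
exact = [td.get_text() for td in soup.select('td[style="text-align:right"]')]

# [style*="..."] matches any value that contains the given substring.
loose = [td.get_text() for td in soup.select('td[style*="text-align:right"]')]

print(exact)  # ['42']
print(loose)  # ['42', '7']
```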

I suggest you use Python's beautifulsoup package.
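For readers who still need a single regex for the question's input, the following is only a sketch under stated assumptions, and a parser remains the more robust choice. It assumes the searched text contains a literal backslash followed by n (as the `\\n` in the asker's raw-string patterns suggests) and that the optional part has the form `\n(some text)` directly before `</td>`. The html_text value is invented for illustration.

```python
import re

# Hypothetical input: line breaks appear as the two characters "\" and "n",
# which is what the question's own patterns look for.
html_text = (
    r'<td><b>Name</b>\n(optional)</td>\n<td style="text-align:right">First</td>\n'
    r'<td><b>Other</b></td>\n<td style="text-align:right">Second</td>\n'
)

pattern = re.compile(
    r'>(?:\\n\([^)]*\))?</td>\\n'                # optional "\n(...)" before </td>
    r'<td style="text-align:right">(.*?)</td>'   # capture the wanted text
)

# With one capture group, findall returns just the captured texts.
matches = pattern.findall(html_text)
print(matches)  # ['First', 'Second']
```

The non-capturing group `(?:...)?` makes the whole optional part skippable, and the capture group replaces the second findall pass from the question.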
