I'm trying to match HTML text mixed with some normal strings. I've already done most of the job, but I have a problem with the string inside the HTML characters.
So the text I'm trying to find looks like this:
>(\n(optional))</td>\n<td style="text-align:right">Text i want</td>\n
The main problem is the optional part, because it contains \\n, parentheses and a string, and all of it is optional.
What I've done so far is:
reg_num = r'></td>\\n<td style="text-align:right">.*?</td>\\n'
reg_num1 = r'(?<="\>).*?(?=\</)'
pattern = re.compile(reg_name)  # note: reg_name is not defined in this snippet
pattern1 = re.compile(reg_num)
pattern2 = re.compile(reg_num1)
pup = re.findall(pattern1, str(html_text))
new_pup = re.findall(pattern2,str(pup))
What I did above is first find the surrounding text and then extract the text I want from it. This code works fine for all results that don't contain the optional text.
What should I add in order to get the matches when the optional text is present too?
Is there a better way to find the text in one step, without splitting it up?
You should not use a regex to parse HTML; you should use a tool like XPath queries or CSS/jQuery selectors.
A package that allows you to parse HTML is BeautifulSoup. For example:
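from bs4 import BeautifulSoup
# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning.
soup = BeautifulSoup(str(html_text), 'html.parser')
for td_tag in soup.find_all('td', {'style': 'text-align:right'}):
    print(td_tag.text)  # or do something else with the text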
Here you parse it into a soup object, and then you iterate over all <td> tags that have a style attribute that is exactly "text-align:right". For each of these td_tag elements, you print the .text (evidently you can do something else with it).
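To make the behaviour concrete, here is a minimal sketch on made-up markup. The `sample_html` string and its cell contents are assumptions for illustration only, not the asker's real page; the first row carries the optional "(...)" text in the preceding cell and the second row does not.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption, not the real page):
# row 1 has the optional "(optional)" text before the target cell, row 2 does not.
sample_html = (
    '<table>'
    '<tr><td>Alice\n(optional)</td>\n<td style="text-align:right">10</td>\n</tr>'
    '<tr><td>Bob</td>\n<td style="text-align:right">20</td>\n</tr>'
    '</table>'
)

soup = BeautifulSoup(sample_html, 'html.parser')

# Only the style attribute of the target cell is matched, so whatever
# precedes it (optional text or nothing) has no influence on the result.
right_aligned = [td.text for td in soup.find_all('td', {'style': 'text-align:right'})]
print(right_aligned)
```

Because the lookup keys on the target cell itself, the optional part in the neighbouring cell never needs to be described at all.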
If you, for instance, want to construct a list of all these texts, you can use a list comprehension:
from bs4 import BeautifulSoup
# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning.
soup = BeautifulSoup(str(html_text), 'html.parser')
all_texts = [td_tag.text for td_tag in soup.find_all('td', {'style': 'text-align:right'})]
As you can see, here you specify what you want to extract; there is no need to write complex regexes that can easily fail or are even impossible to construct. One can easily read what you aim to extract.
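The CSS selector route mentioned at the start of this answer can be sketched with BeautifulSoup's `select` method. The markup below is again invented for illustration. The `=` attribute selector requires the style value to match exactly, while `*=` matches any style that contains the given substring, which may be useful if some cells carry extra declarations.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption for illustration).
sample_html = (
    '<table><tr>'
    '<td style="text-align:right">10</td>'
    '<td style="text-align:right">20</td>'
    '<td style="color:red; text-align:right">30</td>'
    '<td>ignored</td>'
    '</tr></table>'
)

soup = BeautifulSoup(sample_html, 'html.parser')

# Exact match on the whole style attribute value.
exact = [td.get_text() for td in soup.select('td[style="text-align:right"]')]

# Substring match: also picks up cells with additional style declarations.
loose = [td.get_text() for td in soup.select('td[style*="text-align:right"]')]

print(exact)
print(loose)
```

Which of the two selectors fits depends on how consistent the style attributes in the real page are.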
I suggest you use Python's beautifulsoup package.