
Finding &nbsp character using Regex in Python

I have a web page that was generated from a Word document (using Word's "Save As"). The export produced some &nbsp; characters.

Initially, I am using a Regex function to look for "2 General" within the generated HTML text. Here is a snippet where "2 General" is located:

<span style="font-size:9.5pt;font-family:&quot;Arial Black&quot;,sans-serif">2<span style="mso-spacerun:yes">&nbsp;&nbsp;&nbsp;&nbsp; </span><span style="mso-spacerun:yes">&nbsp;</span>General<o:p></o:p></span>")

This is the python code I used:

import re

el1_search = "2 General"
el1_search = re.compile(el1_search.replace(' ', r'[\s\u00A0]*'))
el1 = soup.find(text=el1_search)

The value of el1_search will later come from user input. I don't think I can simply find and replace the &nbsp; characters beforehand, because I want to output the soup with modifications based on this search.

The user will then be able to make specific searches within the text. The results will be used to wrap the parent table element in a div with special attributes.

I can't seem to find the element whose text contains the &nbsp; characters. Could you help me out?

Thanks!

You can use BeautifulSoup's .get_text() method to get the element's text as a single string. The parser decodes each &nbsp; into a non-breaking space, and strip=True drops the pieces that contain only whitespace.

For example:

from bs4 import BeautifulSoup

txt = '''<span style="font-size:9.5pt;font-family:&quot;Arial Black&quot;,sans-serif">2<span style="mso-spacerun:yes">&nbsp;&nbsp;&nbsp;&nbsp; </span><span style="mso-spacerun:yes">&nbsp;</span>General<o:p></o:p></span>'''
soup = BeautifulSoup(txt, 'html.parser')

print(soup.find('span').get_text(strip=True, separator=' '))

Prints:

2 General
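
As for why the original search finds nothing: soup.find(text=...) tests one text node at a time, and here "2", the non-breaking spaces and "General" sit in separate nodes, so no single node contains the whole phrase. A minimal sketch of one way around this is to match against each tag's combined text and then wrap the enclosing table. The helper names build_pattern and find_innermost, the surrounding table markup and the data-search attribute are assumptions made for illustration, not part of the question.

```python
import re
from bs4 import BeautifulSoup

# The snippet from the question, placed inside a table (assumed layout).
html = '''<table><tr><td>
<span style="font-size:9.5pt">2<span style="mso-spacerun:yes">&nbsp;&nbsp;&nbsp;&nbsp; </span><span style="mso-spacerun:yes">&nbsp;</span>General<o:p></o:p></span>
</td></tr></table>'''
soup = BeautifulSoup(html, 'html.parser')


def build_pattern(user_input):
    # Escape each word so user input is matched literally. In Python 3,
    # \s on a str pattern also matches the non-breaking space U+00A0.
    words = user_input.split()
    return re.compile(r'\s*'.join(re.escape(word) for word in words))


def find_innermost(soup, pattern):
    # Return the first tag whose combined text matches the pattern while
    # none of its descendant tags match on their own.
    def matches(tag):
        return pattern.search(tag.get_text()) is not None

    def innermost(tag):
        return matches(tag) and not any(matches(d) for d in tag.find_all(True))

    return soup.find(innermost)


pattern = build_pattern("2 General")
el = find_innermost(soup, pattern)

# Wrap the parent table in a div carrying a hypothetical marker attribute.
table = el.find_parent('table')
wrapper = soup.new_tag('div')
wrapper['data-search'] = '2 General'
table.wrap(wrapper)

print(soup)
```

Because only the tree is modified, the &nbsp; characters stay in the output exactly as Word produced them.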

