Finding the &nbsp; character using regex in Python

Question

I have a web page that was generated from a Word document (using "Save As" from the Word doc). The export produced some &nbsp; characters.

Initially, I am using a Regex function to look for "2 General" within the generated HTML text. Here is a snippet where "2 General" is located:

2      General<o:p></o:p>")

This is the python code I used:

import re

el1_search = "2 General"
# Allow any run of whitespace, including non-breaking spaces (\u00A0), between words
el1_search = re.compile(el1_search.replace(' ', r'[\s\u00A0]*'))
el1 = soup.find(text=el1_search)

The el1_search value will later be replaced by user input. I do not think I can simply find and replace the &nbsp; characters, because I want to output the soup with modifications based on this search.

The user will then be able to make specific searches within the text. The results will be used to wrap the parent table element in a div with special attributes.

I do not seem to be able to find the element that contains the &nbsp; characters. Can you kindly help me out?

Thanks!

Answer 1

You can use BeautifulSoup's .get_text() method to get the string (it will automatically handle &nbsp; and similar entities for you).

For example:

from bs4 import BeautifulSoup

txt = '''<span style="font-size:9.5pt;font-family:&quot;Arial Black&quot;,sans-serif">2<span style="mso-spacerun:yes">&nbsp;&nbsp;&nbsp;&nbsp; </span><span style="mso-spacerun:yes">&nbsp;</span>General<o:p></o:p></span>'''
soup = BeautifulSoup(txt, 'html.parser')

print(soup.find('span').get_text(strip=True, separator=' '))

Prints:

2 General
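
If the goal is to locate the element from user input and then wrap its parent table, one possible approach is to match against each tag's combined text rather than against single text nodes. In markup like the snippet above, "2" and "General" sit in separate text nodes, so a search with text=... has no single node to match. The following is only a sketch under assumptions: the table markup, the helper names make_pattern and find_innermost, and the data-section attribute are all made up for illustration and are not part of the original page.

```python
import re
from bs4 import BeautifulSoup

# Sample markup modeled on the Word export above, placed in a table for illustration.
html = (
    '<table><tr><td><span>2<span style="mso-spacerun:yes">'
    '&nbsp;&nbsp;&nbsp;&nbsp; </span><span style="mso-spacerun:yes">&nbsp;</span>'
    'General<o:p></o:p></span></td></tr></table>'
)
soup = BeautifulSoup(html, 'html.parser')


def make_pattern(user_input):
    """Build a regex from user input that tolerates any whitespace between words."""
    # re.escape guards against regex metacharacters in the input; in Python 3,
    # \s on str patterns already matches the non-breaking space (\u00A0).
    return re.compile(r'\s*'.join(re.escape(word) for word in user_input.split()))


def find_innermost(root, pattern):
    """Return the first tag whose text matches while none of its child tags match."""
    for tag in root.find_all(True):
        if not pattern.search(tag.get_text()):
            continue
        children = tag.find_all(True, recursive=False)
        if not any(pattern.search(child.get_text()) for child in children):
            return tag
    return None


pattern = make_pattern('2 General')
el = find_innermost(soup, pattern)

# Wrap the enclosing table in a div; 'data-section' is a made-up attribute name.
table = el.find_parent('table')
wrapper = soup.new_tag('div')
wrapper['data-section'] = 'general'
table.wrap(wrapper)

print(soup)
```

Because the soup is modified in place, printing it afterwards gives the full document with the new div around the table. If no element matches, find_innermost returns None, so real code should check for that before calling find_parent.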
