Simple question - searching for text in between <td> and </td> tags ignoring new lines

Question

My question might be very simple. I scrape a webpage with BeautifulSoup (BS). In the soup, I want to search for a piece of text (here: example).

Now, if in the soup the content looks like (excerpt):

<!DOCTYPE html>
<td>example</td>

the search works perfectly and returns the text (example).

However, on some occasions the content is:

<!DOCTYPE html>
<td>
   example
</td>

the search does not find it. I guess this is because the text I search for is no longer squeezed directly between the <td> and </td> tags.

The code I use is:

temp = soup.find(text = 'example')

Hope someone can answer this probably very basic question.

Answer 1

That's because in your second example the text is no longer equal to example: it contains line breaks and spaces, which are now part of the text. So your search has to change from "equals" to "contains". In BeautifulSoup, a straightforward way to do that is a regex.
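To make the point concrete, here is a minimal sketch (assuming the built-in html.parser and a stripped-down version of the markup from the question) that shows what the text node really contains and why the exact comparison fails:

```python
from bs4 import BeautifulSoup

# Stripped-down version of the second example from the question
html = "<td>\n   example\n</td>"
soup = BeautifulSoup(html, "html.parser")

node = soup.td.string                   # the single text node inside <td>
print(repr(str(node)))                  # line breaks and spaces are part of the string
print(node == "example")                # exact comparison fails
print(node.strip() == "example")        # succeeds once surrounding whitespace is removed

exact_hit = soup.find(string="example")  # exact-match search, as in the question
print(exact_hit)                         # nothing is found
```

Note that `string=` is the newer name for the `text=` argument used in the question; both behave the same way for this search.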

Assume this is your html:

test = """<!DOCTYPE html>
<doc>
<td>
   example
</td>
<td>example2</td>
<td>unrelated</td>
</doc>"""

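Then you need to

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(test, "html.parser")

and finally

# a compiled regex matches every text node that contains "example"
for entry in soup.find_all(text=re.compile("example")):
    print(entry.strip())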

Output:

example
example2
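Note that the regex also matches example2, because it only checks whether the text contains "example". If you want an exact match that merely ignores the surrounding whitespace, one possible alternative is to pass a function instead of a regex. This is a sketch, not the only way to do it; the helper name `is_exact` is made up for illustration:

```python
from bs4 import BeautifulSoup

test = """<!DOCTYPE html>
<doc>
<td>
   example
</td>
<td>example2</td>
<td>unrelated</td>
</doc>"""

soup = BeautifulSoup(test, "html.parser")

def is_exact(text):
    # hypothetical helper: compare the text node after stripping whitespace
    return text is not None and text.strip() == "example"

matches = soup.find_all(string=is_exact)
cleaned = [m.strip() for m in matches]
print(cleaned)
```

With the sample above, only the first cell should be returned; example2 no longer matches.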

As an aside, for more complicated HTML/XML and searches, relying on regex is not recommended. You may have to switch to a library like lxml.
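For reference, a rough sketch of what the lxml route could look like, using XPath's `normalize-space()` to ignore the surrounding whitespace. The snippet wraps the cells in a proper table, which is an assumption made here for illustration and not part of the original markup:

```python
from lxml import html

# Assumed sample markup: the cells from above, wrapped in a table
snippet = """<table><tr>
<td>
   example
</td>
<td>example2</td>
<td>unrelated</td>
</tr></table>"""

tree = html.fromstring(snippet)

# normalize-space() trims leading/trailing whitespace before comparing
cells = tree.xpath('//td[normalize-space(.) = "example"]')
texts = [cell.text_content().strip() for cell in cells]
print(texts)
```

Here the whitespace handling lives in the query itself, so no regex is needed.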

1 answer

solution1
0 ACCEPTED 2021-05-05 12:34:56