
Simple question - searching for text in between <td> and </td> tags ignoring new lines

My question might be very simple. I scrape a webpage with BeautifulSoup (BS). In the soup, I want to search for a piece of text (here: example).

Now, if in the soup the content looks like (excerpt):

<!DOCTYPE html>
<td>example</td>

it perfectly does the job and outputs the text (example).

However, on some occasions the content is:

<!DOCTYPE html>
<td>
   example
</td>

it does not find it. I guess this is because the text I search for is no longer squeezed directly between the <td> and </td> tags.

The code I use is:

temp = soup.find(text = 'example')

Hope someone can answer this probably very basic question.

That's because in your second example the text is no longer equal to "example": the line breaks and indentation are now part of the text node. So your search has to change from "equals" to "contains". In BeautifulSoup, the usual way to do that is to pass a compiled regex instead of a plain string.
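To see the difference concretely, you can print the repr of each text node. This is a minimal sketch using the two snippets from the question; the variable names compact and spread are made up for illustration, and "html.parser" is an assumption, since the question does not say which parser is used.

```python
from bs4 import BeautifulSoup

# The two variants from the question, reduced to the <td> element.
compact = "<td>example</td>"
spread = "<td>\n   example\n</td>"

for markup in (compact, spread):
    soup = BeautifulSoup(markup, "html.parser")
    node = soup.td.string  # the single text node inside <td>
    # Show the raw text and whether it equals the search string exactly.
    print(repr(str(node)), node == "example")

# 'example' True
# '\n   example\n' False
```

The second node only differs by whitespace, which is enough to make an exact string comparison fail.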

Assume this is your html:

test = """<!DOCTYPE html>
<doc>
<td>
   example
</td>
<td>example2</td>
<td>unrelated</td>
</doc>"""

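Then you need to import re and build the soup:

import re
from bs4 import BeautifulSoup

# "html.parser" is an assumption; the question does not name a parser
soup = BeautifulSoup(test, "html.parser")

and finally search with a compiled pattern, which matches any text node containing "example":

for entry in soup.find_all(text=re.compile("example")):
    print(entry.strip())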

Output:

example
example2
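Note that the pattern also matches example2, because a regex search is a substring search. If only the exact text should match, regardless of surrounding whitespace, one option is to pass a function instead of a regex. The following is a minimal sketch, not the only way to do it: is_exact is a hypothetical helper, "html.parser" is an assumed parser, and string= is the newer name for the text= argument (Beautiful Soup 4.4.0 and later).

```python
from bs4 import BeautifulSoup

test = """<!DOCTYPE html>
<doc>
<td>
   example
</td>
<td>example2</td>
<td>unrelated</td>
</doc>"""

def is_exact(text, wanted="example"):
    # Hypothetical helper: compare the text node with whitespace stripped.
    return text is not None and text.strip() == wanted

soup = BeautifulSoup(test, "html.parser")
matches = soup.find_all(string=is_exact)

print([m.strip() for m in matches])     # ['example']
print([m.parent.name for m in matches])  # ['td']
```

Each match is a text node, so .parent gives the enclosing <td> tag if you need the element rather than the string.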

As an aside, for more complicated HTML/XML documents and searches, relying on regex is not recommended. You may have to switch to a library like lxml, which supports XPath queries.
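For reference, here is a rough sketch of what the lxml route could look like, using the XPath function normalize-space() to ignore the line breaks. The markup variable is my own variation of the sample above, wrapped in a table so the cells sit in valid HTML; it is an illustration under those assumptions, not a drop-in replacement.

```python
from lxml import html

# Same cells as in the sample above, wrapped in <table><tr> (an assumption).
markup = """<table><tr>
<td>
   example
</td>
<td>example2</td>
<td>unrelated</td>
</tr></table>"""

tree = html.fromstring(markup)

# Exact match after collapsing leading/trailing whitespace.
exact = tree.xpath('//td[normalize-space(.) = "example"]')
# Substring match, comparable to the regex search.
partial = tree.xpath('//td[contains(normalize-space(.), "example")]')

print([td.text_content().strip() for td in exact])    # ['example']
print([td.text_content().strip() for td in partial])  # ['example', 'example2']
```

Here the XPath query returns the <td> elements themselves, so there is no need to walk up from a text node.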
