
Beautiful Soup - Grabbing the string after the first specified tag

I'm trying to grab the string immediately after the opening <td> tag. The following code works:

webpage = urlopen(i).read()
soup = BeautifulSoup(webpage)
for elem in soup('td', text=re.compile(".\.doc")):
    print elem.parent

when the HTML looks like this:

<td>plan_49913.doc</td>

but not when the HTML looks like this:

<td>plan_49913.doc<br /> <font color="#990000">Document superseded by: &#160;</font><a href="/plans/Jan_2012.html">January 2012</a></td>

I've tried playing with attrs but can't get it to work. Basically, I just want to grab 'plan_49913.doc' from either version of the HTML.

Any advice would be greatly appreciated.

Thank you in advance.

~chrisK

This works for me:

>>> html = '<td>plan_49913.doc<br /> <font color="#990000">Document superseded by: &#160;</font><a href="/plans/Jan_2012.html">January 2012</a></td>'
>>> soup = BeautifulSoup(html)
>>> soup.find(text=re.compile('.\.doc'))
u'plan_49913.doc'

Is there something I'm missing?

Also, note that according to the documentation:

If you use text, then any values you give for name and the keyword arguments are ignored.

So you don't need to pass 'td', since it is ignored anyway. In other words, any matching text will be returned, whichever tag it sits under.
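If matches under other tags are unwanted, one option is to search by text alone and then keep only the strings whose parent is a td. The following is a minimal sketch, not the code from the question: it assumes Python 3 with the bs4 package (4.4 or later, for the string argument) and the built-in "html.parser", and the extra <p> element is made up purely to show a match outside a td.

```python
import re
from bs4 import BeautifulSoup

# The <p> element is a hypothetical addition; the <td> is the markup from the question.
html = (
    '<p>notes.doc</p>'
    '<td>plan_49913.doc<br /> <font color="#990000">Document superseded by: &#160;</font>'
    '<a href="/plans/Jan_2012.html">January 2012</a></td>'
)

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r".\.doc")

# Searching by text alone returns matching strings under any tag.
matches = soup.find_all(string=pattern)

# Keep only the strings whose direct parent is a <td>.
in_td = [str(s) for s in matches if s.parent.name == "td"]
print(in_td)
```

Here the string under <p> is found by the search but dropped by the parent check.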

Just use the next property. It contains the node that follows the tag, which in this case is a text node.

>>> html = '<td>plan_49913.doc<br /> <font color="#990000">Document superseded by: &#160;</font><a href="/plans/Jan_2012.html">January 2012</a></td>'
>>> bs = BeautifulSoup(html)
>>> texts = [ node.next for node in bs.findAll('td') if node.next.endswith('.doc') ]
>>> texts
[u'plan_49913.doc']

You can change the if clause to use a regex if you prefer.
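As a rough sketch of that regex variant, the snippet below assumes Python 3 with the bs4 package and the built-in "html.parser", where the equivalent of next is spelled next_element. The helper name first_text_in_cells is hypothetical, and the type and parent checks are an added precaution for cells that do not start with text.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Both HTML variants from the question, one after the other.
html = (
    '<td>plan_49913.doc</td>'
    '<td>plan_49913.doc<br /> <font color="#990000">Document superseded by: &#160;</font>'
    '<a href="/plans/Jan_2012.html">January 2012</a></td>'
)

def first_text_in_cells(soup, pattern):
    """Return the leading text of each <td> whose leading text matches pattern."""
    results = []
    for td in soup.find_all("td"):
        node = td.next_element  # node immediately after the opening <td>
        # Skip cells that start with a tag or are empty.
        if isinstance(node, NavigableString) and node.parent is td:
            text = str(node).strip()
            if pattern.search(text):
                results.append(text)
    return results

soup = BeautifulSoup(html, "html.parser")
texts = first_text_in_cells(soup, re.compile(r"\.doc$"))
print(texts)
```

Because only the first node inside each td is inspected, the text inside the nested font and a tags is never considered.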
