Extracting specific content from HTML using Python BeautifulSoup

Question

I have been trying to extract

Bacillus circulans

from the following HTML:

<tr><th class="th10" align="left" valign="top" style="border-color:#000; border-width: 1px 0px 0px 1px; border-style: solid"><nobr>Organism</nobr></th>
<td class="td10" style="border-color:#000; border-width: 1px 1px 0px 1px; border-style: solid"><div style="width:555px;overflow-x:auto;overflow-y:hidden"><a href="/kegg-bin/show_organism?tax=1397">ag</a>&nbsp;&nbsp;Addendum (Bacillus circulans)<br>
</div></td></tr>

but I am not sure which tag it is under or how to get into that tag.

I would appreciate your help.

Thank you, Xp

Edit: I am actually trying to get "Bacillus circulans" from a KEGG Addendum page.

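from urllib.request import urlopen
from bs4 import BeautifulSoup as BS

url = 'http://www.kegg.jp/entry/ag:CAA27061'

# Download the page and parse it
page = urlopen(url).read()
soup = BS(page, 'html.parser')

# soup('div') is shorthand for soup.find_all('div') and returns a list of tags
tags = soup('div')

for tag in tags:
    print(tag)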

Above is what I know how to do. Since there are more organisms to retrieve, I don't think I can use 're' to match a fixed pattern. I want to find the tag associated with the Addendum organism and fetch the organism names.

Answer 1

from bs4 import BeautifulSoup as soup
html='''<tr><th class="th10" align="left" valign="top" style="border-color:#000; border-width: 1px 0px 0px 1px; border-style: solid"><nobr>Organism</nobr></th>
<td class="td10" style="border-color:#000; border-width: 1px 1px 0px 1px; border-style: solid"><div style="width:555px;overflow-x:auto;overflow-y:hidden"><a href="/kegg-bin/show_organism?tax=1397">ag</a>&nbsp;&nbsp;Addendum (Bacillus circulans)<br>
</div></td></tr>'''
html = soup(html, 'html.parser')  # name the parser explicitly to avoid a bs4 warning
print(html.text)

This is a simple approach, which prints:

Organism
ag  Addendum (Bacillus circulans)

Then you can take the text between the parentheses:

print(html.text.split('(')[1].split(')')[0])

which prints Bacillus circulans.
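The chained split raises an IndexError when the text contains no parentheses. A slightly more defensive sketch, assuming the name is always the first parenthesised group in the text (the helper name `first_parenthesised` is made up for illustration):

```python
import re

def first_parenthesised(text):
    """Return the text inside the first pair of parentheses, or None."""
    match = re.search(r'\(([^()]*)\)', text)
    return match.group(1).strip() if match else None

print(first_parenthesised('Organism\nag\xa0\xa0Addendum (Bacillus circulans)\n'))
print(first_parenthesised('no parentheses here'))
```

The second call prints None instead of failing, which makes it easier to skip pages that do not follow the expected layout.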

Answer 2

You could do this using bs4 and regular expressions.

BeautifulSoup Part

from bs4 import BeautifulSoup
h = """
<tr><th class="th10" align="left" valign="top" style="border-color:#000; border-width: 1px 0px 0px 1px; border-style: solid"><nobr>Organism</nobr></th>
<td class="td10" style="border-color:#000; border-width: 1px 1px 0px 1px; border-style: solid"><div style="width:555px;overflow-x:auto;overflow-y:hidden"><a href="/kegg-bin/show_organism?tax=1397">ag</a>&nbsp;&nbsp;Addendum (Bacillus circulans)<br>
</div></td></tr>
"""
soup = BeautifulSoup(h, 'html.parser')

Your content lies inside a <div> tag.

tag = soup.find('div')
t = tag.text #'ag\xa0\xa0Addendum (Bacillus circulans)\n'

Regular Expression Part

import re
# Group 1 is the text before the parentheses, group 2 the text inside them
m = re.match(r'(.*)\((.*)\).*', t)
ans = m.group(2)  #Bacillus circulans
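The two parts can be combined into one helper that also copes with a missing <div> or a failed match. This is only a sketch: the function name `organism_from_fragment` is hypothetical, and it assumes the wanted name sits in the first <div> of the fragment.

```python
import re
from bs4 import BeautifulSoup

def organism_from_fragment(html):
    """Return the parenthesised name from the first <div>, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('div')
    if tag is None:  # no <div> in the fragment
        return None
    # re.search scans the whole text, so leading content does not matter
    m = re.search(r'\(([^)]*)\)', tag.get_text())
    return m.group(1) if m else None

fragment = ('<td><div><a href="/kegg-bin/show_organism?tax=1397">ag</a>'
            '&nbsp;&nbsp;Addendum (Bacillus circulans)<br></div></td>')
print(organism_from_fragment(fragment))
```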

Answer 3

The usual preliminaries.

>>> import bs4
>>> soup = bs4.BeautifulSoup('''\
... <tr><th class="th10" align="left" valign="top" style="border-color:#000; border-width: 1px 0px 0px 1px; border-style: solid"><nobr>Organism</nobr></th><td class="td10" style="border-color:#000; border-width: 1px 1px 0px 1px; border-style: solid"><div style="width:555px;overflow-x:auto;overflow-y:hidden"><a href="/kegg-bin/show_organism?tax=1397">ag</a>&nbsp;&nbsp;Addendum (Bacillus circulans)<br></div></td></tr>''', 'lxml')

Then I prettify the soup to see what I'm up against.

>>> for line in soup.prettify().split('\n'):
...     print(line)
... 
<html>
 <body>
  <tr>
   <th align="left" class="th10" style="border-color:#000; border-width: 1px 0px 0px 1px; border-style: solid" valign="top">
    <nobr>
     Organism
    </nobr>
   </th>
   <td class="td10" style="border-color:#000; border-width: 1px 1px 0px 1px; border-style: solid">
    <div style="width:555px;overflow-x:auto;overflow-y:hidden">
     <a href="/kegg-bin/show_organism?tax=1397">
      ag
     </a>
     Addendum (Bacillus circulans)
     <br/>
    </div>
   </td>
  </tr>
 </body>
</html>

I can see that the string you want is one of three items that constitute the contents of a div element. My first step is to identify that element, and I use its style attribute.

>>> parentDiv = soup.find('div', attrs={"style":"width:555px;overflow-x:auto;overflow-y:hidden"})

I examine the three items in its contents, and I'm reminded that strings don't have a name; for them, name is None.

>>> for item in parentDiv.contents:
...     item, item.name
...     
(<a href="/kegg-bin/show_organism?tax=1397">ag</a>, 'a')
('\xa0\xa0Addendum (Bacillus circulans)', None)
(<br/>, 'br')

Then to isolate that string I can use:

>>> BC_string = [_ for _ in parentDiv.contents if not _.name]
>>> BC_string 
['\xa0\xa0Addendum (Bacillus circulans)']
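An alternative to testing `.name` is to test the type of each child directly. The sketch below assumes the same div structure as in the question and uses `html.parser` so that no extra parser needs to be installed; it also strips the non-breaking spaces left over from `&nbsp;`.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup(
    '<div><a href="/kegg-bin/show_organism?tax=1397">ag</a>'
    '&nbsp;&nbsp;Addendum (Bacillus circulans)<br></div>',
    'html.parser',
)
div = soup.find('div')

# Keep only direct text children; str.strip() with no arguments also
# removes non-breaking spaces (\xa0), which count as whitespace.
strings = [child.strip() for child in div.children
           if isinstance(child, NavigableString) and child.strip()]
print(strings)
```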

Edit: Given the information from the comment, this is how to handle one page. Find the heading for 'Organism' (in a nobr element), then look for the div that contains the desired text relative to that element. Filter out the strings from the other elements that make up the contents of that div, then use a regex to obtain the parenthesised name of the organism. If the regex fails, offer the whole string.

>>> import re
>>> import bs4
>>> import requests
>>> soup_2 = bs4.BeautifulSoup(requests.get('http://www.kegg.jp/entry/ag:CAA27061').content, 'lxml')
>>> organism = soup_2.find_all('nobr', string='Organism')
>>> # nobr -> its parent th -> next sibling tag (the td) -> first div inside it
>>> parentDiv = organism[0].parent.find_next_sibling().find('div')
>>> desiredContent = [_.strip() for _ in parentDiv.contents if not _.name and _.strip()]
>>> if desiredContent:
...     m = re.match(r'[^(]*\(([^)]+)', desiredContent[0])
...     if m:
...         name = m.group(1)
...     else:
...         name = "Couldn't match content of " + desiredContent[0]
...         
>>> name
'Bacillus circulans'
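Since the question mentions more organisms to retrieve, the same steps can be wrapped in a function that takes the page HTML as a string. This is a minimal sketch, not a tested scraper: the name `organism_name` is hypothetical, and it assumes every entry page uses the th/td layout shown in the question. It is exercised here on a static fragment rather than the live site.

```python
import re
from bs4 import BeautifulSoup

def organism_name(html):
    """Return the parenthesised organism name from a KEGG-style row, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    # Find the header cell whose visible text is exactly 'Organism'.
    header = soup.find(
        lambda tag: tag.name == 'th' and tag.get_text(strip=True) == 'Organism')
    if header is None:
        return None
    cell = header.find_next_sibling('td')  # the data cell next to the header
    if cell is None:
        return None
    m = re.search(r'\(([^)]+)\)', cell.get_text())
    return m.group(1) if m else None

sample = '''<table><tr><th><nobr>Organism</nobr></th>
<td><div><a href="/kegg-bin/show_organism?tax=1397">ag</a>&nbsp;&nbsp;Addendum (Bacillus circulans)<br>
</div></td></tr></table>'''
print(organism_name(sample))
```

To process several entries, one could call it with `requests.get(url).text` for each URL and skip any page for which it returns None.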
