
Extract a certain content from html using python BeautifulSoup

I have been trying to extract

Bacillus circulans

from the following HTML:

<tr><th class="th10" align="left" valign="top" style="border-color:#000; border-width: 1px 0px 0px 1px; border-style: solid"><nobr>Organism</nobr></th>
<td class="td10" style="border-color:#000; border-width: 1px 1px 0px 1px; border-style: solid"><div style="width:555px;overflow-x:auto;overflow-y:hidden"><a href="/kegg-bin/show_organism?tax=1397">ag</a>&nbsp;&nbsp;Addendum (Bacillus circulans)<br>
</div></td></tr>

but I am not sure which tag it is under or how to get to that tag.

I would appreciate your help.

Thank you, Xp

Edit: I am actually trying to get Bacillus circulans from the KEGG Addendum page:

from urllib.request import urlopen
from bs4 import BeautifulSoup as BS

url = 'http://www.kegg.jp/entry/ag:CAA27061'

page = urlopen(url).read()

soup = BS(page, 'html.parser')

# soup('div') is shorthand for soup.find_all('div') and returns a list of tags
tags = soup('div')

for tag in tags:
    print(tag.contents)

Above is what I know how to do. Since there are more organisms to retrieve, I don't think I can use 're' to match a fixed pattern. I want to find the tag associated with the Addendum organism and fetch the organism names.

from bs4 import BeautifulSoup as soup
html='''<tr><th class="th10" align="left" valign="top" style="border-color:#000; border-width: 1px 0px 0px 1px; border-style: solid"><nobr>Organism</nobr></th>
<td class="td10" style="border-color:#000; border-width: 1px 1px 0px 1px; border-style: solid"><div style="width:555px;overflow-x:auto;overflow-y:hidden"><a href="/kegg-bin/show_organism?tax=1397">ag</a>&nbsp;&nbsp;Addendum (Bacillus circulans)<br>
</div></td></tr>'''
html=soup(html, 'html.parser')
print(html.text)

This is a simple approach. It prints:

Organism
ag  Addendum (Bacillus circulans)

Then you can do:

print(html.text.split('(')[1].split(')')[0])

which prints Bacillus circulans.
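Note that the split-based one-liner raises an IndexError when the text contains no parentheses. A minimal sketch of a more defensive variant, assuming you want None in that case (the helper name text_in_parens is hypothetical, not part of bs4):

```python
def text_in_parens(text):
    """Return the text inside the first pair of parentheses, or None."""
    # partition() returns empty separators instead of raising when nothing is found
    _, opened, rest = text.partition('(')
    inner, closed, _ = rest.partition(')')
    return inner if opened and closed else None

print(text_in_parens('ag\xa0\xa0Addendum (Bacillus circulans)\n'))
print(text_in_parens('no parentheses here'))
```

The first call prints Bacillus circulans and the second prints None.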

You could do this using bs4 and regular expressions.

BeautifulSoup part

from bs4 import BeautifulSoup
h = """
<tr><th class="th10" align="left" valign="top" style="border-color:#000; 
border-width: 1px 0px 0px 1px; border-style: solid"><nobr>Organism</nobr>
</th>
<td class="td10" style="border-color:#000; border-width: 1px 1px 0px 1px; 
border-style: solid"><div style="width:555px;overflow-x:auto;overflow-
y:hidden"><a href="/kegg-bin/show_organism?
tax=1397">ag</a>&nbsp;&nbsp;Addendum (Bacillus circulans)<br>
</div></td></tr>
"""
soup = BeautifulSoup(h, 'html.parser')

Your content lies inside a <div> tag.

tag = soup.find('div')
t = tag.text #'ag\xa0\xa0Addendum (Bacillus circulans)\n'

Regular expression part

import re
m = re.match(r'(.*)\((.*)\).*', t)
ans = m.group(2)  #Bacillus circulans
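If the text might not contain parentheses, m will be None and m.group(2) fails. A small sketch of a guarded variant, swapping in re.search with a tighter pattern (this is one possible alternative, not the only way to write it):

```python
import re

# Text as returned by tag.text in the BeautifulSoup part
t = 'ag\xa0\xa0Addendum (Bacillus circulans)\n'

# re.search scans the whole string; [^)]+ stops at the first closing parenthesis
m = re.search(r'\(([^)]+)\)', t)
ans = m.group(1) if m else None
print(ans)
```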

The usual preliminaries.

>>> import bs4
>>> soup = bs4.BeautifulSoup('''\
... <tr><th class="th10" align="left" valign="top" style="border-color:#000; border-width: 1px 0px 0px 1px; border-style: solid"><nobr>Organism</nobr></th><td class="td10" style="border-color:#000; border-width: 1px 1px 0px 1px; border-style: solid"><div style="width:555px;overflow-x:auto;overflow-y:hidden"><a href="/kegg-bin/show_organism?tax=1397">ag</a>&nbsp;&nbsp;Addendum (Bacillus circulans)<br></div></td></tr>''', 'lxml')

Then I prettify the soup to see what I'm up against.

>>> for line in soup.prettify().split('\n'):
...     print(line)
... 
<html>
 <body>
  <tr>
   <th align="left" class="th10" style="border-color:#000; border-width: 1px 0px 0px 1px; border-style: solid" valign="top">
    <nobr>
     Organism
    </nobr>
   </th>
   <td class="td10" style="border-color:#000; border-width: 1px 1px 0px 1px; border-style: solid">
    <div style="width:555px;overflow-x:auto;overflow-y:hidden">
     <a href="/kegg-bin/show_organism?tax=1397">
      ag
     </a>
     Addendum (Bacillus circulans)
     <br/>
    </div>
   </td>
  </tr>
 </body>
</html>

I can see that the string you want is one of three items that make up the contents of a div element. My first step is to identify that element, and I use its style attribute to do so.

>>> parentDiv = soup.find('div', attrs={"style":"width:555px;overflow-x:auto;overflow-y:hidden"})

I examine the three items in its contents, and I'm reminded that strings don't have a name; it's None.

>>> for item in parentDiv.contents:
...     item, item.name
...     
(<a href="/kegg-bin/show_organism?tax=1397">ag</a>, 'a')
('\xa0\xa0Addendum (Bacillus circulans)', None)
(<br/>, 'br')

Then, to isolate that string, I can use:

>>> BC_string = [_ for _ in parentDiv.contents if not _.name]
>>> BC_string 
['\xa0\xa0Addendum (Bacillus circulans)']
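The same filtering can be written with an explicit type check instead of relying on name being None. This is a minimal sketch, assuming the built-in 'html.parser' backend and a shortened copy of the div from the question:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = ('<div><a href="/kegg-bin/show_organism?tax=1397">ag</a>'
        '&nbsp;&nbsp;Addendum (Bacillus circulans)<br></div>')
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div')

# Keep only the bare text nodes among the div's direct children;
# strip() also removes the leading non-breaking spaces (\xa0)
strings = [str(item).strip() for item in div.contents
           if isinstance(item, NavigableString)]
print(strings)
```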

Edit: Given the information from the comment, this is how to handle one page. Find the heading for 'Organism' (in a nobr element), then look for the div that contains the desired text relative to that element. Filter out the string(s) from the other elements in the contents of that div, then use a regex to obtain the parenthesised name of the organism. If the regex fails, offer the whole string instead.

>>> import re
>>> import bs4
>>> import requests
>>> soup_2 = bs4.BeautifulSoup(requests.get('http://www.kegg.jp/entry/ag:CAA27061').content, 'lxml')
>>> organism = soup_2.find_all('nobr', string='Organism')
>>> # find_parents and find_next_siblings are the current names for fetchParents and fetchNextSiblings
>>> parentDiv = organism[0].find_parents()[0].find_next_siblings()[0].find_all('div')[0]
>>> desiredContent = [_.strip() for _ in parentDiv.contents if not _.name and _.strip()]
>>> if desiredContent:
...     m = re.match(r'[^\(]*\(([^\)]+)', desiredContent[0])
...     if m:
...         name = m.groups()[0]
...     else:
...         name = "Couldn't match content of " + desiredContent[0]
...         
>>> name
'Bacillus circulans'
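Since the question mentions that there are more organisms to retrieve, the steps above can be wrapped in a function and applied to each page in turn. The following is only a sketch: the function name organism_name is hypothetical, it uses 'html.parser' rather than 'lxml' to avoid an extra dependency, and it assumes every entry page has the same 'Organism' row layout as the one shown in the question.

```python
import re
from bs4 import BeautifulSoup

def organism_name(html):
    """Return the parenthesised organism name from an entry page, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    # Locate the row heading, then the first table cell that follows it
    header = soup.find('nobr', string='Organism')
    if header is None:
        return None
    cell = header.find_next('td')
    if cell is None:
        return None
    m = re.search(r'\(([^)]+)\)', cell.get_text())
    return m.group(1) if m else None

# Offline sample with the same structure as the row in the question
sample = '''<table><tr><th><nobr>Organism</nobr></th>
<td><div><a href="/kegg-bin/show_organism?tax=1397">ag</a>&nbsp;&nbsp;Addendum (Bacillus circulans)<br>
</div></td></tr></table>'''
print(organism_name(sample))
```

For live pages you could pass requests.get(url).text for each entry URL you need; pages without an 'Organism' row simply yield None.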
