Extract certain content from HTML using Python BeautifulSoup
I have been trying to extract Bacillus circulans from the following HTML:
<tr><th class="th10" align="left" valign="top" style="border-color:#000; border-width: 1px 0px 0px 1px; border-style: solid"><nobr>Organism</nobr></th>
<td class="td10" style="border-color:#000; border-width: 1px 1px 0px 1px; border-style: solid"><div style="width:555px;overflow-x:auto;overflow-y:hidden"><a href="/kegg-bin/show_organism?tax=1397">ag</a> Addendum (Bacillus circulans)<br>
</div></td></tr>
but I am not sure which tag it is under or how to get into that tag.
I would appreciate your help.
Thank you, Xp
Edit: I am actually trying to get Bacillus circulans from the KEGG Addendum page.
import urllib
from bs4 import BeautifulSoup as BS
url = 'http://www.kegg.jp/entry/ag:CAA27061'
page = urllib.urlopen(url).read()
soup = BS(page, 'html.parser')
tags = soup('div')
for tag in tags:  # soup('div') returns a list of tags, so iterate over it
    print tag.contents
Above is what I know how to do. Since there are more organisms to retrieve, I don't think I can use 're' to match a pattern. I want to find a tag that is associated with 'Addendum org' and fetch the organism names.
from bs4 import BeautifulSoup as soup
html='''<tr><th class="th10" align="left" valign="top" style="border-color:#000; border-width: 1px 0px 0px 1px; border-style: solid"><nobr>Organism</nobr></th>
<td class="td10" style="border-color:#000; border-width: 1px 1px 0px 1px; border-style: solid"><div style="width:555px;overflow-x:auto;overflow-y:hidden"><a href="/kegg-bin/show_organism?tax=1397">ag</a> Addendum (Bacillus circulans)<br>
</div></td></tr>'''
html = soup(html, 'html.parser')  # name the parser explicitly to avoid a bs4 warning
print(html.text)
This is a simple way, which prints:
Organism
ag Addendum (Bacillus circulans)
Then you can use:
print(html.text.split('(')[1].split(')')[0])
which prints Bacillus circulans.
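The chained split above raises an IndexError when the text has no parentheses. A minimal sketch of a more forgiving variant, assuming plain string handling is enough; the helper name between_parentheses is made up for illustration and is not part of bs4:

```python
def between_parentheses(text):
    """Return the text inside the first '(...)' pair, or None if there is none."""
    # str.partition never raises: it returns empty strings when the separator is missing.
    _, opening, rest = text.partition('(')
    inside, closing, _ = rest.partition(')')
    if not opening or not closing:
        return None
    return inside.strip()


print(between_parentheses('Organism\nag Addendum (Bacillus circulans)\n'))
print(between_parentheses('Organism\nag Addendum'))
```

The second call returns None instead of failing, which may be easier to handle when looping over many entries.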
You could do this using bs4 and regular expressions.
BeautifulSoup Part
from bs4 import BeautifulSoup
h = """
<tr><th class="th10" align="left" valign="top" style="border-color:#000;
border-width: 1px 0px 0px 1px; border-style: solid"><nobr>Organism</nobr>
</th>
<td class="td10" style="border-color:#000; border-width: 1px 1px 0px 1px;
border-style: solid"><div style="width:555px;overflow-x:auto;overflow-
y:hidden"><a href="/kegg-bin/show_organism?
tax=1397">ag</a> Addendum (Bacillus circulans)<br>
</div></td></tr>
"""
soup = BeautifulSoup(h, 'html.parser')  # parse the string defined above as h
Your content lies inside a <div> tag.
tag = soup.find('div')
t = tag.text #'ag\xa0\xa0Addendum (Bacillus circulans)\n'
Regular Expression Part
import re
m = re.match(r'(.*)\((.*)\).*', t)  # group 2 is the text inside the parentheses
ans = m.group(2) #Bacillus circulans
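One thing to keep in mind with the pattern above: the leading (.*) is greedy, so if a string ever contained two parenthesised groups, group 2 would be the last one. A small sketch of a pattern that takes the first group instead; the sample string with a second group is invented purely to show the difference, and first_parenthesised is a hypothetical helper name:

```python
import re

# Match up to the first '(' and capture everything before the next ')'.
FIRST_PARENS = re.compile(r'[^(]*\(([^)]*)\)')


def first_parenthesised(text):
    """Return the contents of the first '(...)' group in text, or None."""
    m = FIRST_PARENS.match(text)
    return m.group(1).strip() if m else None


# Invented sample: the real page text has only one parenthesised group.
sample = 'ag\xa0\xa0Addendum (Bacillus circulans) (hypothetical note)\n'
print(first_parenthesised(sample))                    # first group
print(re.match(r'(.*)\((.*)\).*', sample).group(2))   # greedy pattern: last group
```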
The usual preliminaries.
>>> import bs4
>>> soup = bs4.BeautifulSoup('''\
... <tr><th class="th10" align="left" valign="top" style="border-color:#000; border-width: 1px 0px 0px 1px; border-style: solid"><nobr>Organism</nobr></th><td class="td10" style="border-color:#000; border-width: 1px 1px 0px 1px; border-style: solid"><div style="width:555px;overflow-x:auto;overflow-y:hidden"><a href="/kegg-bin/show_organism?tax=1397">ag</a> Addendum (Bacillus circulans)<br></div></td></tr>''', 'lxml')
Then I prettify the soup to see what I'm up against.
>>> for line in soup.prettify().split('\n'):
...     print(line)
...
<html>
 <body>
  <tr>
   <th align="left" class="th10" style="border-color:#000; border-width: 1px 0px 0px 1px; border-style: solid" valign="top">
    <nobr>
     Organism
    </nobr>
   </th>
   <td class="td10" style="border-color:#000; border-width: 1px 1px 0px 1px; border-style: solid">
    <div style="width:555px;overflow-x:auto;overflow-y:hidden">
     <a href="/kegg-bin/show_organism?tax=1397">
      ag
     </a>
     Addendum (Bacillus circulans)
     <br/>
    </div>
   </td>
  </tr>
 </body>
</html>
I can see that the string you want is one of three items that constitute the contents of a div element. My first step is to identify that element, and I use its style attribute.
>>> parentDiv = soup.find('div', attrs={"style":"width:555px;overflow-x:auto;overflow-y:hidden"})
I examine the three items in its contents, and I'm reminded that strings don't have a name; it's None.
>>> for item in parentDiv.contents:
...     item, item.name
...
(<a href="/kegg-bin/show_organism?tax=1397">ag</a>, 'a')
('\xa0\xa0Addendum (Bacillus circulans)', None)
(<br/>, 'br')
Then to isolate that string I can use:
>>> BC_string = [_ for _ in parentDiv.contents if not _.name]
>>> BC_string
['\xa0\xa0Addendum (Bacillus circulans)']
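As an alternative to testing for a missing name, the same filter can be written with an explicit type check. This is only a sketch: the HTML fragment below is a trimmed copy of the div from the question, and the two &nbsp; entities are an assumption based on the '\xa0\xa0' seen in the output above.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = ('<div style="width:555px;overflow-x:auto;overflow-y:hidden">'
        '<a href="/kegg-bin/show_organism?tax=1397">ag</a>'
        '&nbsp;&nbsp;Addendum (Bacillus circulans)<br></div>')
div = BeautifulSoup(html, 'html.parser').find('div')

# Keep only the bare text nodes that are direct children of the div,
# dropping child tags such as <a> and <br> and any whitespace-only strings.
strings = [item.strip() for item in div.contents
           if isinstance(item, NavigableString) and item.strip()]
print(strings)
```

Calling strip() also removes the leading non-breaking spaces, since Python 3 treats '\xa0' as whitespace.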
Edit: Given information from a comment, this is how to handle one page. Find the heading for 'Organism' (in a nobr element), then look for the div that contains the desired text relative to that element. Filter out the string(s) from other elements that are contents of that div, then use a regex to obtain the parenthesised name of the organism. If the regex fails then offer the whole string.
>>> import re
>>> import bs4
>>> import requests
>>> soup_2 = bs4.BeautifulSoup(requests.get('http://www.kegg.jp/entry/ag:CAA27061').content, 'lxml')
>>> organism = soup_2.find_all('nobr', string='Organism')
>>> # find_parents/find_next_siblings replace the deprecated fetchParents/fetchNextSiblings aliases
>>> parentDiv = organism[0].find_parents()[0].find_next_siblings()[0].find_all('div')[0]
>>> desiredContent = [_.strip() for _ in parentDiv.contents if not _.name and _.strip()]
>>> if desiredContent:
...     m = re.match(r'[^(]*\(([^)]+)', desiredContent[0])
...     if m:
...         name = m.groups()[0]
...     else:
...         name = "Couldn't match content of " + desiredContent[0]
...
>>> name
'Bacillus circulans'
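Since the question mentions more organisms to retrieve, the steps above could be wrapped in a function and called once per page. The following is a rough sketch under several assumptions: PAGE is a hypothetical two-row excerpt standing in for a downloaded entry page (the 'Entry' row is invented), organism_name is an illustrative name, and the built-in 'html.parser' is used so that lxml is not required.

```python
import re
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical excerpt standing in for a downloaded KEGG entry page.
PAGE = '''
<table>
<tr><th><nobr>Entry</nobr></th><td><div>CAA27061</div></td></tr>
<tr><th><nobr>Organism</nobr></th>
<td><div><a href="/kegg-bin/show_organism?tax=1397">ag</a>&nbsp;&nbsp;Addendum (Bacillus circulans)<br></div></td></tr>
</table>
'''


def organism_name(html):
    """Return the parenthesised organism name, the raw cell text, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    heading = soup.find('nobr', string='Organism')
    if heading is None:
        return None
    # Walk from the heading to its <th>, then across to the neighbouring <td>.
    th = heading.find_parent('th')
    cell = th.find_next_sibling('td') if th else None
    if cell is None:
        return None
    container = cell.find('div') or cell
    texts = [item.strip() for item in container.contents
             if isinstance(item, NavigableString) and item.strip()]
    if not texts:
        return None
    m = re.search(r'\(([^)]+)\)', texts[0])
    # Fall back to the whole string when there is no parenthesised part.
    return m.group(1) if m else texts[0]


print(organism_name(PAGE))
```

For live pages, the html argument would come from something like requests.get(url).content, as in the session above; returning None for a missing heading lets the caller skip entries that have no Organism row.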