Using BeautifulSoup with Python to parse a page for attribute values
I am trying to use Python with BeautifulSoup to go through a page whose sections have ids that increment by 1, and I want to get the vids that belong to each section. However, the number of vids varies from one span id to the next, as you can see below, and the vids are not nested under the tr that contains the span.
Right now I loop over the spans to get each span id value, but I am trying to figure out a way to get the vid values as an array for each span id.
The following is an example of the HTML I am working with:
<tr>
<td>
<div>
<span class="apple-font" id="001">
</div>
</td>
</tr>
<tr>
</tr>
<tr>
<td>
<a vid="0099882"></a>
</td>
</tr>
<tr>
<td>
<a vid="0099883"></a>
</td>
</tr>
<tr>
<td>
<a vid="0099883"></a>
</td>
</tr>
<tr>
<td>
<div>
<span class="apple-font" id="002">
</div>
</td>
</tr>
<tr>
</tr>
<tr>
<td>
<a vid="0099883"></a>
</td>
</tr>
<tr>
<td>
<div>
<span class="apple-font" id="003">
</div>
</td>
</tr>
<tr>
</tr>
<tr>
<td>
<a vid="0099883"></a>
</td>
</tr>
<tr>
<td>
<a vid="0099883"></a>
</td>
</tr>
<tr>
<td>
<div>
<span class="apple-font" id="004">
</div>
</td>
</tr>
<tr>
</tr>
The following is the code I have been trying; I have not made much progress on getting all the vids:
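# Collect every span that has class "apple-font" and an id attribute.
spans = soup.find_all(class_="apple-font", id=True)
for s in spans:
    # Strip leading whitespace and drop any periods from the span text.
    n = str(s.get_text().lstrip().replace(".", ""))
    print(n)
    print()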
I'd use an iterative approach: loop over all tr elements in the same table, starting from the first <span class="apple-font"> tag, and start a new group each time you find a row with a new id:
# Find the table that contains the first id span.
table = soup.find(class_='apple-font', id=True).find_parent('table')
groups = {}   # span id -> list of vid values
group = None  # list for the most recently seen id span
for tr in table.find_all('tr'):
    id_span = tr.find(class_='apple-font', id=True)
    if id_span is not None:
        # new group
        group = []
        groups[id_span['id']] = group
    else:
        vid_link = tr.find('a', vid=True)
        if vid_link is not None:
            group.append(vid_link['vid'])
Demo:
>>> from bs4 import BeautifulSoup
>>> sample = '''\
... <tr>
... <td>
... <div>
... <span class="apple-font" id="001">
... </div>
... </td>
... </tr>
...
... <tr>
... </tr>
...
... <tr>
... <td>
... <a vid="0099882"></a>
... </td>
... </tr>
...
... <tr>
... <td>
... <a vid="0099883"></a>
... </td>
... </tr>
...
... <tr>
... <td>
... <a vid="0099883"></a>
... </td>
... </tr>
...
...
... <tr>
... <td>
... <div>
... <span class="apple-font" id="002">
... </div>
... </td>
... </tr>
...
... <tr>
... </tr>
...
... <tr>
... <td>
... <a vid="0099883"></a>
... </td>
... </tr>
...
... <tr>
... <td>
... <div>
... <span class="apple-font" id="003">
... </div>
... </td>
... </tr>
...
... <tr>
... </tr>
...
... <tr>
... <td>
... <a vid="0099883"></a>
... </td>
... </tr>
...
... <tr>
... <td>
... <a vid="0099883"></a>
... </td>
... </tr>
...
... <tr>
... <td>
... <div>
... <span class="apple-font" id="004">
... </div>
... </td>
... </tr>
...
... <tr>
... </tr>
... '''
>>> soup = BeautifulSoup('<table>{}</table>'.format(sample), 'html.parser')
>>> table = soup.find(class_='apple-font', id=True).find_parent('table')
>>> groups = {}
>>> group = None
>>> for tr in table.find_all('tr'):
...     id_span = tr.find(class_='apple-font', id=True)
...     if id_span is not None:
...         # new group
...         group = []
...         groups[id_span['id']] = group
...     else:
...         vid_link = tr.find('a', vid=True)
...         if vid_link is not None:
...             group.append(vid_link['vid'])
...
>>> print(groups)
{'001': ['0099882', '0099883', '0099883'], '002': ['0099883'], '003': ['0099883', '0099883'], '004': []}
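For what it's worth, the same grouping idea can also be sketched as a single pass over the tags with the standard library's html.parser, without building a BeautifulSoup tree. This is a different technique from the loop above, not a restatement of it. The names VidGrouper and group_vids and the shortened sample are made up for illustration, and the sketch assumes Python 3. Unlike the loop above, it skips any vid link that appears before the first id span instead of failing on it.

```python
from html.parser import HTMLParser


class VidGrouper(HTMLParser):
    """Group <a vid="..."> values under the most recent id span (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.groups = {}      # span id -> list of vid values, in document order
        self._current = None  # list for the most recently seen id span, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "apple-font" in classes and attrs.get("id"):
            # A span with class "apple-font" and a non-empty id starts a new group.
            self._current = self.groups[attrs["id"]] = []
        elif tag == "a" and attrs.get("vid") and self._current is not None:
            # Links seen before the first id span are skipped.
            self._current.append(attrs["vid"])


def group_vids(html):
    """Return a dict mapping each span id to the vids that follow it."""
    parser = VidGrouper()
    parser.feed(html)
    parser.close()
    return parser.groups


# Shortened, made-up sample in the same shape as the HTML in the question.
sample = """
<table>
<tr><td><a vid="0000000"></a></td></tr>
<tr><td><div><span class="apple-font" id="001"></div></td></tr>
<tr></tr>
<tr><td><a vid="0099882"></a></td></tr>
<tr><td><a vid="0099883"></a></td></tr>
<tr><td><div><span class="apple-font" id="002"></div></td></tr>
<tr></tr>
</table>
"""

groups = group_vids(sample)
print(groups)  # {'001': ['0099882', '0099883'], '002': []}
```

Because this only reacts to start tags in document order, it does not care whether the vid links sit in sibling rows or somewhere else after the span, which may or may not be what you want for your real page.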