Parse 'a' tags based on attribute using Python and BeautifulSoup
Using BeautifulSoup with Python to parse page for attribute values
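I'm trying to use Python with BeautifulSoup to walk through a page whose span id values increment by 1, and to collect the vid values that belong to each one. However, the number of vids varies from one span id to the next (as shown below), and the vids are not nested under the tr that holds the span.
Right now I'm looping to get the span id values, but I'm trying to figure out a way to get the vid values as an array for each span id.
Below is the sample HTML I'm working with: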
<tr>
<td>
<div>
<span class="apple-font" id="001">
</div>
</td>
</tr>
<tr>
</tr>
<tr>
<td>
<a vid="0099882"></a>
</td>
</tr>
<tr>
<td>
<a vid="0099883"></a>
</td>
</tr>
<tr>
<td>
<a vid="0099883"></a>
</td>
</tr>
<tr>
<td>
<div>
<span class="apple-font" id="002">
</div>
</td>
</tr>
<tr>
</tr>
<tr>
<td>
<a vid="0099883"></a>
</td>
</tr>
<tr>
<td>
<div>
<span class="apple-font" id="003">
</div>
</td>
</tr>
<tr>
</tr>
<tr>
<td>
<a vid="0099883"></a>
</td>
</tr>
<tr>
<td>
<a vid="0099883"></a>
</td>
</tr>
<tr>
<td>
<div>
<span class="apple-font" id="004">
</div>
</td>
</tr>
<tr>
</tr>
Below is the code I've been using and experimenting with, without much progress:
soup = soup.findAll(class_="apple-font", id=True)
for s in soup:
    n = str(s.get_text().lstrip().replace(".",""))
    print n
    print
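I'd use an iterative approach: starting from the first <span class="apple-font"> tag, loop over all the tr elements in the same table, and start a new group each time you find a row with a new id: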
# locate the table that contains the first span carrying an id
table = soup.find(class_='apple-font', id=True).find_parent('table')
groups = {}
group = None
for tr in table.find_all('tr'):
    id_span = tr.find(class_='apple-font', id=True)
    if id_span is not None:
        # new group
        group = []
        groups[id_span['id']] = group
    else:
        # any other row: collect its vid into the current group
        vid_link = tr.find('a', vid=True)
        if vid_link is not None:
            group.append(vid_link['vid'])
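As a rough sketch, the same loop can be wrapped in a function for Python 3. The function name `group_vids_by_span_id`, the `SAMPLE` string, and the choice of the built-in `html.parser` are my own assumptions for illustration, not part of the loop above. Two deliberate differences: it walks every tr in the document rather than only the table around the first span, and it skips vid rows that show up before any span id has been seen, since `group` would still be `None` at that point.

```python
from bs4 import BeautifulSoup


def group_vids_by_span_id(html):
    """Map each span id to the vid values found in the rows that follow it."""
    # Assumption: "html.parser" (bundled with Python) is good enough here.
    soup = BeautifulSoup(html, "html.parser")
    groups = {}
    current = None  # list belonging to the most recent span id, if any
    for tr in soup.find_all("tr"):
        id_span = tr.find(class_="apple-font", id=True)
        if id_span is not None:
            # A row carrying a span id opens a new group.
            current = []
            groups[id_span["id"]] = current
            continue
        vid_link = tr.find("a", vid=True)
        # Ignore vid rows that appear before the first span id.
        if vid_link is not None and current is not None:
            current.append(vid_link["vid"])
    return groups


# Hypothetical, shortened version of the sample with the spans closed.
SAMPLE = """
<table>
  <tr><td><div><span class="apple-font" id="001"></span></div></td></tr>
  <tr></tr>
  <tr><td><a vid="0099882"></a></td></tr>
  <tr><td><a vid="0099883"></a></td></tr>
  <tr><td><div><span class="apple-font" id="002"></span></div></td></tr>
  <tr><td><a vid="0099883"></a></td></tr>
  <tr><td><div><span class="apple-font" id="003"></span></div></td></tr>
</table>
"""

print(group_vids_by_span_id(SAMPLE))
```

On Python 3.7 or later the dictionary keeps insertion order, so the groups come back in the order the span ids appear on the page.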
Demo on the sample HTML (a Python 2 interactive session):
>>> from bs4 import BeautifulSoup
>>> sample = '''\
... <tr>
... <td>
... <div>
... <span class="apple-font" id="001">
... </div>
... </td>
... </tr>
...
... <tr>
... </tr>
...
... <tr>
... <td>
... <a vid="0099882"></a>
... </td>
... </tr>
...
... <tr>
... <td>
... <a vid="0099883"></a>
... </td>
... </tr>
...
... <tr>
... <td>
... <a vid="0099883"></a>
... </td>
... </tr>
...
...
... <tr>
... <td>
... <div>
... <span class="apple-font" id="002">
... </div>
... </td>
... </tr>
...
... <tr>
... </tr>
...
... <tr>
... <td>
... <a vid="0099883"></a>
... </td>
... </tr>
...
... <tr>
... <td>
... <div>
... <span class="apple-font" id="003">
... </div>
... </td>
... </tr>
...
... <tr>
... </tr>
...
... <tr>
... <td>
... <a vid="0099883"></a>
... </td>
... </tr>
...
... <tr>
... <td>
... <a vid="0099883"></a>
... </td>
... </tr>
...
... <tr>
... <td>
... <div>
... <span class="apple-font" id="004">
... </div>
... </td>
... </tr>
...
... <tr>
... </tr>
... '''
>>> soup = BeautifulSoup('<table>{}</table>'.format(sample))
>>> table = soup.find(class_='apple-font', id=True).find_parent('table')
>>> groups = {}
>>> group = None
>>> for tr in table.find_all('tr'):
...     id_span = tr.find(class_='apple-font', id=True)
...     if id_span is not None:
...         # new group
...         group = []
...         groups[id_span['id']] = group
...     else:
...         vid_link = tr.find('a', vid=True)
...         if vid_link is not None:
...             group.append(vid_link['vid'])
...
>>> print groups
{'003': ['0099883', '0099883'], '002': ['0099883'], '001': ['0099882', '0099883', '0099883'], '004': []}
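A different way to get the same grouping, offered only as an alternative sketch and not as the approach shown above, is to start from each `<a vid=...>` tag and look backwards for the nearest preceding span with BeautifulSoup's `find_previous`. The function name `vids_by_nearest_span` and the `HTML` string are hypothetical, and the code assumes Python 3 with the built-in `html.parser`.

```python
from bs4 import BeautifulSoup


def vids_by_nearest_span(html):
    """Group vid values under the closest preceding span that has an id."""
    soup = BeautifulSoup(html, "html.parser")
    # Seed every span id first so ids without any vids still appear.
    groups = {
        span["id"]: []
        for span in soup.find_all(class_="apple-font", id=True)
    }
    for link in soup.find_all("a", vid=True):
        # Walk backwards in document order to the nearest matching span.
        owner = link.find_previous(class_="apple-font", id=True)
        if owner is not None:
            groups[owner["id"]].append(link["vid"])
    return groups


# Hypothetical sample: one row holds two links.
HTML = """
<table>
  <tr><td><span class="apple-font" id="001"></span></td></tr>
  <tr><td><a vid="0099882"></a><a vid="0099883"></a></td></tr>
  <tr><td><span class="apple-font" id="002"></span></td></tr>
</table>
"""

print(vids_by_nearest_span(HTML))
```

Because this looks at every link rather than the first link per row, it also picks up rows that hold more than one `<a vid=...>` tag. The trade-off is one backwards search per link, which may be slower on very large pages.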