將BeautifulSoup與Python結合使用以解析頁面的屬性值

Question

我正在嘗試將Python與BeautifulSoup一起使用，以瀏覽其ID值遞增1的頁面，並且試圖獲取其vid。 但是，vid的數量是可變的，具體取決於范圍ID（如下所示），它也不嵌套在原始tr下。

現在，我正在做一個循環以獲取span ID值，但是我試圖找出一種方法來獲取vid值作為每個span id的數組。

以下是我正在使用的示例html：

<tr>
    <td>
        <div>
            <span class="apple-font" id="001">
        </div>
    </td>
</tr>

<tr>
</tr>

<tr>
    <td>
        <a vid="0099882"></a>
    </td>
</tr>

<tr>
    <td>
        <a vid="0099883"></a>
    </td>
</tr>

<tr>
    <td>
        <a vid="0099883"></a>
    </td>
</tr>


<tr>
    <td>
        <div>
            <span class="apple-font" id="002">
        </div>
    </td>
</tr>

<tr>
</tr>

<tr>
    <td>
        <a vid="0099883"></a>
    </td>
</tr>

<tr>
    <td>
        <div>
            <span class="apple-font" id="003">
        </div>
    </td>
</tr>

<tr>
</tr>

<tr>
    <td>
        <a vid="0099883"></a>
    </td>
</tr>

<tr>
    <td>
        <a vid="0099883"></a>
    </td>
</tr>

<tr>
    <td>
        <div>
            <span class="apple-font" id="004">
        </div>
    </td>
</tr>

<tr>
</tr>

以下是我正在使用的代碼/一直在嘗試但未取得太多進展的所有代碼：

soup = soup.findAll(class_="apple-font", id=True)
for s in soup:       
   n = str(s.get_text().lstrip().replace(".",""))
   print n
print

Answer 1

我會使用迭代方法； 從第一個<span class="apple-font">標記開始，循環遍歷同一張表中的所有tr元素，並在每次找到具有新id的行時開始一個新組：

table = soup.find(class_='apple-font', id=True).find_parent('table')
groups = {}
group = None
for tr in table.find_all('tr'):
    id_span = tr.find(class_='apple-font', id=True)
    if id_span is not None:
        # new group
        group = []
        groups[id_span['id']] = group
    else:
        vid_link = tr.find('a', vid=True)
        if vid_link is not None:
            group.append(vid_link['vid'])

演示：

>>> from bs4 import BeautifulSoup
>>> sample = '''\
... <tr>
...     <td>
...         <div>
...             <span class="apple-font" id="001">
...         </div>
...     </td>
... </tr>
... 
... <tr>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099882"></a>
...     </td>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099883"></a>
...     </td>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099883"></a>
...     </td>
... </tr>
... 
... 
... <tr>
...     <td>
...         <div>
...             <span class="apple-font" id="002">
...         </div>
...     </td>
... </tr>
... 
... <tr>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099883"></a>
...     </td>
... </tr>
... 
... <tr>
...     <td>
...         <div>
...             <span class="apple-font" id="003">
...         </div>
...     </td>
... </tr>
... 
... <tr>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099883"></a>
...     </td>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099883"></a>
...     </td>
... </tr>
... 
... <tr>
...     <td>
...         <div>
...             <span class="apple-font" id="004">
...         </div>
...     </td>
... </tr>
... 
... <tr>
... </tr>
... '''
>>> soup = BeautifulSoup('<table>{}</table>'.format(sample))
>>> table = soup.find(class_='apple-font', id=True).find_parent('table')
>>> groups = {}
>>> group = None
>>> for tr in table.find_all('tr'):
...     id_span = tr.find(class_='apple-font', id=True)
...     if id_span is not None:
...         # new group
...         group = []
...         groups[id_span['id']] = group
...     else:
...         vid_link = tr.find('a', vid=True)
...         if vid_link is not None:
...             group.append(vid_link['vid'])
... 
>>> print groups
{'003': ['0099883', '0099883'], '002': ['0099883'], '001': ['0099882', '0099883', '0099883'], '004': []}

將BeautifulSoup與Python結合使用以解析頁面的屬性值

問題描述

1 個解決方案

解決方案1
1 已采納 2015-03-09 16:44:05

將BeautifulSoup與Python結合使用以解析頁面的屬性值

問題描述

1 個解決方案

解決方案1 1 已采納 2015-03-09 16:44:05

解決方案1
1 已采納 2015-03-09 16:44:05