Using BeautifulSoup with Python to parse page for attribute values

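I'm trying to use Python with BeautifulSoup to go through a page whose span id values increment by 1, and I want to get the vid values that belong to each id. However, the number of vids varies from one span id to the next (see below), and the vids are not nested under the tr that holds the span.

Right now I loop over the page to get the span id values, but I'm trying to figure out a way to get the vid values as an array for each span id.

Below is the sample HTML I'm working with: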

<tr>
    <td>
        <div>
            <span class="apple-font" id="001">
        </div>
    </td>
</tr>

<tr>
</tr>

<tr>
    <td>
        <a vid="0099882"></a>
    </td>
</tr>

<tr>
    <td>
        <a vid="0099883"></a>
    </td>
</tr>

<tr>
    <td>
        <a vid="0099883"></a>
    </td>
</tr>


<tr>
    <td>
        <div>
            <span class="apple-font" id="002">
        </div>
    </td>
</tr>

<tr>
</tr>

<tr>
    <td>
        <a vid="0099883"></a>
    </td>
</tr>

<tr>
    <td>
        <div>
            <span class="apple-font" id="003">
        </div>
    </td>
</tr>

<tr>
</tr>

<tr>
    <td>
        <a vid="0099883"></a>
    </td>
</tr>

<tr>
    <td>
        <a vid="0099883"></a>
    </td>
</tr>

<tr>
    <td>
        <div>
            <span class="apple-font" id="004">
        </div>
    </td>
</tr>

<tr>
</tr>

Below is the code I'm using; it is everything I've tried so far, without much progress:

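# keep the parsed document in `soup`; store the matching spans separately
spans = soup.find_all(class_="apple-font", id=True)
for s in spans:
    # strip leading whitespace and drop periods from the span text
    n = s.get_text().lstrip().replace(".", "")
    print(n)
print()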

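I'd use an iterative approach: starting from the first <span class="apple-font"> tag, loop over all tr elements in the same table, and start a new group each time you find a row with a new id: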

table = soup.find(class_='apple-font', id=True).find_parent('table')
groups = {}
group = None
for tr in table.find_all('tr'):
    id_span = tr.find(class_='apple-font', id=True)
    if id_span is not None:
        # new group
        group = []
        groups[id_span['id']] = group
    else:
        vid_link = tr.find('a', vid=True)
        if vid_link is not None:
            group.append(vid_link['vid'])
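
The same loop can be packaged as a function. The following is only a sketch under a few assumptions that are not in the original answer: the function name `group_vids_by_span_id` is made up for illustration, the built-in `html.parser` is used, rows with a vid that appear before any id span are skipped instead of raising an error, and the whole document is scanned when no enclosing table is found.

```python
from bs4 import BeautifulSoup


def group_vids_by_span_id(html):
    """Hypothetical helper: map each apple-font span id to the vids in the rows after it."""
    soup = BeautifulSoup(html, "html.parser")
    first_span = soup.find(class_="apple-font", id=True)
    if first_span is None:
        return {}  # no id spans in the document
    table = first_span.find_parent("table")
    if table is None:
        table = soup  # assumption: fall back to scanning the whole document

    groups = {}
    group = None
    for tr in table.find_all("tr"):
        id_span = tr.find(class_="apple-font", id=True)
        if id_span is not None:
            # a row with an id span starts a new group
            group = []
            groups[id_span["id"]] = group
        elif group is not None:
            # rows seen before the first id span are ignored (group is still None)
            vid_link = tr.find("a", vid=True)
            if vid_link is not None:
                group.append(vid_link["vid"])
    return groups


html = """<table>
<tr><td><a vid="orphan"></a></td></tr>
<tr><td><div><span class="apple-font" id="001"></span></div></td></tr>
<tr><td><a vid="0099882"></a></td></tr>
<tr><td><div><span class="apple-font" id="002"></span></div></td></tr>
</table>"""
print(group_vids_by_span_id(html))  # {'001': ['0099882'], '002': []}
```

The `elif group is not None` guard is the only behavioural difference from the loop above, which would fail with an AttributeError if a vid row came before the first id span.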

Demo, running the same loop on the sample HTML from the question:

>>> from bs4 import BeautifulSoup
>>> sample = '''\
... <tr>
...     <td>
...         <div>
...             <span class="apple-font" id="001">
...         </div>
...     </td>
... </tr>
... 
... <tr>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099882"></a>
...     </td>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099883"></a>
...     </td>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099883"></a>
...     </td>
... </tr>
... 
... 
... <tr>
...     <td>
...         <div>
...             <span class="apple-font" id="002">
...         </div>
...     </td>
... </tr>
... 
... <tr>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099883"></a>
...     </td>
... </tr>
... 
... <tr>
...     <td>
...         <div>
...             <span class="apple-font" id="003">
...         </div>
...     </td>
... </tr>
... 
... <tr>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099883"></a>
...     </td>
... </tr>
... 
... <tr>
...     <td>
...         <a vid="0099883"></a>
...     </td>
... </tr>
... 
... <tr>
...     <td>
...         <div>
...             <span class="apple-font" id="004">
...         </div>
...     </td>
... </tr>
... 
... <tr>
... </tr>
... '''
>>> soup = BeautifulSoup('<table>{}</table>'.format(sample), 'html.parser')
>>> table = soup.find(class_='apple-font', id=True).find_parent('table')
>>> groups = {}
>>> group = None
>>> for tr in table.find_all('tr'):
...     id_span = tr.find(class_='apple-font', id=True)
...     if id_span is not None:
...         # new group
...         group = []
...         groups[id_span['id']] = group
...     else:
...         vid_link = tr.find('a', vid=True)
...         if vid_link is not None:
...             group.append(vid_link['vid'])
... 
>>> print(groups)
{'001': ['0099882', '0099883', '0099883'], '002': ['0099883'], '003': ['0099883', '0099883'], '004': []}
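
If the rows are not guaranteed to sit in one table, a variation that may help is to collect the id spans and the vid links in a single `find_all` call, which returns matches in document order, and group them in one pass. This is an alternative sketch, not the method shown above; the function name `group_vids_in_document_order` is hypothetical, and `html.parser` is assumed as the parser.

```python
from bs4 import BeautifulSoup


def group_vids_in_document_order(html):
    """Hypothetical helper: group vid values under the most recent apple-font span id."""
    soup = BeautifulSoup(html, "html.parser")
    groups = {}
    current = None
    # find_all with a list of names yields span and a tags in document order
    for tag in soup.find_all(["span", "a"]):
        if tag.name == "span":
            if tag.has_attr("id") and "apple-font" in tag.get("class", []):
                # an id span starts a new group
                current = []
                groups[tag["id"]] = current
        elif tag.has_attr("vid") and current is not None:
            # a vid link belongs to the last id span seen so far
            current.append(tag["vid"])
    return groups
```

Because this version never looks at tr or table elements, it does not depend on the rows sharing a parent; the trade-off is that any `a` tag with a vid attribute anywhere after an id span is counted, so it suits pages where that is the intended grouping.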

