Web scraping with Beautiful Soup 4 in Python

Question

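I've just started using Beautiful Soup 4 and I've run into a problem that I've been trying to solve for a few days without success. Let me first paste the HTML I want to parse: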

<table class="table table-condensed table-hover tenlaces tablesorter">
<thead>
<tr>
<th class="al">Language</th>
<th class="ac">Link</th>
</tr>
</thead>
<tbody>


            <tr>
            <td class="tdidioma"><span class="flag flag_0">0</span></td>
            <td class="tdenlace"><a class="btn btn-mini enlace_link" data-servidor="42" rel="nofollow" target="_blank" title="Ver..." href="LINK I WANT TO SAVE0"><i class="icon-play"></i>&nbsp;&nbsp;Ver</a></td>
            </tr>

            <tr>
            <td class="tdidioma"><span class="flag flag_1">1</span></td>
            <td class="tdenlace"><a class="btn btn-mini enlace_link" data-servidor="42" rel="nofollow" target="_blank" title="Ver..." href="LINK I WANT TO SAVE1"><i class="icon-play"></i>&nbsp;&nbsp;Ver</a></td>
            </tr>

            <tr>
            <td class="tdidioma"><span class="flag flag_2">2</span></td>
            <td class="tdenlace"><a class="btn btn-mini enlace_link" data-servidor="42" rel="nofollow" target="_blank" title="Ver..." href="LINK I WANT TO SAVE2"><i class="icon-play"></i>&nbsp;&nbsp;Ver</a></td>
            </tr>
</tbody>
</table>

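As you can see, each <tr> contains one <td> with the language and one <td> with the link. The problem is that I don't know how to relate the language to the link. What I mean is: if the number in the language cell is 1, I want to return that row's link; if not, do nothing. But I can only get the <td> with the language, not the whole <tr>, and the <tr> is what matters. I'm not sure I've made my point, because I don't really know how to explain it.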

The code I have now gets the <tbody> from my main URL, but I don't know how to go on from there.

Thanks, and sorry for my bad English!

Edit: here is a sample of my code, so you can see which libraries I'm using and everything else.

from bs4 import BeautifulSoup
import urllib2

url = raw_input("Introduce URL to analyse: ")
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
body = soup.tbody
#HERE SHOULD BE WHAT I DON'T KNOW HOW TO DO
page.close()

Answer 1

Try something like this:

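result = None
for row in soup.tbody.find_all('tr'):
    # Each body row holds exactly two cells: the language and the link.
    lang, link = row.find_all('td')
    # .string reaches through the single <span> child to its text.
    if lang.string == '1':
        result = link.a['href']
print(result)  # the matching href, or None if no row has language 1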
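
For readers on Python 3, the same row-by-row idea can be written as a self-contained function. This is only a minimal sketch, not the answerer's exact code: `link_for_language` and `SAMPLE_HTML` are names made up for illustration, the HTML is a trimmed copy of the table from the question, and unlike the loop above it returns the first matching row rather than the last.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's table; the href values are the
# question's own placeholders, not real URLs.
SAMPLE_HTML = """
<table>
<thead><tr><th class="al">Language</th><th class="ac">Link</th></tr></thead>
<tbody>
<tr>
<td class="tdidioma"><span class="flag flag_0">0</span></td>
<td class="tdenlace"><a class="enlace_link" href="LINK I WANT TO SAVE0">Ver</a></td>
</tr>
<tr>
<td class="tdidioma"><span class="flag flag_1">1</span></td>
<td class="tdenlace"><a class="enlace_link" href="LINK I WANT TO SAVE1">Ver</a></td>
</tr>
<tr>
<td class="tdidioma"><span class="flag flag_2">2</span></td>
<td class="tdenlace"><a class="enlace_link" href="LINK I WANT TO SAVE2">Ver</a></td>
</tr>
</tbody>
</table>
"""


def link_for_language(html, wanted):
    """Return the href of the first row whose language cell equals `wanted`, else None."""
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.tbody.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 2:
            continue  # not a language/link row
        lang, link = cells[0], cells[1]
        if lang.get_text(strip=True) == wanted and link.a is not None:
            return link.a.get("href")
    return None


print(link_for_language(SAMPLE_HTML, "1"))
```

Returning `None` when nothing matches mirrors the "if not, do nothing" requirement from the question; the caller decides what to do with a missing link.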

Answer 2

Try working with the soup like this; you may need some exception handling here.

trs = soup.select('tr')  # trs is a list of bs4.element.Tag elements

Now iterate over the list:

for itm in trs:
    tds = itm.select('td')
    # Skip rows without two <td> cells, such as the header row of <th> cells.
    if len(tds) >= 2:
        tdidioma, tdenlace = tds[0], tds[1]
        print(tdidioma.string)
        print(tdenlace.a['href'])
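
The exception handling mentioned above could also be replaced by explicit checks. The following is a hedged Python 3 sketch under a few assumptions: `links_by_language` and `SAMPLE_HTML` are hypothetical names, the sample is a cut-down version of the question's table with one deliberately incomplete row, and the CSS classes `tdidioma` and `tdenlace` are taken from the question's HTML. It builds a dictionary from language code to link and skips any row that is missing either piece.

```python
from bs4 import BeautifulSoup

# Cut-down sample: the last row has no link cell on purpose.
SAMPLE_HTML = """
<table>
<thead><tr><th>Language</th><th>Link</th></tr></thead>
<tbody>
<tr>
<td class="tdidioma"><span class="flag flag_0">0</span></td>
<td class="tdenlace"><a href="LINK I WANT TO SAVE0">Ver</a></td>
</tr>
<tr>
<td class="tdidioma"><span class="flag flag_1">1</span></td>
<td class="tdenlace"><a href="LINK I WANT TO SAVE1">Ver</a></td>
</tr>
<tr>
<td class="tdidioma"><span class="flag flag_2">2</span></td>
</tr>
</tbody>
</table>
"""


def links_by_language(html):
    """Map each language cell's text to the href found in the same row."""
    soup = BeautifulSoup(html, "html.parser")
    mapping = {}
    for row in soup.select("tbody tr"):
        lang_cell = row.select_one("td.tdidioma")
        anchor = row.select_one("td.tdenlace a[href]")
        if lang_cell is None or anchor is None:
            continue  # incomplete row: skip it instead of raising
        mapping[lang_cell.get_text(strip=True)] = anchor["href"]
    return mapping


links = links_by_language(SAMPLE_HTML)
print(links.get("1"))
```

Looking a language up with `links.get(...)` gives `None` for languages that are absent or whose row was incomplete, so no `try`/`except` is needed at the call site.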

Answer 3

I'm assuming you want to check whether the URL contains 1 and, if so, save it. Is that what you want?

You can try the following code:

soup = BeautifulSoup(YOUR_TEXT_HERE)  # YOUR_TEXT_HERE: the HTML string to parse
tbody_soup = soup.find('tbody')
links = tbody_soup.find_all('a')
links_to_save = []

for item in links:
    print(item.attrs['href'])  # prints the url
    print(item.get_text())  # prints the text of the link
    print(item.attrs)  # prints a dictionary with all the attributes

    # check if 1 is in url?
    if '1' in item.attrs['href']:
        links_to_save.append(item.attrs['href'])

print(links_to_save)
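
If the filter should apply to the language cell rather than to the URL text, one possible approach is to start from the matching cell and walk up to its row, since the question mentions only being able to reach the language <td>. The sketch below is written for Python 3 and rests on assumptions: `link_from_language_cell` and `SAMPLE_HTML` are illustrative names, and the `tdidioma` class comes from the question's HTML.

```python
from bs4 import BeautifulSoup

# Two rows from the question's table, with its placeholder hrefs.
SAMPLE_HTML = """
<table>
<tbody>
<tr>
<td class="tdidioma"><span class="flag flag_0">0</span></td>
<td class="tdenlace"><a href="LINK I WANT TO SAVE0">Ver</a></td>
</tr>
<tr>
<td class="tdidioma"><span class="flag flag_1">1</span></td>
<td class="tdenlace"><a href="LINK I WANT TO SAVE1">Ver</a></td>
</tr>
</tbody>
</table>
"""


def link_from_language_cell(html, wanted):
    """Find the language <span> equal to `wanted`, then read the link from its row."""
    soup = BeautifulSoup(html, "html.parser")
    for span in soup.select("td.tdidioma span"):
        if span.get_text(strip=True) != wanted:
            continue
        # find_parent climbs from the <span> to the enclosing <tr>.
        row = span.find_parent("tr")
        anchor = row.find("a", href=True) if row is not None else None
        if anchor is not None:
            return anchor["href"]
    return None


print(link_from_language_cell(SAMPLE_HTML, "1"))
```

Because the comparison is made on the cell text, a URL that merely happens to contain the character 1 is not picked up.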

3 answers

Answer 1
1 2014-06-27 12:51:25

Answer 2
0 2014-06-27 13:30:49

Answer 3
0 2014-06-27 14:25:21
