
Web scraping with Beautiful Soup 4 in Python

I just started using Beautiful Soup 4 and ran into a problem that I've been trying to solve for a few days without success. Let me first paste the HTML I want to analyse:

<table class="table table-condensed table-hover tenlaces tablesorter">
<thead>
<tr>
<th class="al">Language</th>
<th class="ac">Link</th>
</tr>
</thead>
<tbody>


            <tr>
            <td class="tdidioma"><span class="flag flag_0">0</span></td>
            <td class="tdenlace"><a class="btn btn-mini enlace_link" data-servidor="42" rel="nofollow" target="_blank" title="Ver..." href="LINK I WANT TO SAVE0"><i class="icon-play"></i>&nbsp;&nbsp;Ver</a></td>
            </tr>

            <tr>
            <td class="tdidioma"><span class="flag flag_1">1</span></td>
            <td class="tdenlace"><a class="btn btn-mini enlace_link" data-servidor="42" rel="nofollow" target="_blank" title="Ver..." href="LINK I WANT TO SAVE1"><i class="icon-play"></i>&nbsp;&nbsp;Ver</a></td>
            </tr>

            <tr>
            <td class="tdidioma"><span class="flag flag_2">2</span></td>
            <td class="tdenlace"><a class="btn btn-mini enlace_link" data-servidor="42" rel="nofollow" target="_blank" title="Ver..." href="LINK I WANT TO SAVE2"><i class="icon-play"></i>&nbsp;&nbsp;Ver</a></td>
            </tr>
</tbody>
</table>

As you can see, each <tr> contains two <td> cells: Language and Link. The problem is that I don't know how to relate the language to the link. For example, I'd like to return the link if the language value is 1, and do nothing otherwise. But I'm only able to get the <td> with the language, not the whole <tr>, which is the important thing. I'm not sure I've made my point, because I don't really know how to explain it.

The code I have now gets the <tbody> from my main URL, but I don't know how to do what I'm asking.

Thanks, and sorry for my bad English!

EDIT: Here is a sample of my code so you can see which libraries I'm using:

from bs4 import BeautifulSoup
import urllib2

url = raw_input("Introduce URL to analyse: ")
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())  # class name is BeautifulSoup (capital S)
body = soup.tbody
#HERE SHOULD BE WHAT I DON'T KNOW HOW TO DO
page.close()

Try something like this:

result = None
for row in soup.tbody.find_all('tr'):
    lang, link = row.find_all('td')
    if lang.string == '1':
        result = link.a['href']
print result
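
The same row-by-row idea can be sketched as a self-contained example. This is only an illustration under some assumptions: it uses Python 3 syntax and the built-in "html.parser" backend, the function name links_by_language is made up for this example, and the markup is a shortened copy of the table from the question with placeholder hrefs.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question; the hrefs are placeholders.
HTML = """
<table>
<thead><tr><th class="al">Language</th><th class="ac">Link</th></tr></thead>
<tbody>
<tr><td class="tdidioma"><span class="flag flag_0">0</span></td>
    <td class="tdenlace"><a class="enlace_link" href="LINK0">Ver</a></td></tr>
<tr><td class="tdidioma"><span class="flag flag_1">1</span></td>
    <td class="tdenlace"><a class="enlace_link" href="LINK1">Ver</a></td></tr>
<tr><td class="tdidioma"><span class="flag flag_2">2</span></td>
    <td class="tdenlace"><a class="enlace_link" href="LINK2">Ver</a></td></tr>
</tbody>
</table>
"""


def links_by_language(html):
    """Map each language value to the list of hrefs found in the same row."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for row in soup.find_all("tr"):
        lang_cell = row.find("td", class_="tdidioma")
        link_cell = row.find("td", class_="tdenlace")
        # Skip rows (such as the header row) that lack either cell or the anchor.
        if lang_cell is None or link_cell is None or link_cell.a is None:
            continue
        href = link_cell.a.get("href")
        if href is None:
            continue
        language = lang_cell.get_text(strip=True)
        result.setdefault(language, []).append(href)
    return result


links = links_by_language(HTML)
print(links.get("1", []))
```

Collecting the links into lists keeps every row when several rows share the same language value; links.get("1", []) then returns an empty list if the language is not present at all.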

Try using the soup like this. You will probably need some exception handling here:

trs = soup.select('tr') # here trs is a list of bs4.element.Tag type element

Now iterate over the list:

for itm in trs:
    tds = itm.select('td')
    if len(tds) >= 2:  # skip rows (e.g. the header row) with fewer than two td tags
        tdidioma, tdenlace = tds[0], tds[1]
        print tdidioma.string     # the language value
        print tdenlace.a['href']  # the link in the same row
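
As for the exception handling mentioned above, one possible shape is sketched below. It is an assumption-laden illustration rather than the only way to do it: Python 3 syntax, the "html.parser" backend, a made-up function name extract_pairs, and a made-up SAMPLE string containing deliberately incomplete rows.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with incomplete rows, used only to exercise the error paths.
SAMPLE = """
<table><tbody>
<tr><th>Language</th><th>Link</th></tr>
<tr><td>0</td><td><a href="LINK0">Ver</a></td></tr>
<tr><td>1</td></tr>
<tr><td>2</td><td>no anchor</td></tr>
<tr><td>3</td><td><a>no href</a></td></tr>
</tbody></table>
"""


def extract_pairs(html):
    """Return (language, href) tuples, skipping rows that are incomplete."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for row in soup.select("tr"):
        cells = row.select("td")
        try:
            language = cells[0].get_text(strip=True)
            href = cells[1].a["href"]
        except IndexError:
            continue  # fewer than two td cells in this row
        except TypeError:
            continue  # second cell has no <a>, so .a is None
        except KeyError:
            continue  # the <a> tag has no href attribute
        pairs.append((language, href))
    return pairs


print(extract_pairs(SAMPLE))
```

Each except branch corresponds to one way a row can be malformed, so a single bad row is skipped instead of aborting the whole loop.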

I'm assuming you want to check if the URL contains 1 and save it if it does. Is this what you want?

You can try playing with this code:

soup = BeautifulSoup(YOUR_TEXT_HERE)
tbody_soup = soup.find('tbody')
links = tbody_soup.find_all('a')
links_to_save = []

for item in links:
    print item.attrs['href'] # prints the url
    print item.get_text() # prints the text of the link
    print item.attrs # prints a dictionary with all the attributes

    # check if 1 is in url?
    if '1' in item.attrs['href']:
        links_to_save.append(item.attrs['href'])

print links_to_save
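
If the value to match lives in the language cell rather than in the URL itself, a variation is to start from the <span> and walk up to its row with find_parent. The sketch below assumes Python 3, the "html.parser" backend, and the class names from the question's table; link_for_language is a hypothetical helper name and the hrefs are placeholders.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question; the hrefs are placeholders.
HTML = """
<table><tbody>
<tr><td class="tdidioma"><span class="flag flag_0">0</span></td>
    <td class="tdenlace"><a class="enlace_link" href="LINK0">Ver</a></td></tr>
<tr><td class="tdidioma"><span class="flag flag_1">1</span></td>
    <td class="tdenlace"><a class="enlace_link" href="LINK1">Ver</a></td></tr>
</tbody></table>
"""


def link_for_language(html, wanted):
    """Return the href from the first row whose language span equals `wanted`."""
    soup = BeautifulSoup(html, "html.parser")
    for span in soup.select("td.tdidioma span"):
        if span.get_text(strip=True) == wanted:
            row = span.find_parent("tr")        # climb from the span to its row
            anchor = row.find("a", href=True)   # first link in that row with an href
            if anchor is not None:
                return anchor["href"]
    return None  # no matching language found


print(link_for_language(HTML, "1"))
```

Going from the matched cell up to the enclosing <tr> gives access to the whole row, which is what the question was missing when only the language <td> was being returned.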
