Python Beautiful Soup: scraping URLs from a web page

Question

I am trying to scrape URLs from an HTML page using Beautiful Soup. Here is part of the HTML:

                         <li style="display: block;">
                                <article itemscope itemtype="http://schema.org/Article">
                                    <div class="col-md-3 col-sm-3 col-xs-12" >
                                        <a href="/stroke?p=3083" class="article-image">
                                            <img itemprop="image" src="/FileUploads/Post/3083.jpg?w=300&h=160&mode=crop" alt="Banana" title="Good for health">
                                        </a>
                                    </div>

                                    <div class="col-md-9 col-sm-9 col-xs-12">
                                        <div class="article-content">

                                                <a href="/stroke">
                                                    <img src="/assets/home/v2016/img/icon/stroke.png" style="float:left;margin-right:5px;width: 4%;">
                                                </a>
                                            <a href="/stroke?p=3083" class="article-title">
                                                <div>
                                                    <h4 itemprop="name" id="playground">
Banana Good for health                                                         </h4>
                                                </div>
                                            </a>
                                            <div>                                               
                                                <div class="clear"></div>
                                                <span itemprop="dateCreated" style="font-size:10pt;color:#777;">
                                                    <i class="fa fa-clock-o" aria-hidden="true"></i>
09/10                                                       </span>
                                            </div>
                                            <p itemprop="description" class="hidden-phone">
                                                <a href="/stroke?p=3083">
                                                    I love Banana.
                                                </a>
                                            </p>
                                        </div>
                                    </div>
                                </article>
                            </li>

My code:

import requests
from bs4 import BeautifulSoup

# "response" avoids shadowing the standard-library re module
response = requests.get('http://xxxxxx')
bs = BeautifulSoup(response.text, "html.parser")
for link in bs.find_all('a'):
    if link.has_attr('href'):
        print(link.attrs['href'])

This prints every URL on the page, which is not what I want. I only want particular ones, such as "/stroke?p=3083" in this example. How can I set that condition in Python? (I know "/stroke?p=3083" appears three times in this snippet, but I need it only once.)

Another question: this URL is not complete. I need to combine it with "http://www.abcde.com" so that the result is "http://www.abcde.com/stroke?p=3083". I know I can use paste in R, but how do I do this in Python? Thanks in advance! :)

Answer 1

Replace some_link in the scraper below with the URL of the page and give it a go. It should give you the link you want, in its full (absolute) form.

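import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

some_link = 'http://xxxxxx'  # placeholder: replace with the URL of the page to scrape

res = requests.get(some_link).text
soup = BeautifulSoup(res, "lxml")
# Each article has exactly one <a class="article-image">, so each link appears once
for item in soup.select(".article-image"):
    print(urljoin(some_link, item['href']))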
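To try the class-based selection without hitting the network, here is a minimal sketch that runs against a trimmed copy of the markup from the question. The names HTML, BASE_URL and article_urls are made up for this illustration, and it uses the built-in "html.parser" instead of "lxml" only so that no extra parser needs to be installed.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (illustration only).
HTML = """
<li style="display: block;">
  <article itemscope itemtype="http://schema.org/Article">
    <a href="/stroke?p=3083" class="article-image"><img alt="Banana"></a>
    <a href="/stroke"><img src="/assets/home/v2016/img/icon/stroke.png"></a>
    <a href="/stroke?p=3083" class="article-title"><h4>Banana Good for health</h4></a>
    <p><a href="/stroke?p=3083">I love Banana.</a></p>
  </article>
</li>
"""

BASE_URL = "http://www.abcde.com"  # placeholder site name used in the question


def article_urls(html, base_url):
    """Return absolute URLs of the <a class="article-image"> anchors."""
    soup = BeautifulSoup(html, "html.parser")
    # "a.article-image[href]" matches only anchors that have that class and an href
    anchors = soup.select("a.article-image[href]")
    return [urljoin(base_url, a["href"]) for a in anchors]


print(article_urls(HTML, BASE_URL))
```

Because only one of the three anchors pointing at "/stroke?p=3083" has the article-image class, the link comes back once per article rather than three times.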

Answer 2

"This URL is not complete. I need to combine it with "http://www.abcde.com" so that the result is "http://www.abcde.com/stroke?p=3083". I know I can use paste in R, but how do I do this in Python?"

# link is a Tag from bs.find_all('a'); join its href onto the site root
link = 'http://www.abcde.com' + link['href']
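Plain concatenation works when every href starts with "/". As a hedged alternative, urljoin from the standard library also copes with hrefs that are relative to the current page or already absolute. The helper name absolutize and the example paths below are invented for illustration; only "http://www.abcde.com" and "/stroke?p=3083" come from the question.

```python
from urllib.parse import urljoin


def absolutize(page_url, href):
    """Resolve href against the URL of the page it was found on."""
    return urljoin(page_url, href)


# Root-relative href, as in the question
print(absolutize("http://www.abcde.com", "/stroke?p=3083"))

# Href relative to the current page's directory (hypothetical page path)
print(absolutize("http://www.abcde.com/news/index.html", "stroke?p=3083"))

# An href that is already absolute is returned unchanged (hypothetical host)
print(absolutize("http://www.abcde.com", "http://other.example/page"))
```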

Answer 3

You have most of it right already. Collect the links as follows (this is just a list-comprehension version of what you are already doing):

urls = [a['href'] for a in bs.find_all('a') if a.has_attr('href')]

This gives you the URLs. To take one of them and append it to the abcde URL, you can simply do the following:

if urls:
    new_url = 'http://www.abcde.com{}'.format(urls[0])
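Taking the first element returns whichever link happens to come first on the page. If the goal is "every article link, once each", one possible sketch is to de-duplicate the href strings and keep only those that look like "/stroke?p=...". The function name unique_article_links and its path and param defaults are assumptions based on the single example in the question, not something the site is known to guarantee.

```python
from urllib.parse import urlparse, parse_qs


def unique_article_links(hrefs, path="/stroke", param="p"):
    """Keep each href once, in page order, if it has the given path and query parameter."""
    result = []
    # dict.fromkeys drops duplicates while preserving first-seen order
    for href in dict.fromkeys(hrefs):
        parts = urlparse(href)
        if parts.path == path and param in parse_qs(parts.query):
            result.append(href)
    return result


# The four hrefs found in the snippet from the question, in page order
hrefs = ["/stroke?p=3083", "/stroke", "/stroke?p=3083", "/stroke?p=3083"]
print(unique_article_links(hrefs))
```

The filtered hrefs can then be joined onto the site root in the same way as new_url above.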

3 answers

Answer 1: score 2, accepted, 2017-10-12 08:32:39
Answer 2: score 0, 2017-10-12 08:20:23
Answer 3: score 0, 2017-10-12 08:31:07
