How do I scrape pages with dynamically generated URLs using Python?

I am trying to scrape http://www.dailyfinance.com/quote/NYSE/international-business-machines/IBM/financial-ratios , but the usual technique of building the URL string doesn't work because the full company name is inserted into the path, and the exact full company name isn't known in advance. Only the company symbol, "IBM", is known.

Essentially, I scrape by looping through an array of company symbols and building the URL string for each one before sending it to urllib2.urlopen(url). In this case, that can't be done.

For example, the URL for CSCO is

http://www.dailyfinance.com/quote/NASDAQ/cisco-systems-inc/CSCO/financial-ratios

and the URL for AAPL is:

http://www.dailyfinance.com/quote/NASDAQ/apple/AAPL/financial-ratios

So to get the URL, I had to search for the symbol in the input box on the main page:

http://www.dailyfinance.com/

When I type "CSCO" into the search input (while on http://www.dailyfinance.com/quote/NASDAQ/apple/AAPL/financial-ratios ) and watch the Network tab of the Firefox web developer tools, I notice that a GET request is sent to

http://j.foolcdn.com/tmf/predictivesearch?callback=_predictiveSearch_csco&term=csco&domain=dailyfinance.com

and that the Referer header actually gives the path that I want to capture:

Host: j.foolcdn.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:28.0) Gecko/20100101 Firefox/28.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://www.dailyfinance.com/quote/NASDAQ/cisco-systems-inc/CSCO/financial-ratios?source=itxwebtxt0000007
Connection: keep-alive

Sorry for the long explanation. So the question is: how do I extract the URL in the Referer? If that is not possible, how should I approach this problem? Is there another way?

I really appreciate your help.

I like this question, and because of that, I'll give a very thorough answer. For this, I'll use my favorite Requests library along with BeautifulSoup4. Porting it over to Mechanize, if you really want to use that, is up to you. Requests will save you tons of headaches, though.


First off, you might expect the search to need a POST request. However, POST requests are often not needed if a search function brings you right away to the page you're looking for. So let's inspect it, shall we?

When I land on the base URL, http://www.dailyfinance.com/ , a simple check via Firebug or Chrome's inspect tool shows that when I put CSCO or AAPL in the search bar and enable the "jump", there's a 301 Moved Permanently status code. What does this mean?

[image]

In simple terms, I was redirected somewhere else. The URL for this GET request is the following:

http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input=CSCO
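The query string of that jump URL can also be assembled with the standard library instead of by hand, which takes care of escaping. This is only a sketch in Python 3: the helper name `build_jump_url` is made up for illustration, and it assumes the endpoint keeps the two parameter names seen above.

```python
from urllib.parse import urlencode

# Base of the "jump" endpoint observed in the browser's network tab.
JUMP_URL = "http://www.dailyfinance.com/quote/jump"


def build_jump_url(ticker, exchange=""):
    """Return the jump URL for a ticker (hypothetical helper).

    urlencode percent-escapes the values, so unusual symbols do not
    break the query string.
    """
    query = urlencode({"exchange-input": exchange, "ticker-input": ticker})
    return JUMP_URL + "?" + query
```

Passing the result to `rq.get` and reading `r.url` should then give the redirected address, provided the site still redirects as it did here.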

Now, we test whether it works with AAPL by using simple URL manipulation.

import requests as rq

apl_tick = "AAPL"
url = "http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input="
r = rq.get(url + apl_tick)  # Requests follows the 301 redirect automatically
print(r.url)  # final URL after the redirect

The above gives the following result:

http://www.dailyfinance.com/quote/nasdaq/apple/aapl
[Finished in 2.3s]

See how the URL of the response changed? Let's take the URL manipulation one step further and look for the /financial-ratios page by appending the following to the code above:

new_url = r.url + "/financial-ratios"
p = rq.get(new_url)
print(p.url)

When run, this gives the following result:

http://www.dailyfinance.com/quote/nasdaq/apple/aapl
http://www.dailyfinance.com/quote/nasdaq/apple/aapl/financial-ratios
[Finished in 6.0s]
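Plain string concatenation works here because the redirected URL has no trailing slash or query string. The Referer in the question did carry a `?source=...` suffix, so a slightly more defensive variant might look like the sketch below (Python 3; `append_path_segment` is a hypothetical helper, not part of Requests).

```python
from urllib.parse import urlsplit, urlunsplit


def append_path_segment(url, segment):
    """Append a path segment to a URL, keeping any query string intact."""
    parts = urlsplit(url)
    # Normalise slashes so exactly one separates the old path and the segment.
    path = parts.path.rstrip("/") + "/" + segment.strip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

With this, `append_path_segment(r.url, "financial-ratios")` could stand in for `r.url + "/financial-ratios"`.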

Now we're on the right track. I will now try to parse the data using BeautifulSoup. My complete code is as follows:

from bs4 import BeautifulSoup as bsoup
import requests as rq

apl_tick = "AAPL"
url = "http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input="
r = rq.get(url + apl_tick)
new_url = r.url + "/financial-ratios"
p = rq.get(new_url)

soup = bsoup(p.content, "html.parser")  # name the parser explicitly
# Despite its name, `div` holds the <table> nested inside the target div.
div = soup.find("div", id="clear").table
rows = div.find_all("tr")
for row in rows:
    print(row)

I then try running this code, only to encounter an error with the following traceback:

  File "C:\Users\nanashi\Desktop\test.py", line 13, in <module>
    div = soup.find("div", id="clear").table
AttributeError: 'NoneType' object has no attribute 'table'

Of note is the part that reads 'NoneType' object... . It means soup.find returned None, so our target div does not exist in the HTML we downloaded! Egads, but why am I seeing the following?!

[image]

There can only be one explanation: the table is loaded dynamically! Rats. Let's see if we can find another source for the table. I study the page and see that there are scrollbars at the bottom. This might mean that the table was loaded inside a frame, or was loaded straight from another source entirely and placed into a div in the page.
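When a frame is suspected, one quick check is to list the `src` attributes declared in the HTML that was actually downloaded. The sketch below uses only the Python 3 standard library; the class name, helper name, and sample markup are invented for illustration and are not taken from the real page. It only sees sources written into the static HTML, so anything injected later by JavaScript still has to be found in the browser's network tab.

```python
from html.parser import HTMLParser


class SourceCollector(HTMLParser):
    """Collect src attributes of tags that pull in external content."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("iframe", "frame", "script"):
            src = dict(attrs).get("src")
            if src:  # inline scripts have no src and are skipped
                self.sources.append((tag, src))


def external_sources(html):
    """Return (tag, src) pairs found in an HTML string."""
    collector = SourceCollector()
    collector.feed(html)
    return collector.sources


# Made-up markup standing in for a downloaded page.
SAMPLE = """
<html><body>
<div id="ratios"><iframe src="http://example.com/table?SYMBOL=AAPL"></iframe></div>
<script src="/static/app.js"></script>
<script>var inline = true;</script>
</body></html>
"""
```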

I refresh the page and watch the GET requests again. Bingo, I found something that seems a bit promising:

[image]

A third-party source URL, and look, it's easily manipulated using the ticker symbol! Let's try loading it in a new tab. Here's what we get:

[image]

WOW! We now have the exact source of our data. The last hurdle is whether it will work when we try to pull the CSCO data using this string (remember, we went from CSCO to AAPL and are now back to CSCO again, so you're not confused). Let's clean up the string and drop www.dailyfinance.com from the picture completely. Our new URL is as follows:

http://www.motleyfool.idmanagedsolutions.com/stocks/financial_ratios.idms?SYMBOL_US=AAPL
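Since the question was about looping over a list of symbols, the same URL pattern can be wrapped in a small loop. This is a Python 3 sketch under the assumption that `SYMBOL_US` is the only parameter the endpoint needs; `ratios_url` and `fetch_all` are hypothetical names, and the download step is passed in as a function so the loop itself does not depend on any particular HTTP library.

```python
import time
from urllib.parse import urlencode

RATIOS_URL = "http://www.motleyfool.idmanagedsolutions.com/stocks/financial_ratios.idms"


def ratios_url(ticker):
    """Build the financial-ratios URL for one ticker symbol."""
    return RATIOS_URL + "?" + urlencode({"SYMBOL_US": ticker.upper()})


def fetch_all(tickers, fetch, delay=0.0):
    """Call fetch(url) for every ticker and return {ticker: result}.

    `fetch` is any callable taking a URL; `delay` is an optional pause
    in seconds between requests to avoid hammering the server.
    """
    pages = {}
    for ticker in tickers:
        pages[ticker] = fetch(ratios_url(ticker))
        if delay:
            time.sleep(delay)
    return pages
```

With Requests, a call might look like `fetch_all(["CSCO", "AAPL"], lambda u: rq.get(u).content, delay=1.0)`.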

Let's try using that in our final scraper!

from bs4 import BeautifulSoup as bsoup
import requests as rq

csco_tick = "CSCO"
url = "http://www.motleyfool.idmanagedsolutions.com/stocks/financial_ratios.idms?SYMBOL_US="
new_url = url + csco_tick

r = rq.get(new_url)
soup = bsoup(r.content, "html.parser")  # name the parser explicitly

# The ratios table sits inside the div with id "clear".
table = soup.find("div", id="clear").table
rows = table.find_all("tr")
for row in rows:
    print(row.get_text())

And our raw results for CSCO's financial ratios data are as follows:

Company
Industry


Valuation Ratios


P/E Ratio (TTM)
15.40
14.80


P/E High - Last 5 Yrs 
24.00
28.90


P/E Low - Last 5 Yrs
8.40
12.10


Beta
1.37
1.50


Price to Sales (TTM)
2.51
2.59


Price to Book (MRQ)
2.14
2.17


Price to Tangible Book (MRQ)
4.25
3.83


Price to Cash Flow (TTM)
11.40
11.60


Price to Free Cash Flow (TTM)
28.20
60.20


Dividends


Dividend Yield (%)
3.30
2.50


Dividend Yield - 5 Yr Avg (%)
N.A.
1.20


Dividend 5 Yr Growth Rate (%)
N.A.
144.07


Payout Ratio (TTM)
45.00
32.00


Sales (MRQ) vs Qtr 1 Yr Ago (%)
-7.80
-3.70


Sales (TTM) vs TTM 1 Yr Ago (%)
5.50
5.60


Growth Rates (%)


Sales - 5 Yr Growth Rate (%)
5.51
5.12


EPS (MRQ) vs Qtr 1 Yr Ago (%)
-54.50
-51.90


EPS (TTM) vs TTM 1 Yr Ago (%)
-54.50
-51.90


EPS - 5 Yr Growth Rate (%)
8.91
9.04


Capital Spending - 5 Yr Growth Rate (%)
20.30
20.94


Financial Strength


Quick Ratio (MRQ)
2.40
2.70


Current Ratio (MRQ)
2.60
2.90


LT Debt to Equity (MRQ)
0.22
0.20


Total Debt to Equity (MRQ)
0.31
0.25


Interest Coverage (TTM)
18.90
19.10


Profitability Ratios (%)


Gross Margin (TTM)
63.20
62.50


Gross Margin - 5 Yr Avg
66.30
64.00


EBITD Margin (TTM)
26.20
25.00


EBITD - 5 Yr Avg
28.82
0.00


Pre-Tax Margin (TTM)
21.10
20.00


Pre-Tax Margin - 5 Yr Avg
21.60
18.80


Management Effectiveness (%)


Net Profit Margin (TTM)
17.10
17.65


Net Profit Margin - 5 Yr Avg
17.90
15.40


Return on Assets (TTM)
8.30
8.90


Return on Assets - 5 Yr Avg
8.90
8.00


Return on Investment (TTM)
11.90
12.30


Return on Investment - 5 Yr Avg
12.50
10.90


Efficiency


Revenue/Employee (TTM)
637,890.00
556,027.00


Net Income/Employee (TTM)
108,902.00
98,118.00


Receivable Turnover (TTM)
5.70
5.80


Inventory Turnover (TTM)
11.30
9.70


Asset Turnover (TTM)
0.50
0.50

[Finished in 2.0s]

Cleaning up the data is up to you.
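One possible starting point for that cleanup, offered as a sketch rather than a finished parser: it assumes each row's `get_text()` puts every cell on its own line, as in the output above, treats single-cell rows as section headings, and keeps three-cell rows as (ratio, company, industry). The names `to_number` and `parse_rows` are made up for illustration, and the code is written for Python 3.

```python
def to_number(text):
    """Convert '637,890.00' to a float; return None for values like 'N.A.'."""
    cleaned = text.replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def parse_rows(row_texts):
    """Turn the text of each table row into a list of dict records."""
    records = []
    section = None
    for text in row_texts:
        fields = [line.strip() for line in text.splitlines() if line.strip()]
        if len(fields) == 1:
            # A lone cell is a section heading such as "Valuation Ratios".
            section = fields[0]
        elif len(fields) == 3:
            label, company, industry = fields
            records.append({
                "section": section,
                "ratio": label,
                "company": to_number(company),
                "industry": to_number(industry),
            })
        # Anything else (e.g. the "Company / Industry" header) is skipped.
    return records
```

In the scraper above this could be called as `parse_rows(row.get_text() for row in rows)`. The section label is simply the most recent single-cell row, so it is only as reliable as the table's own layout.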


One good lesson to learn from this scrape is that not all data is contained in one page alone. It's pretty nice to see it coming from another static site. If it had been produced via JavaScript or AJAX calls or the like, we would likely have had some difficulties with our approach.

Hopefully you learned something from this. Let us know if this helps, and good luck.

This doesn't answer your specific question, but it solves your problem. The site also accepts URLs of this form:

http://www.dailyfinance.com/quotes/{Company Symbol}/{Stock Exchange}

Examples:

http://www.dailyfinance.com/quotes/AAPL/NAS

http://www.dailyfinance.com/quotes/IBM/NYSE

http://www.dailyfinance.com/quotes/CSCO/NAS

To get to the financial ratios page, you could then employ something like this:

import urllib2

def financial_ratio_url(symbol, stock_exchange):
    starturl = 'http://www.dailyfinance.com/quotes/'
    starturl += '/'.join([symbol, stock_exchange])
    # urlopen follows the redirect; geturl() returns the final URL
    res = urllib2.urlopen(starturl)
    return '/'.join([res.geturl(), 'financial-ratios'])

Example:

financial_ratio_url('AAPL', 'NAS')
'http://www.dailyfinance.com/quote/nasdaq/apple/aapl/financial-ratios'
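The function above relies on `urllib2`, which exists only in Python 2. A rough Python 3 equivalent, assuming the same URL scheme and redirect behaviour still hold, could look like this; the split into `quote_start_url` and the `timeout` parameter are additions for illustration.

```python
from urllib.request import urlopen

QUOTES_URL = "http://www.dailyfinance.com/quotes/"


def quote_start_url(symbol, stock_exchange):
    """Build the short /quotes/{symbol}/{exchange} URL."""
    return QUOTES_URL + "/".join([symbol, stock_exchange])


def financial_ratio_url(symbol, stock_exchange, timeout=10):
    """Follow the redirect and return the financial-ratios URL."""
    with urlopen(quote_start_url(symbol, stock_exchange), timeout=timeout) as res:
        final_url = res.geturl()  # URL after any redirects
    return final_url.rstrip("/") + "/financial-ratios"
```

Only `financial_ratio_url` touches the network, so the URL building can be checked on its own.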
