简体   繁体   English

如何使用R从网站抓取Web表

[英]how do to scrape web table from website using R

I am trying to scrape the table found from the following site: https://finance.yahoo.com/gainers?e=us 我正在尝试抓取从以下网站找到的表: https://finance.yahoo.com/gainers?e=us : https://finance.yahoo.com/gainers?e=us

However, I have searched for answers on several different methods to scrape tables from site on here and none of the methods have worked for me. 但是,我已经在几种不同的方法中搜索答案,以从此处站点刮取表格,但这些方法都不适合我。

I have tried: 我努力了:

library(xml2)
url <- "https://finance.yahoo.com/gainers?e=us"
tbl <- read_html(url)

also: 也:

 library(XML)
 url <- "https://finance.yahoo.com/gainers?e=us"
 tbl <- readHTMLList(url)

and other packages such as rvest however I cannot get the table to show! 和其他软件包,例如rvest但是我无法显示该表!

They store that data in a javascript block in the page itself. 他们将数据存储在页面本身的javascript块中。 You have two choices: either use RSelenium and then deal with a really complex table that you have to unwind or perform some string surgery combined with some JSON mangling after getting some help from the V8 package: 您有两种选择:要么使用RSelenium,然后处理您必须放松的非常复杂的表,要么在获得V8软件包的帮助后执行一些字符串外科手术以及JSON处理操作:

library(V8)
library(xml2)
library(stringi)
library(jsonlite)
library(ndjson)

pg <- read_html("https://finance.yahoo.com/gainers?e=us")

This next bit extracts the data from the salient <script> tag. 下一位从显着的<script>标记中提取数据。 I'm doing this by position, which means this is the first potential thing to break if Yahoo! 我正在按位置进行操作,这意味着这是Yahoo!打破的第一个潜在问题。 ever changes it format (to, say, insert useless Verizon Wireless ads via javascript if VZ ever does finish buying them). 更改格式(例如,如果VZ完成购买,则通过javascript插入无用的Verizon Wireless广告)。 You can change this to look for text indicators, but that leads to the second issue… 您可以更改它以查找文本指示器,但这会导致第二个问题……

The second issue with "scraping" this way is that if the javascript for the data ever changes this will also break. 用这种方式“抓取”的第二个问题是,如果数据的javascript发生了变化,这也会中断。 However that's also true for HTML-based scraping. 但是 ,对于基于HTML的抓取也是如此。 This method doesn't require starting up a web server and a control server for navigating the web server. 此方法不需要启动Web服务器和用于导航Web服务器的控制服务器。 Anyway… 无论如何…

html_nodes(pg, "script")[13] %>% 
  html_text() %>% 
  stri_replace_first_fixed("(function (root) {", "var root = { App : {}};\n") %>% 
  stri_replace_last_fixed("}(this));", "") -> js

We had to do ^^ since the javascript code in the <script> tag was expecting to be in a browser (which we won't be). 我们必须做^^,因为<script>标记中的javascript代码应该位于浏览器中(我们不会)。 Now, we evaluate that large javascript chunk via V8 and extract the main data element: 现在,我们通过V8评估较大的javascript块并提取主要数据元素:

ctx <- v8()
ctx$eval(JS(js))
root <- ctx$get("root", flatten=TRUE)

That data element holds all of the data for that page (which is really a single page app). 该数据元素保存该页面(实际上是单个页面应用程序)的所有数据。 So we have to find the data we care about which is really far down a nested javascript hole: 因此,我们必须找到我们关心的数据,而这些数据确实位于嵌套的javascript漏洞的深处:

quotes <- root$App$main$context$dispatcher$stores$`QuoteDataStore-Immutable`$quoteData

There are numerous ways to turn that nested list data into a nice, rectangular data frame. 有很多方法可以将嵌套列表数据转换为漂亮的矩形数据框。 The method below chooses the "turn it into JSON then bring it back from JSON in a completely "flat" format. Feel free to use other methods you can find on SO. 下面的方法选择“将其转换为JSON,然后以完全“扁平”的格式将其从JSON带回来。可以随意使用在SO上可以找到的其他方法。

The code starts by telling R to ignore the non-stocks (since you, presumably, want just the quotes in the pretty table on that page): 代码从告诉R忽略非库存开始(因为您大概只需要该页面上漂亮表中的引号):

discard(names(quotes), ~grepl("[\\^\\=]", .)) %>% 
  map_df(~ndjson::flatten(toJSON(quotes[[.]]))) %>% 
  glimpse()

## Observations: 30
## Variables: 53
## $ averageDailyVolume3Month.fmt.0      <chr> "437,939", "541,801", "1.033M", "992,278", "1.40...
## $ averageDailyVolume3Month.longFmt.0  <chr> "437,939", "541,801", "1,033,453", "992,278", "1...
## $ averageDailyVolume3Month.raw.0      <dbl> 437939, 541801, 1033453, 992278, 1402175, 537906...
## $ exchange.0                          <chr> "NYQ", "NMS", "NYQ", "NGM", "NGM", "NMS", "NGM",...
## $ exchangeTimezoneName.0              <chr> "America/New_York", "America/New_York", "America...
## $ exchangeTimezoneShortName.0         <chr> "EST", "EST", "EST", "EST", "EST", "EST", "EST",...
## $ fiftyTwoWeekHigh.fmt.0              <chr> "15.36", "46.12", "15.82", "7.64", "5.97", "9.86...
## $ fiftyTwoWeekHigh.raw.0              <dbl> 15.360, 46.120, 15.820, 7.640, 5.970, 9.860, 5.0...
## $ fiftyTwoWeekHighChange.fmt.0        <chr> "-6.96", "-17.72", "-4.19", "-2.83", "-4.29", "-...
## $ fiftyTwoWeekHighChange.raw.0        <dbl> -6.960, -17.720, -4.190, -2.830, -4.290, -3.640,...
## $ fiftyTwoWeekHighChangePercent.fmt.0 <chr> "-45.31%", "-38.42%", "-26.49%", "-37.04%", "-71...
## $ fiftyTwoWeekHighChangePercent.raw.0 <dbl> -0.4531, -0.3842, -0.2649, -0.3704, -0.7186, -0....
## $ fiftyTwoWeekLow.fmt.0               <chr> "6.59", "23.75", "9.46", "4.22", "0.25", "4.97",...
## $ fiftyTwoWeekLow.raw.0               <dbl> 6.590, 23.750, 9.460, 4.220, 0.251, 4.970, 0.970...
## $ fiftyTwoWeekLowChange.fmt.0         <chr> "1.81", "4.65", "2.17", "0.59", "1.43", "1.25", ...
## $ fiftyTwoWeekLowChange.raw.0         <dbl> 1.810, 4.650, 2.170, 0.590, 1.429, 1.250, 0.190,...
## $ fiftyTwoWeekLowChangePercent.fmt.0  <chr> "27.47%", "19.58%", "22.94%", "13.98%", "569.32%...
## $ fiftyTwoWeekLowChangePercent.raw.0  <dbl> 0.2747, 0.1958, 0.2294, 0.1398, 5.6932, 0.2515, ...
## $ fullExchangeName.0                  <chr> "NYSE", "NasdaqGS", "NYSE", "NasdaqGM", "NasdaqG...
## $ gmtOffSetMilliseconds.0             <dbl> -1.8e+07, -1.8e+07, -1.8e+07, -1.8e+07, -1.8e+07...
## $ invalid.0                           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ isLoading.0                         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ language.0                          <chr> "en-US", "en-US", "en-US", "en-US", "en-US", "en...
## $ longName.0                          <chr> "Bankrate, Inc.", "Air Methods Corp.", "Bristow ...
## $ market.0                            <chr> "us_market", "us_market", "us_market", "us_marke...
## $ marketCap.fmt.0                     <chr> "758.268M", "1.082B", "407.786M", "494.03M", "33...
## $ marketCap.longFmt.0                 <chr> "758,267,968", "1,081,625,344", "407,786,176", "...
## $ marketCap.raw.0                     <dbl> 758267968, 1081625344, 407786176, 494030272, 338...
## $ marketState.0                       <chr> "CLOSED", "CLOSED", "CLOSED", "CLOSED", "CLOSED"...
## $ messageBoardId.0                    <chr> "finmb_30061", "finmb_24494", "finmb_292980", "f...
## $ quoteType.0                         <chr> "EQUITY", "EQUITY", "EQUITY", "EQUITY", "EQUITY"...
## $ regularMarketChange.fmt.0           <chr> "1.20", "3.65", "1.44", "0.59", "0.27", "0.80", ...
## $ regularMarketChange.raw.0           <dbl> 1.20, 3.65, 1.44, 0.59, 0.27, 0.80, 0.15, 0.04, ...
## $ regularMarketChangePercent.fmt.0    <chr> "16.67%", "14.75%", "14.13%", "13.98%", "19.15%"...
## $ regularMarketChangePercent.raw.0    <dbl> 16.6667, 14.7475, 14.1315, 13.9810, 19.1489, 14....
## $ regularMarketDayHigh.fmt.0          <chr> "9.90", "30.25", "12.60", "5.29", "1.85", "6.30"...
## $ regularMarketDayHigh.raw.0          <dbl> 9.9000, 30.2500, 12.6000, 5.2900, 1.8500, 6.2950...
## $ regularMarketDayLow.fmt.0           <chr> "7.90", "25.30", "9.46", "4.30", "1.33", "5.79",...
## $ regularMarketDayLow.raw.0           <dbl> 7.9000, 25.3000, 9.4600, 4.3000, 1.3300, 5.7900,...
## $ regularMarketPrice.fmt.0            <chr> "8.40", "28.40", "11.63", "4.81", "1.68", "6.22"...
## $ regularMarketPrice.raw.0            <dbl> 8.40, 28.40, 11.63, 4.81, 1.68, 6.22, 1.16, 1.20...
## $ regularMarketTime.fmt.0             <chr> "4:02PM EDT", "4:00PM EDT", "4:00PM EDT", "4:00P...
## $ regularMarketTime.raw.0             <dbl> 1478289722, 1478289600, 1478289614, 1478289600, ...
## $ regularMarketVolume.fmt.0           <chr> "2.383M", "3.32M", "3.411M", "4.567M", "4.232M",...
## $ regularMarketVolume.longFmt.0       <chr> "2,382,856", "3,320,072", "3,411,052", "4,566,79...
## $ regularMarketVolume.raw.0           <dbl> 2382856, 3320072, 3411052, 4566790, 4232135, 185...
## $ sharesOutstanding.fmt.0             <chr> "90.27M", "38.085M", "35.063M", "102.709M", "20....
## $ sharesOutstanding.longFmt.0         <chr> "90,270,000", "38,085,400", "35,063,300", "102,7...
## $ sharesOutstanding.raw.0             <dbl> 90270000, 38085400, 35063300, 102709000, 2016800...
## $ shortName.0                         <chr> "Bankrate, Inc. Common Stock", "Air Methods Corp...
## $ sourceInterval.0                    <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, ...
## $ symbol.0                            <chr> "RATE", "AIRM", "BRS", "CERS", "EBIO", "ELNK", "...
## $ uuid.0                              <chr> "79521cde-a3ef-383f-917d-31c49f9082f5", "f0432c1...

I think you still have some data conversion to do (depending on what you're after), but you have the data that you're looking for. 我认为您仍然需要进行一些数据转换(取决于您要执行的操作),但是您拥有所需的数据。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM