
How to scrape a web table from a website using R

I am trying to scrape the table found on the following website: https://finance.yahoo.com/gainers?e=us

I have searched for answers and tried several different methods for scraping tables from sites, but none of them works for me.

I tried:

library(xml2)
url <- "https://finance.yahoo.com/gainers?e=us"
tbl <- read_html(url)

and also:

 library(XML)
 url <- "https://finance.yahoo.com/gainers?e=us"
 tbl <- readHTMLList(url)

and other packages such as rvest, but I cannot get the table to show up!

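They store the data in a javascript block in the page itself. You have two choices: either use RSelenium and then deal with the very complex table you would have to unwind, or enlist the help of the V8 package and do some string surgery along with JSON processing: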

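library(V8)        # embedded javascript engine
library(xml2)      # read_html()
library(rvest)     # html_nodes(), html_text(), %>%
library(stringi)   # stri_replace_first_fixed(), stri_replace_last_fixed()
library(jsonlite)  # toJSON()
library(ndjson)    # flatten()
library(purrr)     # discard(), map_df()
library(dplyr)     # glimpse()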

pg <- read_html("https://finance.yahoo.com/gainers?e=us")

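The next bit extracts the data from the salient <script> tag. I am doing it positionally, which means this is the first potential problem should Yahoo! change the page format (for example, if VZ completes the purchase and useless Verizon Wireless ads get inserted via javascript). You could change it to look for a text indicator, but that leads to the second problem…

The second problem with "scraping" this way is that it will also break if the javascript that holds the data changes. But that is just as true for HTML-based scraping. Unlike the RSelenium route, this method does not require starting a web browser and a control server to drive it. Anyway…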

html_nodes(pg, "script")[13] %>% 
  html_text() %>% 
  stri_replace_first_fixed("(function (root) {", "var root = { App : {}};\n") %>% 
  stri_replace_last_fixed("}(this));", "") -> js
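The positional `[13]` lookup could be swapped for the "text indicator" search mentioned earlier. Here is a minimal sketch in base R, assuming the wrapper text `(function (root) {` is a usable marker. The helper name `find_script_by_marker` and the toy script contents are made up for illustration. On the real page you would pass in `html_text(html_nodes(pg, "script"))`.

```r
# Hypothetical helper: return the index of the first script whose text
# contains the marker string, or fail loudly if no script matches.
find_script_by_marker <- function(script_texts, marker) {
  hits <- which(grepl(marker, script_texts, fixed = TRUE))
  if (length(hits) == 0) {
    stop("no <script> block contains the marker: ", marker)
  }
  hits[[1]]
}

# Toy stand-ins for the text of each <script> node on a page
script_texts <- c(
  "window.ads = [];",
  "(function (root) { root.App = {main: {}}; }(this));",
  "console.log('done');"
)

idx <- find_script_by_marker(script_texts, "(function (root) {")
```

This still breaks if the wrapper text itself changes, but it survives extra <script> tags being inserted before the one you want.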

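The two string replacements are needed because the javascript code in the <script> tag expects to run in a browser (and we are not in one). Now we evaluate the big javascript chunk via V8 and extract the main data element: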

ctx <- v8()
ctx$eval(JS(js))
root <- ctx$get("root", flatten=TRUE)
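To see why the wrapper has to be rewritten, here is a small sketch on a toy script rather than the real Yahoo! payload. The `answer` field is invented for illustration. The browser-style wrapper hands `this` to the function as `root`. Replacing the opening line with a plain `var root` declaration, and dropping the closing call, leaves a global V8 can return to R.

```r
library(V8)
library(stringi)

# Toy script shaped like the page's wrapper (contents are invented)
toy_js <- "(function (root) {\n  root.App.main = { answer: 42 };\n}(this));"

# Same two replacements as in the answer: declare `root` ourselves,
# then drop the trailing immediately-invoked call.
toy_js <- stri_replace_first_fixed(toy_js, "(function (root) {", "var root = { App : {}};\n")
toy_js <- stri_replace_last_fixed(toy_js, "}(this));", "")

toy_ctx <- v8()
toy_ctx$eval(toy_js)
toy_root <- toy_ctx$get("root")  # comes back to R as a nested list
```

After this, `toy_root$App$main$answer` holds the value set inside the script.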

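That data element holds all of the data for the page (it is really a single-page application). So we have to find the data we care about, which is buried deep inside the nested javascript structure: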

quotes <- root$App$main$context$dispatcher$stores$`QuoteDataStore-Immutable`$quoteData
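A long `$` chain like this silently returns `NULL` when any level is renamed or missing. As a hedged sketch, a small base R helper (called `get_in` here, a name I made up) can walk the same path and make a missing level explicit. The nested list below is a toy stand-in for `root`.

```r
# Hypothetical helper: walk a nested list along a character vector of names,
# returning NULL as soon as a level is missing or is not a list.
get_in <- function(x, path) {
  for (key in path) {
    if (!is.list(x) || is.null(x[[key]])) return(NULL)
    x <- x[[key]]
  }
  x
}

# Toy structure mirroring the path used in the answer
toy_root <- list(App = list(main = list(context = list(dispatcher = list(
  stores = list(`QuoteDataStore-Immutable` = list(
    quoteData = list(RATE = list(symbol = "RATE"))
  ))
)))))

quote_path <- c("App", "main", "context", "dispatcher", "stores",
                "QuoteDataStore-Immutable", "quoteData")

toy_quotes <- get_in(toy_root, quote_path)
if (is.null(toy_quotes)) stop("quote data not found; the page layout may have changed")
```

`purrr::pluck()` offers similar behaviour if you would rather not write a helper.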

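There are many ways to turn nested list data into a nice rectangular data frame. The approach below converts each entry to JSON and then reads it back from JSON in a completely "flattened" form. Feel free to use any of the other methods you can find on SO.

The code starts by telling R to ignore the non-stocks (since you presumably only want the quotes that appear in the pretty table on that page):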

discard(names(quotes), ~grepl("[\\^\\=]", .)) %>% 
  map_df(~ndjson::flatten(toJSON(quotes[[.]]))) %>% 
  glimpse()

## Observations: 30
## Variables: 53
## $ averageDailyVolume3Month.fmt.0      <chr> "437,939", "541,801", "1.033M", "992,278", "1.40...
## $ averageDailyVolume3Month.longFmt.0  <chr> "437,939", "541,801", "1,033,453", "992,278", "1...
## $ averageDailyVolume3Month.raw.0      <dbl> 437939, 541801, 1033453, 992278, 1402175, 537906...
## $ exchange.0                          <chr> "NYQ", "NMS", "NYQ", "NGM", "NGM", "NMS", "NGM",...
## $ exchangeTimezoneName.0              <chr> "America/New_York", "America/New_York", "America...
## $ exchangeTimezoneShortName.0         <chr> "EST", "EST", "EST", "EST", "EST", "EST", "EST",...
## $ fiftyTwoWeekHigh.fmt.0              <chr> "15.36", "46.12", "15.82", "7.64", "5.97", "9.86...
## $ fiftyTwoWeekHigh.raw.0              <dbl> 15.360, 46.120, 15.820, 7.640, 5.970, 9.860, 5.0...
## $ fiftyTwoWeekHighChange.fmt.0        <chr> "-6.96", "-17.72", "-4.19", "-2.83", "-4.29", "-...
## $ fiftyTwoWeekHighChange.raw.0        <dbl> -6.960, -17.720, -4.190, -2.830, -4.290, -3.640,...
## $ fiftyTwoWeekHighChangePercent.fmt.0 <chr> "-45.31%", "-38.42%", "-26.49%", "-37.04%", "-71...
## $ fiftyTwoWeekHighChangePercent.raw.0 <dbl> -0.4531, -0.3842, -0.2649, -0.3704, -0.7186, -0....
## $ fiftyTwoWeekLow.fmt.0               <chr> "6.59", "23.75", "9.46", "4.22", "0.25", "4.97",...
## $ fiftyTwoWeekLow.raw.0               <dbl> 6.590, 23.750, 9.460, 4.220, 0.251, 4.970, 0.970...
## $ fiftyTwoWeekLowChange.fmt.0         <chr> "1.81", "4.65", "2.17", "0.59", "1.43", "1.25", ...
## $ fiftyTwoWeekLowChange.raw.0         <dbl> 1.810, 4.650, 2.170, 0.590, 1.429, 1.250, 0.190,...
## $ fiftyTwoWeekLowChangePercent.fmt.0  <chr> "27.47%", "19.58%", "22.94%", "13.98%", "569.32%...
## $ fiftyTwoWeekLowChangePercent.raw.0  <dbl> 0.2747, 0.1958, 0.2294, 0.1398, 5.6932, 0.2515, ...
## $ fullExchangeName.0                  <chr> "NYSE", "NasdaqGS", "NYSE", "NasdaqGM", "NasdaqG...
## $ gmtOffSetMilliseconds.0             <dbl> -1.8e+07, -1.8e+07, -1.8e+07, -1.8e+07, -1.8e+07...
## $ invalid.0                           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ isLoading.0                         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ language.0                          <chr> "en-US", "en-US", "en-US", "en-US", "en-US", "en...
## $ longName.0                          <chr> "Bankrate, Inc.", "Air Methods Corp.", "Bristow ...
## $ market.0                            <chr> "us_market", "us_market", "us_market", "us_marke...
## $ marketCap.fmt.0                     <chr> "758.268M", "1.082B", "407.786M", "494.03M", "33...
## $ marketCap.longFmt.0                 <chr> "758,267,968", "1,081,625,344", "407,786,176", "...
## $ marketCap.raw.0                     <dbl> 758267968, 1081625344, 407786176, 494030272, 338...
## $ marketState.0                       <chr> "CLOSED", "CLOSED", "CLOSED", "CLOSED", "CLOSED"...
## $ messageBoardId.0                    <chr> "finmb_30061", "finmb_24494", "finmb_292980", "f...
## $ quoteType.0                         <chr> "EQUITY", "EQUITY", "EQUITY", "EQUITY", "EQUITY"...
## $ regularMarketChange.fmt.0           <chr> "1.20", "3.65", "1.44", "0.59", "0.27", "0.80", ...
## $ regularMarketChange.raw.0           <dbl> 1.20, 3.65, 1.44, 0.59, 0.27, 0.80, 0.15, 0.04, ...
## $ regularMarketChangePercent.fmt.0    <chr> "16.67%", "14.75%", "14.13%", "13.98%", "19.15%"...
## $ regularMarketChangePercent.raw.0    <dbl> 16.6667, 14.7475, 14.1315, 13.9810, 19.1489, 14....
## $ regularMarketDayHigh.fmt.0          <chr> "9.90", "30.25", "12.60", "5.29", "1.85", "6.30"...
## $ regularMarketDayHigh.raw.0          <dbl> 9.9000, 30.2500, 12.6000, 5.2900, 1.8500, 6.2950...
## $ regularMarketDayLow.fmt.0           <chr> "7.90", "25.30", "9.46", "4.30", "1.33", "5.79",...
## $ regularMarketDayLow.raw.0           <dbl> 7.9000, 25.3000, 9.4600, 4.3000, 1.3300, 5.7900,...
## $ regularMarketPrice.fmt.0            <chr> "8.40", "28.40", "11.63", "4.81", "1.68", "6.22"...
## $ regularMarketPrice.raw.0            <dbl> 8.40, 28.40, 11.63, 4.81, 1.68, 6.22, 1.16, 1.20...
## $ regularMarketTime.fmt.0             <chr> "4:02PM EDT", "4:00PM EDT", "4:00PM EDT", "4:00P...
## $ regularMarketTime.raw.0             <dbl> 1478289722, 1478289600, 1478289614, 1478289600, ...
## $ regularMarketVolume.fmt.0           <chr> "2.383M", "3.32M", "3.411M", "4.567M", "4.232M",...
## $ regularMarketVolume.longFmt.0       <chr> "2,382,856", "3,320,072", "3,411,052", "4,566,79...
## $ regularMarketVolume.raw.0           <dbl> 2382856, 3320072, 3411052, 4566790, 4232135, 185...
## $ sharesOutstanding.fmt.0             <chr> "90.27M", "38.085M", "35.063M", "102.709M", "20....
## $ sharesOutstanding.longFmt.0         <chr> "90,270,000", "38,085,400", "35,063,300", "102,7...
## $ sharesOutstanding.raw.0             <dbl> 90270000, 38085400, 35063300, 102709000, 2016800...
## $ shortName.0                         <chr> "Bankrate, Inc. Common Stock", "Air Methods Corp...
## $ sourceInterval.0                    <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, ...
## $ symbol.0                            <chr> "RATE", "AIRM", "BRS", "CERS", "EBIO", "ELNK", "...
## $ uuid.0                              <chr> "79521cde-a3ef-383f-917d-31c49f9082f5", "f0432c1...
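The `discard()` step at the top of that pipeline drops every entry whose name contains `^` or `=`. A minimal base R sketch of the same filter follows. The sample names are illustrative, and it assumes the usual Yahoo! convention that index symbols start with `^` and currency pairs end in `=X`.

```r
# Illustrative names as they might appear in names(quotes)
symbols <- c("RATE", "^GSPC", "EURUSD=X", "AIRM")

# Same pattern as in the answer: a character class matching "^" or "="
is_non_stock <- grepl("[\\^\\=]", symbols)

stock_symbols <- symbols[!is_non_stock]
```

Only "RATE" and "AIRM" survive the filter here.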

I think you will still need to do some data wrangling (depending on what you want to do), but you now have the data you need.
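One possible cleanup, sketched in base R on a two-row toy data frame built from values in the output above: strip the `.0` suffix, keep `symbol` plus the numeric `.raw` columns, and convert the epoch seconds to a date-time. The choice of columns and the UTC time zone are assumptions for illustration, not part of the original answer.

```r
# Toy data frame with a few columns shaped like the flattened output
quotes_df <- data.frame(
  symbol.0 = c("RATE", "AIRM"),
  regularMarketPrice.raw.0 = c(8.40, 28.40),
  regularMarketPrice.fmt.0 = c("8.40", "28.40"),
  regularMarketChangePercent.raw.0 = c(16.6667, 14.7475),
  regularMarketTime.raw.0 = c(1478289722, 1478289600),
  stringsAsFactors = FALSE
)

# Drop the trailing ".0" that the flattening step adds to every name
names(quotes_df) <- sub("\\.0$", "", names(quotes_df))

# Keep the symbol and the numeric ".raw" columns, then drop the ".raw" suffix
keep <- names(quotes_df) == "symbol" | grepl("\\.raw$", names(quotes_df))
tidy <- quotes_df[, keep, drop = FALSE]
names(tidy) <- sub("\\.raw$", "", names(tidy))

# regularMarketTime is in seconds since the Unix epoch
tidy$regularMarketTime <- as.POSIXct(tidy$regularMarketTime,
                                     origin = "1970-01-01", tz = "UTC")
```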
