Web scraping and parsing HTML in R
I'm trying to parse this webpage into a data frame, but I keep getting stuck: the XML package tells me the content is not XML.
I would like to convert the text below into a table/data.frame. What is the easiest way to do this once I have fetched the URL text and run htmlParse on it?
doc = getURL("http://m.racingpost.com/card/blocks.sd?race_id=first&r_date=2015-03-28&tab=card&view=meetings&blocks=cards-list&_=1427439140572")
doc = htmlParse(doc, asText = TRUE)
The URL returns JSON rather than HTML or XML, which is why the XML package rejects it. You can parse it with any of several R packages, such as RJSONIO, rjson, and jsonlite:
library(jsonlite)
appURL <- "http://m.racingpost.com/card/blocks.sd?race_id=first&r_date=2015-03-28&tab=card&view=meetings&blocks=cards-list&_=1427439140572"
appDATA <- fromJSON(appURL)
appITEMS <- appDATA[["cards-list"]][["items"]]
> appITEMS$c1083
$abandonedCount
[1] 0
$crsName
[1] "Chelmsford (AW)"
$crsAbbr
[1] "Cfd"
$isForeign
[1] ""
$races
      id                                                           title distance cls crsId time       date
1 620151        Buy Online At chelmsfordcityracecourse.com Maiden Stakes       1m   4  1083 2:20 2015-03-28
2 620152 Dubai World Cup toteplacepot Today Maiden Stakes (Plus 10 Race)       5f   4  1083 2:55 2015-03-28
3 620153                                  £1 Million totescoop6 Handicap       5f   2  1083 3:30 2015-03-28
4 620154                                toteexacta Pick The 1,2 Handicap       6f   4  1083 4:05 2015-03-28
5 620155               totetrifecta Pick The 1,2,3 Handicap (Bobis Race)       1m   3  1083 4:40 2015-03-28
6 620156                                               totepool Handicap     1m2f   2  1083 5:15 2015-03-28
7 620157                                  Madness Live 3rd June Handicap     1m2f   4  1083 5:50 2015-03-28
   timestamp raceGroup hCount abandoned videoId    going offers
1 1427552400                8             57049 Standard   NULL
2 1427554500                5             57050 Standard   NULL
3 1427556600  Handicap     12             57051 Standard   NULL
4 1427558700  Handicap      7             57052 Standard   NULL
5 1427560800  Handicap      8             57053 Standard   NULL
6 1427562900  Handicap      7             57054 Standard   NULL
7 1427565000  Handicap      6             57055 Standard   NULL
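The timestamp column in the output above looks like Unix epoch seconds. That is an assumption based on the values, not something the feed documents, but it is consistent with the printed times (1427552400 works out to 14:20 UTC on 2015-03-28, matching the 2:20 race). A minimal sketch of the conversion, using values copied from the output rather than a live request:

```r
# Values copied from the timestamp column in the output above.
timestamps <- c(1427552400, 1427554500, 1427556600)

# Assumption: these are seconds since 1970-01-01 00:00:00 UTC.
race_times <- as.POSIXct(timestamps, origin = "1970-01-01", tz = "UTC")

format(race_times, "%Y-%m-%d %H:%M")
# [1] "2015-03-28 14:20" "2015-03-28 14:55" "2015-03-28 15:30"
```

If fromJSON gives you the column as character rather than numeric, wrap it in as.numeric() first.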
The data is not returned as a single table, but you can work with the individual items to fit your needs. Helpfully, the jsonlite package already returns tabular structures (data frames) where appropriate, such as the races element above.
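If what you want is one data frame covering every course, one possible approach is to take the races data frame from each item and stack them. The sketch below uses a small inline JSON string so that it runs without network access. Only the c1083 entry and its field names come from the output above; the trimmed structure and the second course (c9999, "Example Course") are invented for illustration.

```r
library(jsonlite)

# Hypothetical, heavily trimmed stand-in for the real response.
# c9999 / "Example Course" is made up; c1083 mirrors the output above.
sample_json <- '{
  "cards-list": {
    "items": {
      "c1083": {
        "crsName": "Chelmsford (AW)",
        "races": [
          {"id": 620151, "distance": "1m", "time": "2:20"},
          {"id": 620153, "distance": "5f", "time": "3:30"}
        ]
      },
      "c9999": {
        "crsName": "Example Course",
        "races": [
          {"id": 1, "distance": "6f", "time": "1:00"}
        ]
      }
    }
  }
}'

appDATA  <- fromJSON(sample_json)
appITEMS <- appDATA[["cards-list"]][["items"]]

# Keep a few plain columns from each course's races data frame and
# tag every row with the course name.
per_course <- lapply(appITEMS, function(item) {
  races <- item$races[, c("id", "distance", "time")]
  races$crsName <- item$crsName
  races
})

# Stack the per-course data frames into a single data frame.
all_races <- do.call(rbind, per_course)
rownames(all_races) <- NULL
all_races
```

Selecting a fixed set of simple columns before calling rbind is deliberate: in the real response some columns (offers, for example, which prints as NULL) may not be plain vectors, and they would need separate handling.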