
Web scraping and parsing HTML in R

I'm trying to parse this webpage into a data frame, but I keep getting stuck: the XML package tells me the content is not XML.

I would like to take the text below and convert it into a table/data.frame. What is the easiest way to do this once I have fetched the URL text and run htmlParse on it?

library(RCurl)  # provides getURL
library(XML)    # provides htmlParse
doc = getURL("http://m.racingpost.com/card/blocks.sd?race_id=first&r_date=2015-03-28&tab=card&view=meetings&blocks=cards-list&_=1427439140572")
doc = htmlParse(doc, asText = TRUE)

The URL returns JSON rather than HTML or XML. You can parse it with any of several R packages, including RJSONIO, rjson, and jsonlite:

library(jsonlite)
appURL <- "http://m.racingpost.com/card/blocks.sd?race_id=first&r_date=2015-03-28&tab=card&view=meetings&blocks=cards-list&_=1427439140572"
appDATA <- fromJSON(appURL)
appITEMS <- appDATA[["cards-list"]][["items"]]
> appITEMS$c1083
$abandonedCount
[1] 0

$crsName
[1] "Chelmsford (AW)"

$crsAbbr
[1] "Cfd"

$isForeign
[1] ""

$races
      id                                                           title distance cls crsId time       date
1 620151        Buy Online At chelmsfordcityracecourse.com Maiden Stakes       1m   4  1083 2:20 2015-03-28
2 620152 Dubai World Cup toteplacepot Today Maiden Stakes (Plus 10 Race)       5f   4  1083 2:55 2015-03-28
3 620153                            &pound;1 Million totescoop6 Handicap       5f   2  1083 3:30 2015-03-28
4 620154                                toteexacta Pick The 1,2 Handicap       6f   4  1083 4:05 2015-03-28
5 620155               totetrifecta Pick The 1,2,3 Handicap (Bobis Race)       1m   3  1083 4:40 2015-03-28
6 620156                                               totepool Handicap     1m2f   2  1083 5:15 2015-03-28
7 620157                                  Madness Live 3rd June Handicap     1m2f   4  1083 5:50 2015-03-28
   timestamp raceGroup hCount abandoned videoId    going offers
1 1427552400                8             57049 Standard   NULL
2 1427554500                5             57050 Standard   NULL
3 1427556600  Handicap     12             57051 Standard   NULL
4 1427558700  Handicap      7             57052 Standard   NULL
5 1427560800  Handicap      8             57053 Standard   NULL
6 1427562900  Handicap      7             57054 Standard   NULL
7 1427565000  Handicap      6             57055 Standard   NULL

The data is not returned as a single table, but you can work with the individual "items" to fit your needs. Helpfully, jsonlite also converts nested records into data frames where appropriate, as with the races element above.
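If you want every course's races in one data frame, something along these lines may work. This is only a sketch: it uses a small inline JSON string that imitates the structure shown above instead of calling the live URL, the second course (c9999, "Example Park") and its race are invented for illustration, and it assumes every races element has the same columns. The names per_course, all_races and start are my own.

```r
library(jsonlite)

# Hypothetical, trimmed-down payload imitating the structure of the real response.
# The c9999 / "Example Park" entry is made up for this example.
sample_json <- '{
  "cards-list": {
    "items": {
      "c1083": {
        "crsName": "Chelmsford (AW)",
        "races": [
          {"id": 620151, "distance": "1m", "time": "2:20", "timestamp": 1427552400},
          {"id": 620152, "distance": "5f", "time": "2:55", "timestamp": 1427554500}
        ]
      },
      "c9999": {
        "crsName": "Example Park",
        "races": [
          {"id": 700001, "distance": "6f", "time": "1:45", "timestamp": 1427550300}
        ]
      }
    }
  }
}'

appDATA  <- fromJSON(sample_json)
appITEMS <- appDATA[["cards-list"]][["items"]]

# One data frame per course, with the course name added as a column
per_course <- lapply(appITEMS, function(item) {
  data.frame(crsName = item$crsName, item$races, stringsAsFactors = FALSE)
})

# Stack the per-course data frames (assumes identical columns in each)
all_races <- do.call(rbind, per_course)
rownames(all_races) <- NULL

# Convert the Unix timestamp (seconds since 1970-01-01) to a date-time
all_races$start <- as.POSIXct(all_races$timestamp, origin = "1970-01-01", tz = "UTC")

print(all_races)
```

With the real response you would replace sample_json with appURL. If the courses do not all share the same columns (or contain list columns such as offers), rbind may fail, and you would need to select a common set of columns first.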
