
Reading XML data into R from an HTML source

I'd like to import data into R from a given webpage, say this one.

In the source code (but not on the rendered page), the data I'd like to get is stored in a single line of JavaScript code which starts like this:

chart_Line1.setDataXML("<graph rotateNames (stuff omitted) >
<set  value='699.99' name='16.02.2013'  />
<set  value='731.57' name='18.02.2013'  />  
<set  value='more values' name='more dates'  />
...
<trendLines> (now a different command starts, stuff omitted)
</trendLines></graph>")

(Note that I've added line breaks for readability; in the original file the data is on one single line. It would suffice to import only the line which starts with chart_Line1.setDataXML. It's line 56 in the source if you want to have a look yourself.)

I can read the whole HTML file into a string using scan("URLofFile", what="raw"), but how do I extract the data from this?

Can I specify the data format with what="...", keeping in mind that there are no line breaks to separate the data, but several line breaks in the irrelevant prefix and suffix?

Is this something that can be done cleanly with R tools, or would you suggest doing the data acquisition with a different script?
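One way to pick out the relevant line is by its content rather than its position. The following is only a minimal sketch in base R: the `page` vector is made-up sample data standing in for the real page source (which would come from something like readLines("URLofFile")), and `idx` and `data_line` are illustrative names.

```r
# Made-up stand-in for the page source, one element per line.
# In a real run this would be the result of readLines("URLofFile").
page <- c(
  "<html><head>",
  "<script>var x = 1;</script>",
  "chart_Line1.setDataXML(\"<graph><set  value='699.99' name='16.02.2013'  /><set  value='731.57' name='18.02.2013'  /></graph>\")",
  "</head></html>"
)

# fixed = TRUE treats the pattern as a literal string,
# so "." and "(" do not need escaping
idx <- grep("chart_Line1.setDataXML(", page, fixed = TRUE)

# Keep only the line that carries the chart data
data_line <- page[idx]
```

Selecting by content means the code does not depend on the data staying at a fixed line number in the source.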

With some trial and error, I was able to find the exact line that contains the data. I read the whole HTML file and then discard all other lines.

library(zoo)
library(stringr)
# read the HTML source line by line and keep only the line that holds the data
theurl <- "https://www.magickartenmarkt.de/Black_Lotus_Unlimited.c1p5093.prod"
sec <- scan(file = theurl, what = "character", sep = "\n")
# index found by trial and error; scan() drops blank lines by default,
# so this need not equal the line number in the raw source
sec <- sec[45]
# extract all strings of the form "value='X'", where X has 1 to 3 digits, a decimal point and 2 decimal places
values <- str_extract_all(sec, "value='[0-9]{1,3}\\.[0-9]{2}'")
# strip everything except digits and the decimal point
values <- str_replace_all(unlist(values), "[^0-9.]", "")
# get all dates of the form "name='DD.MM.YYYY'"
dates <- str_extract_all(sec, "name='[0-9]{2}\\.[0-9]{2}\\.[0-9]{4}'")
# strip everything except digits and the dots between them
dates <- str_replace_all(unlist(dates), "[^0-9.]", "")
# convert the dates to Date objects
dates <- as.Date(dates, format = "%d.%m.%Y")
# build an ordered time series, converting the values from character to numeric first
MyZoo <- zoo(as.numeric(values), dates)
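As a variation, the value and date can be captured together from each <set .../> element, so the two vectors cannot drift out of step if one attribute fails to match. This is only a sketch in base R under stated assumptions: the `sec` string below is made-up sample data modelled on the snippet in the question, and `sets`, `parts` and `prices` are illustrative names.

```r
# Made-up sample line modelled on the snippet in the question
sec <- "chart_Line1.setDataXML(\"<graph><set  value='699.99' name='16.02.2013'  /><set  value='731.57' name='18.02.2013'  /><trendLines></trendLines></graph>\")"

# One string per <set .../> element
sets <- regmatches(sec, gregexpr("<set[^>]*>", sec))[[1]]

# regexec() returns the full match followed by each parenthesised group,
# so element 2 is the value and element 3 is the date
parts <- regmatches(
  sets,
  regexec("value='([0-9.]+)'[[:space:]]+name='([0-9.]+)'", sets)
)
values <- as.numeric(vapply(parts, `[`, character(1), 2))
dates <- as.Date(vapply(parts, `[`, character(1), 3), format = "%d.%m.%Y")

# Keep each date next to its value
prices <- data.frame(date = dates, value = values)
```

The resulting data frame can be passed on to zoo(prices$value, prices$date) if a time series object is wanted. Unlike the pattern above, [0-9.]+ places no limit on the number of digits.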
