Reading XML data into R from an HTML source

Question

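I would like to import data from a given web page into R, say this one.

In the source code (but not on the rendered page), the data I want is stored in a single line of JavaScript code that starts like this: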

chart_Line1.setDataXML("<graph rotateNames (stuff omitted) >
<set  value='699.99' name='16.02.2013'  />
<set  value='731.57' name='18.02.2013'  />  
<set  value='more values' name='more dates'  />
...
<trendLines> (now a different command starts, stuff omitted)
</trendLines></graph>")

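(Note that I have added line breaks for readability; in the original file the data is on a single line. It would be enough to import only the line that starts with chart_Line1.setDataXML. It is line 56 of the source code, if you want to have a look yourself.)

I can read the whole HTML file into a string using scan("URLofFile", what="raw"), but how do I extract the data from it?

Can I specify the data format with what="...", keeping in mind that there are no line breaks separating the data, but several line breaks in the irrelevant prefix and suffix?

Is this something that can be done nicely with R tools, or would you suggest doing this data acquisition with a different script?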

Answer 1

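After some trial and error, I was able to find the exact line that contains the data. I read the whole HTML file and then discard all the other lines.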

library(zoo)
library(stringr)
# Get the HTML source and discard all lines but the interesting one.
theurl <- "https://www.magickartenmarkt.de/Black_Lotus_Unlimited.c1p5093.prod"
sec <- scan(file = theurl, what = "character", sep = "\n")
# Index found by trial and error. scan() skips blank lines by default,
# so it need not equal the line number in the raw source.
sec <- sec[45]
# Extract all strings of the form "value='X'", where X is a number with
# 1 to 3 digits, a dot, and 2 decimal places. The dot is escaped so that
# it matches only a literal "." and not any character.
values <- str_extract_all(sec, "value='[0-9]{1,3}\\.[0-9]{2}'")
# Strip everything except digits and the decimal point.
values <- str_replace_all(unlist(values), "[^0-9.]", "")
# Extract all dates of the form "name='DD.MM.YYYY'".
dates <- str_extract_all(sec, "name='[0-9]{2}\\.[0-9]{2}\\.[0-9]{4}'")
# Strip everything except digits and the dots.
dates <- str_replace_all(unlist(dates), "[^0-9.]", "")
# Convert the dates to Date objects.
dates <- as.Date(dates, format = "%d.%m.%Y")
# Combine values and dates into an ordered series, converting the values
# from character to numeric first.
MyZoo <- zoo(as.numeric(values), dates)
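
The fixed index above depends on the exact layout of the page. As a possible variation (a minimal sketch, not the answerer's method), the data line can be located by its content and the two attributes captured together using base R only. The vector page_lines below is a hypothetical stand-in for the downloaded page: the two <set> entries are the ones quoted in the question, while the surrounding lines and the names page_lines, set_pattern and prices are made up for illustration. It also assumes that value always precedes name inside each <set> element, as in the excerpt.

```r
# Stand-in for the lines of the downloaded page; on the real page these
# would come from something like readLines(theurl). Only the two <set>
# entries are taken from the question, the rest is illustrative.
page_lines <- c(
  "<html><head>",
  "<script type='text/javascript'>",
  "chart_Line1.setDataXML(\"<graph><set  value='699.99' name='16.02.2013'  /><set  value='731.57' name='18.02.2013'  /><trendLines></trendLines></graph>\")",
  "</script></head></html>"
)

# Locate the data line by its content rather than by a fixed index.
data_line <- grep("chart_Line1.setDataXML", page_lines,
                  fixed = TRUE, value = TRUE)[1]

# Match every <set ... /> element and capture both attribute values.
set_pattern <- "<set\\s+value='([^']*)'\\s+name='([^']*)'\\s*/>"
set_tags <- regmatches(data_line,
                       gregexpr(set_pattern, data_line, perl = TRUE))[[1]]

# Back-references \\1 and \\2 pick the value and the date out of each tag.
prices <- data.frame(
  date  = as.Date(sub(set_pattern, "\\2", set_tags, perl = TRUE),
                  format = "%d.%m.%Y"),
  value = as.numeric(sub(set_pattern, "\\1", set_tags, perl = TRUE))
)
```

If a zoo series is wanted, as in the answer, zoo(prices$value, prices$date) should give the same kind of object as MyZoo.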

Answer 1 (accepted), 2014-02-09 21:56:59