
lapply and read_xml.character

I am trying to extract data from a website using a custom function:

library(tidyverse)
library(rvest)
url = "https://www.boerse.de/fundamental-analyse/garbage/" # last part does not change outcome, therefore 'garbage'
read_html_tables = function(ISIN){
  content <- read_html(paste0(url,ISIN,"#guv")) %>%
    html_table(dec = ",") %>%
    .[c(5:10)]
  return(content)
}

If I run this function with a single ISIN, e.g. US88579Y1010, I get the desired result: a list containing 6 tibbles with the data I want. But if I wrap this function in lapply() with a vector containing a few hundred ISINs, I get the following error:

list_of_all <- lapply(X = df[,2], FUN = read_html_tables)

Error: x must be a string of length 1
Called from: read_xml.character(x, encoding = encoding, ..., as_html = TRUE, options = options)

If I call which(length(df[,2]) != 1) on the column where the ISINs are, I get integer(0), so there seems to be no issue with the ISIN column in this data frame. And since it works with a single ISIN as input, the read_html(paste0(url,ISIN)) part seems to work as well.

I have used a very similar function before and wrapped it in lapply(). The earlier function did essentially what this one does, but had to do some searching and combining to build the correct URL to pass into the read_html(paste0(url,ISIN)) part (on another website). I am a bit puzzled, since this error did not occur before. But now that it has occurred, running the earlier function gives me the same error (which I never received before).

Maybe there is a more talented R programmer out there who can spot the issue?

Edit: Since a reply suggested the ISIN list is the issue: the first two are US88579Y1010 and US8318652091. Passing them individually into the function works, and so does putting them in a vector ( c(ISIN1, ISIN2) ) and passing that vector to lapply(). But if I point at both ISINs inside the tibble ( df[1:2,2] ), I get the error from above. What am I missing here?

Solution: read_xml.character, called by read_html(), seems not to accept a column subset from a tibble as valid input. Converting the tibble to a data.frame and rerunning gives the desired output.
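A likely explanation, offered as a sketch and not as a confirmed diagnosis: subsetting a tibble with df[,2] always returns a one-column tibble, never a plain vector. lapply() then iterates over the columns of that tibble, so the function is called once with the entire ISIN vector, and paste0() hands read_html() a character vector of length greater than 1. The example below reproduces this without any network access. It uses a base data.frame with drop = FALSE as a stand-in for tibble subsetting, and arg_length() is a hypothetical placeholder for read_html_tables(); the column names are made up for illustration.

```r
# Stand-in for the ISIN table; the column names are hypothetical.
df <- data.frame(
  id   = 1:2,
  ISIN = c("US88579Y1010", "US8318652091"),
  stringsAsFactors = FALSE
)

# A tibble never drops a single column to a vector.
# drop = FALSE makes a base data.frame behave the same way.
as_table  <- df[, 2, drop = FALSE]  # one-column data frame
as_vector <- df[[2]]                # plain character vector

# Hypothetical placeholder for read_html_tables():
# it only reports how many strings it received per call.
arg_length <- function(ISIN) length(ISIN)

# lapply() over a data frame walks its columns: one call, whole column.
lengths_from_table <- lapply(as_table, arg_length)

# lapply() over a vector walks its elements: one call per ISIN.
lengths_from_vector <- lapply(as_vector, arg_length)

# The original check is not diagnostic: the length of a data frame
# is its number of columns, which is 1 here.
n_checked <- length(as_table)
```

Under this reading, the original function should also work on the tibble directly if the column is extracted as a vector first, for example lapply(df[[2]], read_html_tables) or lapply(dplyr::pull(df, 2), read_html_tables). Converting to a data.frame works for the same reason: df[,2] on a data.frame drops to a vector by default.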
