Coercing HTML to a Dataframe

Question

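I am trying to web-scrape news data into R. I want to search the HTML I downloaded for a keyword in each line/row. So if a line in the page starts with "", I want the number of that line/row, and then I want to isolate that line/row.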

library(rvest)
googlenews<- html("https://news.google.com/")
grep("</div",googlenews)
**Error in as.vector(x, "character") : 
  cannot coerce type 'externalptr' to vector of type 'character'**


as.data.frame(googlenews)

Error in as.data.frame.default(googlenews) : 
  c("cannot coerce class \"c(\"HTMLInternalDocument\", \"HTMLInternalDocument\", \"XMLInternalDocument\", \" to a data.frame", "cannot coerce class \"\"XMLAbstractDocument\")\" to a data.frame")

How can I coerce the html object to a data frame?

Answer 1

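The main problem here is that you are treating the result of html() (or read_html()) as if it were a simple character vector that you can run grep() on. It is not: as the first error message shows, what you get back is a parsed document object backed by an external pointer.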

If you want to use rvest's functionality, go through html_nodes() and html_text():

googlenews <- read_html("https://news.google.com/")
nodes <- html_nodes(googlenews, "div")
html_text( nodes )

...and if you want to process the HTML file as plain text, use the following:

googlenews <- readLines("https://news.google.com/")
grep("</div",googlenews)
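As a rough sketch of the plain-text route, the example below finds the line numbers of matching lines and then isolates those lines. The `page_lines` vector is made-up HTML standing in for what readLines() would return, so the example runs without network access. The `^` anchor is an assumption based on the question's "starts with" wording.

```r
# Made-up HTML lines standing in for readLines("https://news.google.com/")
# (an assumption, so the sketch runs offline).
page_lines <- c(
  "<html>",
  "<body>",
  "<div class=\"story\">",
  "Headline one",
  "</div>",
  "</body>",
  "</html>"
)

# Line numbers of the lines that start with "</div";
# the ^ anchors the pattern at the beginning of each line.
hits <- grep("^</div", page_lines)

# Isolate the matching lines themselves by indexing with those numbers.
matched <- page_lines[hits]

print(hits)
print(matched)
```

Dropping the `^` would match `</div` anywhere in a line, which is what the original grep() call does.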

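For something like as.data.frame(googlenews) to work, someone would have to have written a function that converts one class into the other. For the tree representation you get from rvest this is not trivial, so no such function exists. rvest has excellent package vignettes, examples, and blog posts; you really should have a look at those.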
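Since no ready-made conversion exists, one option is to build the data frame by hand from the pieces you extract. The following is a minimal sketch of that idea, not a general converter. The inline HTML, the column names, and the `news_df` variable are assumptions made for illustration, and which fields to extract depends on the page you are scraping.

```r
library(rvest)

# Made-up HTML standing in for the downloaded page (an assumption for this sketch).
doc <- read_html('<html><body>
  <div class="story">First headline</div>
  <div class="story">Second headline</div>
  <div class="footer">About</div>
</body></html>')

# Select every <div> node in the parsed document.
nodes <- html_nodes(doc, "div")

# Build the data frame manually: one row per <div>, one column per extracted field.
news_df <- data.frame(
  class = html_attr(nodes, "class"),
  text  = html_text(nodes),
  stringsAsFactors = FALSE
)

print(news_df)
```

Each column is an ordinary character vector, so grep() and the usual data frame tools work on the result.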

Solution 1
1 2015-12-15 22:39:17
