
Scraping Facebook Messages from html files with rvest

It is possible to download a copy of your Facebook data archive, which provides an html file for every individual chat you have had. I would like to be able to get that into a dataframe for further analysis.

An example of one of the files looks like this:

and I have uploaded an example of that html file here: https://gist.githubusercontent.com/eldenvo/182efcd870f74d715b202f3ccdae335e/raw/1b53610459790489efb43ab6caa0f15103d391a1/facebook-message.html

Ideally I would like to get the data into a dataframe with the columns: sender, message, time.

So using

library(rvest)

doc <- "https://gist.githubusercontent.com/eldenvo/182efcd870f74d715b202f3ccdae335e/raw/1b53610459790489efb43ab6caa0f15103d391a1/facebook-message.html"
doc %>% read_html()

returns

#> {xml_document}
#> <html>
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<base href="../">\n<style type="text/c ...
#> [2] <body>\n<a href="html/messages.htm">Back</a><br><br><div class="thread">Conversation with p1, p2<div class="message ..

And using the selector tool in Chrome to try and extract something more:

doc %>% read_html() %>% html_node(xpath = '/html/body/div/div[1]')
#> {xml_node}
#> <div class="message">
#> [1] <div class="message_header">\n<span class="user">p1</span><span class="meta">Monday, 19 March 2012 at 23:29 UTC</sp ...

or

doc %>% read_html() %>% html_node(xpath = '/html/body/div/p/text()') %>% html_text()

#> [1] "I didn't see your message before, i'm sorry that i didn't answer. Next time i promise !!"

I'm not very familiar with html or rvest, so I'm not sure about the best way to extract the full list of messages and associated info into a data.frame.

This article could help you a lot: https://blog.rstudio.com/2014/11/24/rvest-easy-web-scraping-with-r/

Especially the hint for http://selectorgadget.com, which makes it a lot easier to find appropriate tags to extract.
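For a page like this one, SelectorGadget would point you at short class selectors such as `.user` and `.meta`. A minimal sketch (assuming the gist above is still reachable) showing that these are equivalent to the attribute selectors used further down:

```r
library(rvest)

pg <- read_html("https://gist.githubusercontent.com/eldenvo/182efcd870f74d715b202f3ccdae335e/raw/1b53610459790489efb43ab6caa0f15103d391a1/facebook-message.html")

# ".user" selects the same nodes as "span[class='user']",
# ".meta" the same as "span[class='meta']"
pg %>% html_nodes(".user") %>% html_text()
pg %>% html_nodes(".meta") %>% html_text()
```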

Your current example would work like this:

library(tidyverse)
library(rvest)

doc <-  "https://gist.githubusercontent.com/eldenvo/182efcd870f74d715b202f3ccdae335e/raw/1b53610459790489efb43ab6caa0f15103d391a1/facebook-message.html"

pg <- doc %>% read_html()

We create a little helper to re-use a few times:

extract_nodes <- function(pg, css) {
  pg %>%
    html_nodes(css) %>%
    html_text()
}

Next, we extract the relevant part containing the date. After that, we need to process and parse it. I remove the beginning of the string ("Monday, ..."); afterwards it is just a matter of setting the correct parameters for parse_datetime, which can be found in its help file.

dates <- pg %>%
  extract_nodes("span[class='meta']") %>%
  str_replace("^.*,\\s", "") %>%
  parse_datetime(format = "%d %B %Y %* %H:%M %*")
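To see what the format string is doing, here is a quick demo on one of the timestamps from the file (after the weekday prefix has been stripped). Note that in readr's format syntax, `%*` skips any number of non-digit characters, which conveniently swallows the "at" and the trailing "UTC":

```r
library(readr)  # parse_datetime; also loaded via tidyverse

# "%d %B %Y" matches "19 March 2012", "%*" skips "at",
# "%H:%M" matches "23:29", and the final "%*" skips " UTC"
parse_datetime("19 March 2012 at 23:29 UTC",
               format = "%d %B %Y %* %H:%M %*")
#> [1] "2012-03-19 23:29:00 UTC"
```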

Once the dates are settled, we can easily parse messages and users:

result <- tibble(
  user = extract_nodes(pg, "span[class='user']"),
  dates = dates,
  message = extract_nodes(pg, "p")
)
result
#> # A tibble: 4 x 3
#>    user               dates
#>   <chr>              <dttm>
#> 1    p1 2012-03-19 23:29:00
#> 2    p2 2012-03-19 15:39:00
#> 3    p1 2012-03-19 08:34:00
#> 4    p1 2012-03-18 20:24:00
#> # ... with 1 more variables: message <chr>
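One caveat: this relies on every message header being followed by exactly one `<p>`, so that the three column vectors line up. In the export format, each `<div class="message">` is followed by a sibling `<p>` holding the text, so a more defensive (hypothetical) sketch walks the messages one by one and would keep sender and text paired even if a `<p>` were missing:

```r
library(rvest)
library(purrr)

# For each message header, grab the nearest following sibling <p>.
# If none exists, html_node() returns a missing node and html_text()
# yields NA for that row instead of shifting the whole column.
msg_nodes <- pg %>% html_nodes(xpath = "//div[@class='message']")
texts <- map_chr(msg_nodes, function(node) {
  html_text(html_node(node, xpath = "following-sibling::p[1]"))
})
```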
