Scraping Facebook Messages from html files with rvest
It is possible to download a copy of your Facebook data archive, and it provides an html file for every individual chat you have. I would like to be able to get that into a dataframe for further analysis.
An example of one of the files looks like this:
and I have uploaded an example of that html file here: https://gist.githubusercontent.com/eldenvo/182efcd870f74d715b202f3ccdae335e/raw/1b53610459790489efb43ab6caa0f15103d391a1/facebook-message.html
My ideal would be to get the data into a dataframe with the columns: sender, message, time.
So using
library(rvest)
doc <- "https://gist.githubusercontent.com/eldenvo/182efcd870f74d715b202f3ccdae335e/raw/1b53610459790489efb43ab6caa0f15103d391a1/facebook-message.html"
doc %>% read_html()
returns
#> {xml_document}
#> <html>
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<base href="../">\n<style type="text/c ...
#> [2] <body>\n<a href="html/messages.htm">Back</a><br><br><div class="thread">Conversation with p1, p2<div class="message ..
And using the selector tool in Chrome to try and extract something more:
doc %>% read_html() %>% html_node(xpath = '/html/body/div/div[1]')
#> {xml_node}
#> <div class="message">
#> [1] <div class="message_header">\n<span class="user">p1</span><span class="meta">Monday, 19 March 2012 at 23:29 UTC</sp ...
or
doc %>% read_html() %>% html_node(xpath = '/html/body/div/p/text()') %>% html_text()
#> [1] "I didn't see your message before, i'm sorry that i didn't answer. Next time i promise !!"
I'm not very familiar with html or rvest, so I'm not sure about the best way to extract the full list of messages and associated info into a data.frame.
This article could help you a lot: https://blog.rstudio.com/2014/11/24/rvest-easy-web-scraping-with-r/
Especially the hint for http://selectorgadget.com, which makes it a lot easier to find appropriate tags to extract.
Your current example would work like this:
library(tidyverse)
library(rvest)
doc <- "https://gist.githubusercontent.com/eldenvo/182efcd870f74d715b202f3ccdae335e/raw/1b53610459790489efb43ab6caa0f15103d391a1/facebook-message.html"
pg <- doc %>% read_html()
We create a little helper to re-use a few times:
extract_nodes <- function(pg, css) {
pg %>%
html_nodes(css) %>%
html_text()
}
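As a quick sanity check, the helper can be tried on a small inline document; the snippet below is a made-up fragment that only mimics the structure of the Facebook export, not content from the actual file:

```r
library(rvest)

extract_nodes <- function(pg, css) {
  pg %>%
    html_nodes(css) %>%
    html_text()
}

# Made-up fragment mimicking the export's structure (not real data).
snippet <- read_html('
  <div class="message">
    <div class="message_header">
      <span class="user">p1</span>
      <span class="meta">Monday, 19 March 2012 at 23:29 UTC</span>
    </div>
  </div>')

extract_nodes(snippet, "span[class='user']")
#> [1] "p1"
```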
Next, we extract the relevant part about the date. After that, we need to process and parse it: I remove the beginning of the string ("Monday, ..."), and then it is just a matter of setting the correct parameters for parse_datetime, which can be found in its help file.
dates <- pg %>%
extract_nodes("span[class='meta']") %>%
str_replace("^.*,\\s", "") %>%
parse_datetime(format = "%d %B %Y %* %H:%M %*")
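To see what this step does, here is a hedged check on a single timestamp taken from the example file (the intermediate string is an assumption about what the cleaning produces):

```r
library(readr)
library(stringr)

raw <- "Monday, 19 March 2012 at 23:29 UTC"

# Drop the leading weekday, i.e. everything up to and including ", ".
cleaned <- str_replace(raw, "^.*,\\s", "")  # "19 March 2012 at 23:29 UTC"

# %* skips the non-digit filler ("at", "UTC") around the actual fields.
parse_datetime(cleaned, format = "%d %B %Y %* %H:%M %*")
#> [1] "2012-03-19 23:29:00 UTC"
```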
Once the dates are settled, we can easily parse messages and users:
result <- tibble(
user = extract_nodes(pg, "span[class='user']"),
dates = dates,
message = extract_nodes(pg, "p")
)
result
#> # A tibble: 4 x 3
#> user dates
#> <chr> <dttm>
#> 1 p1 2012-03-19 23:29:00
#> 2 p2 2012-03-19 15:39:00
#> 3 p1 2012-03-19 08:34:00
#> 4 p1 2012-03-18 20:24:00
#> # ... with 1 more variables: message <chr>