Aggregate zoo time series of tweets from multiple accounts
I'm new to R, and to time series data in particular, and I've confused myself to a standstill trying to aggregate or bin a zoo object.
Can anyone help me out?
I have a number of data frames, one per Twitter account, each giving the creation date and ID of that account's tweets:
str(temp)
'data.frame': 1528 obs. of 2 variables:
$ id_str : chr "605698007263260672" "605681239408963584" "603854670856069120" "601792133297786880" ...
$ created_at: POSIXct, format: "2015-06-02 12:30:32" "2015-06-02 11:23:55" "2015-05-28 10:25:47" "2015-05-22 17:49:59" ...
I don't know how frequent the tweets were (the spacing between creation date values), but I need to create a dataset that looks like this:
TimeSeries  AccountName  NumOfTweets
2010-01     MyTweeter    45
2010-02     YourTweeter  5
I would like to group the tweets by the week or month they were created, count how many fall in each period, and plot the counts to show how the accounts compare in number of tweets and sustained activity since records began.
Any advice on how to merge or join the time series so I can plot them with time on the x-axis and the number of tweets on the y-axis?
A random sample of observations, taken using select_n() and provided below using dput:
dput(sample.df)
structure(list(id_str = c("235710687006035968", "148522094328680448",
"555743466945523712", "139818931253813249", "601792133297786880",
"391194341978669057", "455754624859779072", "139640022696603648",
"182085980864528384", "372375117130526720"), created_at = structure(c(1345032781,
1324245401, 1421334542, 1322170405, 1432313399, 1382102973, 1397495344,
1322127750, 1332247655, 1377616120), class = c("POSIXct", "POSIXt"
), tzone = "")), .Names = c("id_str", "created_at"), row.names = c(882L,
1363L, 33L, 1478L, 4L, 536L, 180L, 1489L, 1116L, 635L), class = "data.frame")
I have an example of the desired output, but I need help calculating the aggregate and merging the multiple data frames (one per account) into a suitable final data structure for plotting.
Does this resemble what you are looking for? First, convert created_at to months and count the observations (tweets) by ID and month:
# df is assumed to be the sample data frame from the question (sample.df)
# Resample to get some counts > 1 and several observations per ID
set.seed(123)
df2 <- data.frame(sample(df$id_str, size = 50, replace = TRUE),
                  sample(df$created_at, size = 50, replace = TRUE))
colnames(df2) <- colnames(df)
# Convert to months
df2$Month <- strftime(df2$created_at, format = "%Y-%m")
result <- aggregate(df2$id_str, by = list(df2$id_str, df2$Month), FUN = length)
colnames(result) <- c("ID", "Month", "nTweets")
head(result)
# ID Month nTweets
# 1 139640022696603648 2011-11 1
# 2 139818931253813249 2011-11 1
# 3 148522094328680448 2011-11 1
# 4 182085980864528384 2011-11 2
# 5 391194341978669057 2011-11 1
# 6 455754624859779072 2011-11 2
Then you can plot the result, for example using ggplot:
library(ggplot2)
ggplot(result, aes(x = Month, y = nTweets, group = ID, color = ID)) +
geom_line(size = 2)
Note that the x-axis is not correctly spaced here because some months have no observations. I suppose this is not the case for the complete data.
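If the complete data does turn out to have empty months, one way to get an evenly spaced axis is to build the full sequence of months and fill the gaps with zero counts. The following is a minimal base-R sketch for a single account; the data frame `counts`, its values, and the column `MonthStart` are made up for illustration and are not part of the answer above.

```r
# Hypothetical monthly counts for one account, with a gap
# (there are no rows for 2015-02 or 2015-03)
counts <- data.frame(
  ID      = "MyTweeter",
  Month   = c("2015-01", "2015-04", "2015-05"),
  nTweets = c(3L, 1L, 4L),
  stringsAsFactors = FALSE
)

# Turn "YYYY-MM" into a real Date (first day of the month)
counts$MonthStart <- as.Date(paste0(counts$Month, "-01"))

# Build every month between the first and last observation
all_months <- data.frame(
  MonthStart = seq(min(counts$MonthStart), max(counts$MonthStart), by = "month")
)

# Left join onto the full calendar; months without tweets come back as NA
filled <- merge(all_months, counts[, c("MonthStart", "nTweets")],
                by = "MonthStart", all.x = TRUE)
filled$nTweets[is.na(filled$nTweets)] <- 0L
filled$ID <- "MyTweeter"
filled
```

Because `MonthStart` is a Date rather than a character string, using it as the x aesthetic in ggplot should give a continuous, correctly spaced time axis. For several accounts the same fill would have to be repeated per ID.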
Following Khl4v's code and a bit of trial and error:
First, convert the character column created_at to a POSIXct date-time using the required format string so it is recognised as a date value (mutate() comes from the dplyr package):
MyDataFrame <- mutate(MyDataFrame,created_at = as.POSIXct(created_at, format="%a %b %d %H:%M:%S %z %Y"))
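The format string above matches timestamps laid out like "Tue Jun 02 12:30:32 +0000 2015". As a quick check of that step in isolation, here is a small base-R sketch; the vector `raw` is hypothetical sample input, and forcing the C locale is an assumption that the day and month abbreviations are in English.

```r
# %a and %b are matched against the current locale's day and month names;
# forcing the C locale here is an assumption that the input is in English
invisible(Sys.setlocale("LC_TIME", "C"))

# Hypothetical raw timestamps in the layout the format string expects
raw <- c("Tue Jun 02 12:30:32 +0000 2015", "Thu May 28 10:25:47 +0000 2015")

parsed <- as.POSIXct(raw, format = "%a %b %d %H:%M:%S %z %Y", tz = "UTC")

# Any string that failed to match becomes NA, so check before aggregating
stopifnot(!anyNA(parsed))
format(parsed, "%Y-%m")
```

If the check fails on real data, the usual culprits are a different timestamp layout or a non-English locale.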
Now convert it to a year-month value and create a new data frame called df2, with the constant character string "Tweets" in its first column; we will count the rows for each year-month value next:
df2 <- data.frame("Tweets",strftime(MyDataFrame$created_at, format = "%Y-%m"))
Rename the columns to something more user friendly:
colnames(df2) <- c("Tweeter","TimePeriod")
Use the aggregate function to count the number of rows in column Tweeter for each value of TimePeriod:
result <- aggregate(df2$Tweeter, by = list(df2$TimePeriod), FUN = length)
Add another column to the result to store the name of the Twitter account used:
result <- mutate(result ,Account ="MyTwitter")
Rename the columns to be more user friendly:
colnames(result) <- c("TimePeriod","Tweets","Tweeter")
Plot the result using ggplot and rotate the x labels so they are a bit easier to read:
ggplot(result, aes(x = TimePeriod, y = Tweets, group = Tweeter, color = Tweeter)) +
  geom_line(size = 1) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
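The steps above handle one account at a time. To compare several accounts, one possible approach is to keep the per-account data frames in a named list, stack them with an account column, and aggregate once. This is only a sketch in base R: the helper `make_account`, the list `accounts`, and the timestamps are invented for illustration, while the column names follow the ones used above.

```r
# Hypothetical per-account data frames; in practice each would hold
# id_str and created_at for one Twitter account
make_account <- function(times) {
  data.frame(id_str     = as.character(seq_along(times)),
             created_at = as.POSIXct(times, tz = "UTC"),
             stringsAsFactors = FALSE)
}
accounts <- list(
  MyTweeter   = make_account(c("2015-01-03 10:00:00", "2015-01-20 09:30:00",
                               "2015-02-11 18:45:00")),
  YourTweeter = make_account(c("2015-01-15 08:00:00", "2015-03-02 12:00:00"))
)

# Stack the data frames, recording which account each row came from
combined <- do.call(rbind, lapply(names(accounts), function(nm) {
  one <- accounts[[nm]]
  one$Tweeter <- nm
  one
}))

# Bin by month; format = "%Y-%U" would give week-of-year bins instead
combined$TimePeriod <- format(combined$created_at, "%Y-%m")

# Count tweets per account and period
result <- aggregate(id_str ~ Tweeter + TimePeriod, data = combined, FUN = length)
names(result)[names(result) == "id_str"] <- "Tweets"
result
```

The resulting `result` has the same TimePeriod, Tweets and Tweeter columns as before, so the ggplot call above should work on it unchanged and draw one line per account.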