
Plotting the frequency of string matches over time in R

I've compiled a corpus of tweets sent over the past few months, which looks something like this (the actual corpus has many more columns and rows, but you get the idea):

id      when            time        day month   year    handle  what
UK1.1   Sat Feb 20 2016 12:34:02    20  2       2016    dave    Great goal by #lfc
UK1.2   Sat Feb 20 2016 15:12:42    20  2       2016    john    Can't wait for the weekend 
UK1.3   Sat Mar 01 2016 12:09:21    1   3       2016    smith   Generic boring tweet

Now what I'd like to do in R is, using grep for string matching, plot the frequency of certain words/hashtags over time, ideally normalised by the number of tweets from that month/day/hour/whatever. But I have no idea how to do this.

I know how to use grep to create subsets of this data frame, e.g. for all tweets including the #lfc hashtag, but I don't really know where to go from there.

The other issue is that whatever time scale is on my x-axis (hour/day/month etc.) needs to be numerical, and the 'when' column isn't. I've tried concatenating the 'month' and 'day' columns into something like '2.13' for February 13th, but then R treats 2.13 as 'earlier', so to speak, than 2.7 (February 7th), because it is the smaller number.

So basically, I'd like to make plots like these, where the frequency of string x is plotted against time.

Thanks!

Here's one way to count up tweets by day. I've illustrated with a simplified fake data set:

library(dplyr)
library(lubridate)

# Fake data
set.seed(485)
dat = data.frame(time = seq(as.POSIXct("2016-01-01"),as.POSIXct("2016-12-31"), length.out=10000), 
                 what = sample(LETTERS, 10000, replace=TRUE))

tweet.summary = dat %>% group_by(day = date(time)) %>%  # To summarise by month: group_by(month = month(time, label=TRUE))
  summarise(total.tweets = n(),
            A.tweets = sum(grepl("A", what)),
            pct.A = A.tweets/total.tweets,
            B.tweets = sum(grepl("B", what)),
            pct.B = B.tweets/total.tweets)            

tweet.summary 
         day total.tweets A.tweets      pct.A B.tweets      pct.B
1 2016-01-01           28        3 0.10714286        0 0.00000000
2 2016-01-02           27        0 0.00000000        1 0.03703704
3 2016-01-03           28        4 0.14285714        1 0.03571429
4 2016-01-04           27        2 0.07407407        2 0.07407407
...
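
For a real hashtag such as #lfc, the same counting can be sketched in base R on data shaped like the question's. The `tweets` data frame and its contents are invented for illustration, and case-insensitive matching is an assumption about what is wanted:

```r
# Hypothetical mini-corpus shaped like the question's data (values invented)
tweets <- data.frame(
  day   = c(20, 20, 20, 1, 1),
  month = c(2, 2, 2, 3, 3),
  year  = 2016,
  what  = c("Great goal by #lfc", "Can't wait for the weekend",
            "#LFC all the way", "Generic boring tweet", "Come on #lfc"),
  stringsAsFactors = FALSE
)

# Build a real Date so that days sort chronologically
tweets$date <- as.Date(sprintf("%d-%02d-%02d", tweets$year, tweets$month, tweets$day))

# TRUE where the tweet contains the hashtag, ignoring case
tweets$has.lfc <- grepl("#lfc", tweets$what, ignore.case = TRUE)

# Per-day totals and matches, then the normalised share
total <- tapply(tweets$has.lfc, tweets$date, length)
hits  <- tapply(tweets$has.lfc, tweets$date, sum)

lfc.by.day <- data.frame(date         = as.Date(names(total)),
                         total.tweets = as.vector(total),
                         lfc.tweets   = as.vector(hits))
lfc.by.day$pct.lfc <- lfc.by.day$lfc.tweets / lfc.by.day$total.tweets

lfc.by.day
```

Dividing by the per-day total is what normalises the counts, so a day with many tweets does not look like a spike just because it was busy.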

Here's a way to plot the data using ggplot2. I've also summarized the data frame on the fly within ggplot, using the dplyr and reshape2 packages:

library(ggplot2)
library(reshape2)
library(scales)

ggplot(dat %>% group_by(Month = month(time, label=TRUE)) %>%
         summarise(A = sum(grepl("A", what))/n(),
                   B = sum(grepl("B", what))/n()) %>%
         melt(id.var="Month"),
       aes(Month, value, colour=variable, group=variable)) +
  geom_line() +
  theme_bw() +
  scale_y_continuous(limits=c(0,0.06), labels=percent_format()) +
  labs(colour="", y="")

[Image: line plot of the monthly share of tweets matching "A" and "B"]
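
The question also mentions hourly resolution. As a rough sketch, the same share can be computed per hour in base R by truncating each timestamp to the start of its hour. The timestamps and tweet texts below are invented for illustration, and the time zone is fixed to UTC only to keep the example reproducible:

```r
# Hypothetical timestamps and tweets (invented for illustration)
time <- as.POSIXct(c("2016-02-20 12:34:02", "2016-02-20 12:50:10",
                     "2016-02-20 15:12:42", "2016-03-01 12:09:21"), tz = "UTC")
what <- c("Great goal by #lfc", "Generic boring tweet",
          "Can't wait for the weekend", "#lfc again")

# Label each tweet with the start of the hour it was sent in
hour.key <- format(time, "%Y-%m-%d %H:00")

# Count all tweets and matching tweets within each hour
total <- tapply(what, hour.key, length)
hits  <- tapply(grepl("#lfc", what, fixed = TRUE), hour.key, sum)

by.hour <- data.frame(hour  = as.POSIXct(names(total), tz = "UTC"),
                      total = as.vector(total),
                      hits  = as.vector(hits))
by.hour$share <- by.hour$hits / by.hour$total

by.hour
```

Because `hour` is a POSIXct column, it can go straight onto the x-axis of a plot. Hours with no tweets are simply absent from the result rather than shown as zero.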

Regarding your date formatting issue, here's how to get numeric dates: you can turn the day, month, and year columns into a date using as.Date, and/or turn the day, month, year, and time columns into a date-time column using as.POSIXct. Both will have underlying numeric values with a date class attached, so that R treats them as dates in plotting functions and other functions. Once you've done this conversion, you can run the code above to count up tweets by day, month, etc.

# Fake time data
dat2 = data.frame(day=sample(1:28, 10), month=sample(1:12,10), year=2016, 
                  time = paste0(sample(c(paste0(0,0:9),10:12),10),":",sample(10:50,10)))

# Create date-time format column from existing day/month/year/time columns
dat2$posix.date = with(dat2, as.POSIXct(paste0(year,"-", 
                                         sprintf("%02d",month),"-", 
                                         sprintf("%02d", day)," ", 
                                         time)))

# Create date format column
dat2$date = with(dat2, as.Date(paste0(year,"-", 
                                      sprintf("%02d",month),"-", 
                                      sprintf("%02d", day))))

dat2
   day month year  time          posix.date       date
1   28    10 2016 01:44 2016-10-28 01:44:00 2016-10-28
2   22     6 2016 12:28 2016-06-22 12:28:00 2016-06-22
3    3     4 2016 11:46 2016-04-03 11:46:00 2016-04-03
4   15     8 2016 10:13 2016-08-15 10:13:00 2016-08-15
5    6     2 2016 06:32 2016-02-06 06:32:00 2016-02-06
6    2    12 2016 02:38 2016-12-02 02:38:00 2016-12-02
7    4    11 2016 00:27 2016-11-04 00:27:00 2016-11-04
8   12     3 2016 07:20 2016-03-12 07:20:00 2016-03-12
9   24     5 2016 08:47 2016-05-24 08:47:00 2016-05-24
10  27     1 2016 04:22 2016-01-27 04:22:00 2016-01-27

You can see that the underlying values of a POSIXct date are numeric (number of seconds elapsed since midnight UTC on Jan 1, 1970) by running as.numeric(dat2$posix.date). Likewise for a Date object (number of days elapsed since Jan 1, 1970): as.numeric(dat2$date).
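
To tie this back to the '2.13' versus '2.7' problem in the question, here is a small sketch comparing the two encodings. The variable names are made up for the example:

```r
# The two dates from the question, as real Date objects
feb07 <- as.Date("2016-02-07")
feb13 <- as.Date("2016-02-13")

as.numeric(feb07)  # 16838 days since 1970-01-01
as.numeric(feb13)  # 16844

# Dates compare chronologically because the underlying numbers do
feb07 < feb13      # TRUE

# The "month.day" encoding puts February 13th before February 7th
2.13 < 2.7         # TRUE, which is the wrong order

# A POSIXct value is seconds since the epoch; dividing by the number of
# seconds per day recovers the Date value (shown here for a UTC timestamp)
t0 <- as.POSIXct("2016-02-20 12:34:02", tz = "UTC")
as.numeric(t0) %/% 86400 == as.numeric(as.Date(t0))  # TRUE
```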
