
Merge the dataset by day or month

I have a raw dataset. The sample below contains a timestamp and three sentiment scores (negative, neutral, positive).

The raw dataset is:

created_time            neg_sentiment    neu_sentiment        pos_sentiment
2015-01-12T23:27:53+0000      0               0                   1
2015-01-13T00:36:15+0000      0               0                   1
2015-01-13T00:39:37+0000      0.02            0                 0.98
2015-01-13T01:26:05+0000      0.41           0.59                 0
2015-01-15T16:10:46+0000      0.14           0.02               0.84
2015-02-13T02:38:59+0000      0.86           0.14                 0
2015-01-13T21:00:15+0000      1               0                   0
2015-01-14T04:47:47+0000      0.96           0.04                 0
2015-02-14T06:09:17+0000      1               0                   0
2015-02-14T06:10:05+0000      1               0                   0
2015-01-14T06:44:47+0000      0.65           0.35                 0
2015-03-14T06:47:13+0000      0.07           0.93                 0
2015-01-14T10:16:09+0000      0               0                   1
2015-01-14T10:17:38+0000      0.08           0.85               0.07
2015-01-14T17:30:03+0000      1               0                   0
2015-01-14T20:17:43+0000      0.11            0                  0.89
2015-01-16T02:49:13+0000      0.5            0.5                  0
2015-03-26T13:20:06+0000      1               0                   0
2015-01-21T04:26:45+0000      0.39            0.01               0.6
2015-03-21T04:38:49+0000      0.01            0                  0.99

Using this dataset, I want to produce two outputs.

negative_proportion is calculated as neg_sentiment / (neg_sentiment + neu_sentiment + pos_sentiment). The first output is by month:

created_time        negative_proportion
  2015-01               10
  2015-02               20
  2015-03                5

The second output is by day:

created_time        negative_proportion
2015-01-12              10
2015-01-13              20
2015-01-14              3
2015-01-15              3
2015-01-16              3
2015-02-13              3
2015-02-14              3
2015-03-14              3
2015-03-21              3
2015-03-26              5

How can I produce the desired output? Could you please help me or suggest some code?

The dput() output for the original dataset is below:

structure(list(created_time = structure(c(1L, 2L, 3L, 4L, 12L, 
15L, 5L, 6L, 16L, 17L, 7L, 18L, 8L, 9L, 10L, 11L, 13L, 20L, 14L, 
19L), .Label = c("2015-01-12T23:27:53+0000", "2015-01-13T00:36:15+0000", 
"2015-01-13T00:39:37+0000", "2015-01-13T01:26:05+0000", "2015-01-13T21:00:15+0000", 
"2015-01-14T04:47:47+0000", "2015-01-14T06:44:47+0000", "2015-01-14T10:16:09+0000", 
"2015-01-14T10:17:38+0000", "2015-01-14T17:30:03+0000", "2015-01-14T20:17:43+0000", 
"2015-01-15T16:10:46+0000", "2015-01-16T02:49:13+0000", "2015-01-21T04:26:45+0000", 
"2015-02-13T02:38:59+0000", "2015-02-14T06:09:17+0000", "2015-02-14T06:10:05+0000", 
"2015-03-14T06:47:13+0000", "2015-03-21T04:38:49+0000", "2015-03-26T13:20:06+0000"
), class = "factor"), neg_sentiment = c(0, 0, 0.02, 0.41, 0.14, 
0.86, 1, 0.96, 1, 1, 0.65, 0.07, 0, 0.08, 1, 0.11, 0.5, 1, 0.39, 
0.01), neu_sentiment = c(0, 0, 0, 0.59, 0.02, 0.14, 0, 0.04, 
0, 0, 0.35, 0.93, 0, 0.85, 0, 0, 0.5, 0, 0.01, 0), pos_sentiment = c(1, 
1, 0.98, 0, 0.84, 0, 0, 0, 0, 0, 0, 0, 1, 0.07, 0, 0.89, 0, 0, 
0.6, 0.99)), class = "data.frame", row.names = c(NA, -20L))

You can use lubridate on created_time:

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date

library(tidyverse)



df_example <- structure(list(
  created_time = structure(
    c(1L, 2L, 3L, 4L, 12L, 15L, 5L, 6L, 16L, 17L,
      7L, 18L, 8L, 9L, 10L, 11L, 13L, 20L, 14L, 19L),
    .Label = c(
      "2015-01-12T23:27:53+0000", "2015-01-13T00:36:15+0000",
      "2015-01-13T00:39:37+0000", "2015-01-13T01:26:05+0000",
      "2015-01-13T21:00:15+0000", "2015-01-14T04:47:47+0000",
      "2015-01-14T06:44:47+0000", "2015-01-14T10:16:09+0000",
      "2015-01-14T10:17:38+0000", "2015-01-14T17:30:03+0000",
      "2015-01-14T20:17:43+0000", "2015-01-15T16:10:46+0000",
      "2015-01-16T02:49:13+0000", "2015-01-21T04:26:45+0000",
      "2015-02-13T02:38:59+0000", "2015-02-14T06:09:17+0000",
      "2015-02-14T06:10:05+0000", "2015-03-14T06:47:13+0000",
      "2015-03-21T04:38:49+0000", "2015-03-26T13:20:06+0000"
    ),
    class = "factor"
  ),
  neg_sentiment = c(0, 0, 0.02, 0.41, 0.14, 0.86, 1, 0.96, 1, 1,
                    0.65, 0.07, 0, 0.08, 1, 0.11, 0.5, 1, 0.39, 0.01),
  neu_sentiment = c(0, 0, 0, 0.59, 0.02, 0.14, 0, 0.04, 0, 0,
                    0.35, 0.93, 0, 0.85, 0, 0, 0.5, 0, 0.01, 0),
  pos_sentiment = c(1, 1, 0.98, 0, 0.84, 0, 0, 0, 0, 0,
                    0, 0, 1, 0.07, 0, 0.89, 0, 0, 0.6, 0.99)
), class = "data.frame", row.names = c(NA, -20L))

df_example %>% 
  group_by(year(created_time),month(created_time)) %>% 
  summarise_if(is.numeric,~sum(.,na.rm = TRUE)) %>% 
  mutate(prop = neg_sentiment/(neg_sentiment + neu_sentiment + pos_sentiment))
#> # A tibble: 3 x 6
#> # Groups:   year(created_time) [1]
#>   `year(created_t… `month(created_… neg_sentiment neu_sentiment pos_sentiment
#>              <dbl>            <dbl>         <dbl>         <dbl>         <dbl>
#> 1             2015                1          5.26          2.36          6.38
#> 2             2015                2          2.86          0.14          0   
#> 3             2015                3          1.08          0.93          0.99
#> # … with 1 more variable: prop <dbl>


df_example %>% 
  group_by(as_date(created_time)) %>% 
  summarise_if(is.numeric,~sum(.,na.rm = TRUE)) %>% 
  mutate(prop = neg_sentiment/(neg_sentiment + neu_sentiment + pos_sentiment))
#> # A tibble: 11 x 5
#>    `as_date(created_time)` neg_sentiment neu_sentiment pos_sentiment  prop
#>    <date>                          <dbl>         <dbl>         <dbl> <dbl>
#>  1 2015-01-12                       0             0             1    0    
#>  2 2015-01-13                       1.43          0.59          1.98 0.358
#>  3 2015-01-14                       2.8           1.24          1.96 0.467
#>  4 2015-01-15                       0.14          0.02          0.84 0.14 
#>  5 2015-01-16                       0.5           0.5           0    0.5  
#>  6 2015-01-21                       0.39          0.01          0.6  0.39 
#>  7 2015-02-13                       0.86          0.14          0    0.86 
#>  8 2015-02-14                       2             0             0    1    
#>  9 2015-03-14                       0.07          0.93          0    0.07 
#> 10 2015-03-21                       0.01          0             0.99 0.01 
#> 11 2015-03-26                       1             0             0    1

Created on 2020-01-08 by the reprex package (v0.3.0)
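The grouped output above keeps the summed score columns and auto-generated group names. If only the two columns from the question are wanted, something along these lines might work. This is a minimal sketch, not part of the original answer: it assumes dplyr 1.0.0 or later, swaps in base R's as.POSIXct() with an explicit format for the parsing step, and runs on a small hypothetical subset called df_small rather than the full data.

```r
library(dplyr)

# Illustrative four-row subset of the question's data (hypothetical name).
df_small <- data.frame(
  created_time = c("2015-01-12T23:27:53+0000", "2015-01-13T01:26:05+0000",
                   "2015-02-13T02:38:59+0000", "2015-02-14T06:09:17+0000"),
  neg_sentiment = c(0, 0.41, 0.86, 1),
  neu_sentiment = c(0, 0.59, 0.14, 0),
  pos_sentiment = c(1, 0, 0, 0)
)

monthly <- df_small %>%
  mutate(
    # Parse the ISO 8601 timestamp explicitly; as.character() also covers factors.
    parsed = as.POSIXct(as.character(created_time),
                        format = "%Y-%m-%dT%H:%M:%S%z", tz = "UTC"),
    # Collapse each timestamp to a "YYYY-MM" label.
    created_month = format(parsed, "%Y-%m")
  ) %>%
  group_by(created_month) %>%
  summarise(
    # Ratio of summed scores, matching the formula in the question.
    negative_proportion = sum(neg_sentiment) /
      sum(neg_sentiment + neu_sentiment + pos_sentiment),
    .groups = "drop"
  )

print(monthly)
```

Replacing "%Y-%m" with "%Y-%m-%d" in the format() call should give the per-day version. Multiply negative_proportion by 100 if a percentage is preferred.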

I would use a function like substring() to extract the first 10 characters of created_time and store them as a date variable. The same approach works for the month, using the first 7 characters ("2015-01").

# day label, e.g. "2015-01-12"
data$day <- substring(data$created_time, 1, 10)
# and/or month label, e.g. "2015-01"
data$month <- substring(data$created_time, 1, 7)

You've already provided the formula for calculating the negative proportion, so that part is easy:

data$negative_proportion <- data$neg_sentiment/(data$neg_sentiment + data$neu_sentiment + data$pos_sentiment)
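The line above gives one proportion per row, so the rows still need to be combined per day or per month. One possible way to finish the job in base R is sketched below. It assumes the scores should be summed within each period before dividing (as the formula in the question suggests), and it uses a small illustrative subset of the data. The names score_cols, daily_sums and daily are hypothetical.

```r
# Illustrative four-row subset of the question's data.
data <- data.frame(
  created_time = c("2015-01-13T00:39:37+0000", "2015-01-13T01:26:05+0000",
                   "2015-01-15T16:10:46+0000", "2015-02-14T06:09:17+0000"),
  neg_sentiment = c(0.02, 0.41, 0.14, 1),
  neu_sentiment = c(0, 0.59, 0.02, 0),
  pos_sentiment = c(0.98, 0, 0.84, 0)
)

# Day label taken from the first 10 characters of the timestamp.
data$day <- substring(data$created_time, 1, 10)

score_cols <- c("neg_sentiment", "neu_sentiment", "pos_sentiment")

# rowsum() adds up each score column within every day;
# the result has one row per day, with the day as its row name.
daily_sums <- rowsum(data[score_cols], group = data$day)

daily <- data.frame(
  created_time = rownames(daily_sums),
  negative_proportion = unname(daily_sums$neg_sentiment / rowSums(daily_sums)),
  row.names = NULL
)

print(daily)
```

Grouping on substring(data$created_time, 1, 7) instead should give the monthly table. Averaging the per-row proportions would be a different calculation and can give different numbers when a row's three scores do not sum to 1.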

Good luck!

Here is a base R solution, assuming the data frame is named df:

  • by month
# by month
dfout <- aggregate(df[-1], data.frame(created_time = gsub("(\\d+-\\d+).*","\\1",df[,1])), sum)
dfout <- within(dfout, neg_prop <- neg_sentiment/rowSums(dfout[-1])*100)

which gives

> dfout
  created_time neg_sentiment neu_sentiment pos_sentiment neg_prop
1      2015-01          5.26          2.36          6.38 37.57143
2      2015-02          2.86          0.14          0.00 95.33333
3      2015-03          1.08          0.93          0.99 36.00000
  • by day
# by day
dfout <- aggregate(df[-1], data.frame(created_time = gsub("(\\d+-\\d+-\\d+).*","\\1",df[,1])), sum)
dfout <- within(dfout, neg_prop <- neg_sentiment/rowSums(dfout[-1])*100)

which gives

> dfout
   created_time neg_sentiment neu_sentiment pos_sentiment  neg_prop
1    2015-01-12          0.00          0.00          1.00   0.00000
2    2015-01-13          1.43          0.59          1.98  35.75000
3    2015-01-14          2.80          1.24          1.96  46.66667
4    2015-01-15          0.14          0.02          0.84  14.00000
5    2015-01-16          0.50          0.50          0.00  50.00000
6    2015-01-21          0.39          0.01          0.60  39.00000
7    2015-02-13          0.86          0.14          0.00  86.00000
8    2015-02-14          2.00          0.00          0.00 100.00000
9    2015-03-14          0.07          0.93          0.00   7.00000
10   2015-03-21          0.01          0.00          0.99   1.00000
11   2015-03-26          1.00          0.00          0.00 100.00000
