
group_by date column in dplyr

After extensive searching on this issue, I still cannot find the solution. I have a simple data frame with 43 rows and 2 columns. The first column contains two dates: the first date appears 19 times and the other 24 times. The second column is temperature. I want to find the max and min temperature by date, but my code keeps printing the max and min of the entire data set.

Data:

Date <- c(rep(x = "2017-05-18", each= 19), rep(x = "2017-05-19", each= 24))


Temperature_F <- c(35, 35, 42, 49, 57, 63, 64, 67, 70, 71, 72, 71, 72, 70, 66, 61, 57, 54, 50, 49, 45, 44, 44, 42, 40, 39, 47, 53, 61, 67, 69, 
    72, 75, 76, 77, 76, 77, 75, 71, 66, 62, 58, 54)

NWS_temps1 <- data.frame(Date, Temperature_F)

Here is my dplyr code, which keeps giving me the max and min for the entire temperature column when I think it should give me the max and min temperature by date.

NWS_temps1 <- tbl_df(NWS_temps1)

NWS_temps1 %>%
  group_by(Date) %>% 
  summarise(Tmax = max(Temperature_F), Tmin= min(Temperature_F))

The output I get is:

 Tmax Tmin
  77   35

When I am hoping for:

Date        Tmax Tmin
2017-05-18   72   35
2017-05-19   77   39

I don't understand why Date isn't being grouped as it should. I've tried changing Date to a factor (as it is here), a character, a Date object, and even POSIXct, but my result is always the max and min of the whole data frame.

Any help is much appreciated.

Thanks.

NWS_temps1 %>%
group_by(as.character(Date)) %>% 
summarise(Tmax = max(Temperature_F), Tmin= min(Temperature_F))

Looks like you are using the standard evaluation version group_by_() instead of the NSE version group_by(). Try it without the underscore:

NWS_temps1 %>%
    group_by(Date) %>% 
    summarise(Tmax = max(Temperature_F), Tmin= min(Temperature_F))

#> # A tibble: 2 x 3
#>         Date  Tmax  Tmin
#>        <chr> <dbl> <dbl>
#> 1 2017-05-18    72    35
#> 2 2017-05-19    77    39

The answers provided by others using dplyr should work. However, if for some reason dplyr is not working, here is a solution using tapply from base R.

dt <- data.frame(Date = unique(NWS_temps1$Date),
                 Tmax = tapply(NWS_temps1$Temperature_F, NWS_temps1$Date, FUN = max),
                 Tmin = tapply(NWS_temps1$Temperature_F, NWS_temps1$Date, FUN = min)) 
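A possible refinement of the tapply() approach, offered as a sketch rather than as the answerer's code: tapply() returns its results ordered by the sorted grouping values, while unique() returns values in order of first appearance, so the two could fall out of step if the rows were not already sorted by date. Taking the labels from the names of the tapply() result avoids relying on that. The name dt_safe is made up for this illustration.

```r
# Rebuild the question's data so the sketch is self-contained
Date <- rep(c("2017-05-18", "2017-05-19"), times = c(19, 24))
Temperature_F <- c(35, 35, 42, 49, 57, 63, 64, 67, 70, 71, 72, 71, 72, 70, 66,
                   61, 57, 54, 50, 49, 45, 44, 44, 42, 40, 39, 47, 53, 61, 67,
                   69, 72, 75, 76, 77, 76, 77, 75, 71, 66, 62, 58, 54)
NWS_temps1 <- data.frame(Date, Temperature_F)

# One tapply() call per statistic; both results share the same group order
tmax <- tapply(NWS_temps1$Temperature_F, NWS_temps1$Date, max)
tmin <- tapply(NWS_temps1$Temperature_F, NWS_temps1$Date, min)

# Use the group labels carried by the result instead of unique(Date)
dt_safe <- data.frame(Date = names(tmax),
                      Tmax = as.vector(tmax),
                      Tmin = as.vector(tmin))
print(dt_safe)
```

With the question's data this should give the same two rows the asker was hoping for.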

There are summarise functions in both the dplyr and plyr packages. I'm guessing that the order in which the packages were loaded meant that the plyr version of the function was being used, which would give you the results you were seeing. You can manually specify which version of the function you want to use by prepending the package name, like this: dplyr::summarise(...).

# Specify the plyr version:
> NWS_temps1 %>%
+   group_by(Date) %>% 
+   plyr::summarise(Tmax = max(Temperature_F), Tmin= min(Temperature_F))
  Tmax Tmin
1   77   35

# Specify the dplyr version:
> NWS_temps1 %>%
+   group_by(Date) %>% 
+   dplyr::summarise(Tmax = max(Temperature_F), Tmin= min(Temperature_F))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 3
  Date        Tmax  Tmin
  <fct>      <dbl> <dbl>
1 2017-05-18    72    35
2 2017-05-19    77    39
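To check which version is actually being picked up in a session, one option is to ask R where the function was defined. This is only a sketch: defined_in is a hypothetical helper name, and the dplyr part runs only if that package is installed.

```r
# Hypothetical helper: name of the package namespace a function was defined in
defined_in <- function(fn) environmentName(environment(fn))

# Works on any non-primitive function, e.g. one from the stats package
print(defined_in(stats::median))

if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  # If this prints "plyr" instead of "dplyr", plyr was attached later
  # and its summarise() is masking the dplyr one
  print(defined_in(summarise))
}

# Base R can also list every name that is defined in more than one place
masked <- conflicts(detail = TRUE)
```

If the helper reports plyr, either call dplyr::summarise() explicitly as shown above or attach plyr before dplyr.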

Edit: I've just noticed that Kim had already posted this as a comment on the original question.

I'm able to replicate the original group_by() issue when converting a date/time field represented as a number to a date with as.Date(). This might happen when working with a date/time field imported from an Excel file, because Excel stores dates as numbers.

library(dplyr)

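# Excel-style serial date/times: the fractional part is the time of day
dt <- c(43167.86, 43167.59, 43167.59, 43167.23, 43182.60, 43168.17, 43182)
df <- tibble(date = dt)  # tibble() replaces the deprecated data_frame()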

df %>% 
  mutate(date = as.Date(date, origin = '1899-12-30')) %>% 
  group_by(date) %>% 
  summarize(obs = n())
# A tibble: 6 x 2
  date         obs
  <date>     <int>
1 2018-03-08     1
2 2018-03-08     2
3 2018-03-08     1
4 2018-03-09     1
5 2018-03-23     1
6 2018-03-23     1

That gives multiple versions of the same date for '2018-03-08' and '2018-03-23'. One row of '2018-03-08' has two observations because there are two values of 43167.59 (the same date and time), while the other two 43167 values have different times. This could be a dplyr-related issue, as table(as.Date(df$date, origin = '1899-12-30')) works as expected.

One option is using lubridate::ymd():

library(lubridate)

df %>% 
  mutate(date = as.Date(date, origin = '1899-12-30')) %>% 
  mutate(date = ymd(date)) %>% 
  group_by(date) %>% 
  summarize(obs = n())
# A tibble: 3 x 2
  date         obs
  <date>     <int>
1 2018-03-08     4
2 2018-03-09     1
3 2018-03-23     2

Another (crude) solution is to convert the date to a character, and then back again if you want to keep it as a date:

df %>% 
  mutate(date = as.Date(date, origin = '1899-12-30')) %>% 
  mutate(date = as.Date(as.character(date))) %>% 
  group_by(date) %>% 
  summarize(obs = n())
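The split groups appear to come from as.Date() keeping the fractional (time-of-day) part of the number: the dates print identically but are stored as different values. A minimal base R sketch, assuming the time of day is not needed, is to drop the fraction with floor() before converting. The names serials, raw_dates and whole_dates are illustrative only.

```r
# Same Excel-style serial numbers as in the example above
serials <- c(43167.86, 43167.59, 43167.59, 43167.23, 43182.60, 43168.17, 43182)

# Converting directly keeps the fraction, so equal-looking dates still differ
raw_dates <- as.Date(serials, origin = "1899-12-30")
print(unclass(raw_dates))

# Dropping the fraction first leaves whole days only
whole_dates <- as.Date(floor(serials), origin = "1899-12-30")
print(table(format(whole_dates)))
```

Grouping on a column built like whole_dates should then collapse to one group per calendar day.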

The best solution might be to step back and set the column type to date when importing with readxl::read_excel(). That will import the field as a date/time, but then as.Date() and group_by() will work as expected. Example from the vignette:

library(readxl)

df <- read_excel(readxl_example("type-me.xlsx"), sheet = "date_coercion",
                 col_types = c("date", "text")) 
