简体   繁体   English

R或Matlab中的季节性箱形图

[英]Seasonal box plot in R or matlab

在此处输入图片说明

DATE        obs1        obs2    obs3    
1981-01-01  2032.409    3142.46 1741.143
1981-01-02  2023.687    3870.04 1735.256
1981-01-03  2014.274    4126.25 1728.556
1981-01-04  2005.795    2615.91 1722.985
1981-01-05  2000.674    2940.83 1722.317
1981-01-06  1998.477    3258.69 1723.937
1981-01-07  1997.014    3371.6  1724.104
1981-01-08  1995.576    3184.13 1722.624
1981-01-09  1993.706    3540.76 1719.592
1981-01-10  1991.286    3312.43 1715.156
1981-01-11  1988.633    3028.65 1710.141
1981-01-12  1986.147    3212.79 1705.183
1981-01-13  1984.229    3193.23 1700.789
1981-01-14  1982.756    3294.52 1697.785
1981-01-15  1981.548    3553.78 1696.068
1981-01-16  1980.561    3492.28 1694.544
1981-01-17  1979.792    2452.09 1692.36
1981-01-18  1979.224    1873.82 1689.525
1981-01-19  1978.845    3218.28 1686.452

I need to plot seasonal(winter, spring, summer and Fall) box plot in R for the daily data as shown above. 我需要在R中绘制季节性(冬季,春季,夏季和秋季)箱形图,以获取如上所述的每日数据。 I have 10 years of data in above format for different stations. 我有10年以上格式的不同电台数据。 The plot should be in one figure, with multiple box plots in each season. 该地块应该在一个图中,每个季节都有多个箱形地块。

A solution using tidyverse and lubridate . 使用tidyverselubridate tidyverse includes dplyr and tidyr to perform data wrangling, and ggplot2 to create the plot. tidyverse包括dplyrtidyr来执行数据ggplot2 ,而ggplot2来创建图。 lubridate is to handle the dates in the data frame. lubridate用于处理数据框中的日期。

Since the dataset you provided is not very useful, as it only contains a few records only from January thus impossible to create a boxplot showing the seasonal difference, I decided to create a new example data frame. 由于您提供的数据集不是很有用,因为它仅包含一月以来的一些记录,因此无法创建显示季节差异的箱线图,因此我决定创建一个新的示例数据框。 The structure of my example data frame is similar to your dataset, which should give you some hints as a starting point for your real-world problem. 我的示例数据框的结构类似于您的数据集,它应该给您一些提示,作为您解决现实问题的起点。

# Set the seed for reproducibility
set.seed(123)

library(tidyverse)
library(lubridate)

# Create example data frame
dt <- data_frame(DATE = seq(ymd("1980-01-01"), ymd("1989-12-31"), by = 1)) %>%
  mutate(obs1 = rnorm(nrow(.), mean = 0, sd = 1),
         obs2 = rnorm(nrow(.), mean = 1, sd = 2),
         obs3 = rnorm(nrow(.), mean = 2, sd = 3))

head(dt)
# # A tibble: 6 x 4
#         DATE        obs1       obs2      obs3
#       <date>       <dbl>      <dbl>     <dbl>
# 1 1980-01-01 -0.56047565  0.7874145 2.7827006
# 2 1980-01-02 -0.23017749  0.1517417 8.5720252
# 3 1980-01-03  1.55870831  0.7193725 1.3293478
# 4 1980-01-04  0.07050839  0.5454177 0.3253155
# 5 1980-01-05  0.12928774  1.4101239 6.1245771
# 6 1980-01-06  1.71506499 -0.6491910 4.5034395

tail(dt)
# # A tibble: 6 x 4
#         DATE       obs1       obs2       obs3
#       <date>      <dbl>      <dbl>      <dbl>
# 1 1989-12-26 -0.3629796  0.6750946  0.8586325
# 2 1989-12-27  0.1102218  2.8572337  9.8541328
# 3 1989-12-28 -0.2700741  1.7614026  1.9109596
# 4 1989-12-29  0.6920973  0.5275611 -0.4756240
# 5 1989-12-30  0.9282803  1.3811225  1.5222535
# 6 1989-12-31  0.5931301 -1.6638739  4.1157087

The example data frame contains records for 10 years with 3 observation groups. 示例数据框包含具有3个观察组的10年记录。 The values from each column are in normal distribution with different means and standard deviations. 每列的值均呈正态分布,均值和标准差均不同。

The first step is to process the data frame by converting the dataset from wide format to long format and add a column showing season information. 第一步是通过将数据集从宽格式转换为长格式并添加显示季节信息的列来处理数据框。

dt2 <- dt %>% 
  # Convert data frame from lwide format to long format
  gather(Observation, Value, -DATE) %>%
  # Remove "obs" in the Observation column
  mutate(Observation = str_replace(Observation, "obs", "")) %>%
  # Convert the DATE column to date class
  mutate(DATE = ymd(DATE)) %>%
  # Create Month column
  mutate(Month = month(DATE)) %>%
  # Create Season column
  mutate(Season = case_when(
    Month %in% c(12, 1, 2)      ~ "winter",
    Month %in% c(3, 4, 5)       ~ "spring",
    Month %in% c(6, 7, 8)       ~ "summer",
    Month %in% c(9, 10, 11)     ~ "fall",
    TRUE                        ~ NA_character_
  ))

After that, we can use ggplot2 to create the boxplot. 之后,我们可以使用ggplot2创建箱线图。 Notice that I use stat_summary to add a red line to each group to represent the mean. 注意,我使用stat_summary向每个组添加一条红线来表示平均值。

# Create a boxplot using ggplot2
# Specify the aesthetics
ggplot(dt2, aes(x = Season, y = Value, fill = Observation)) + 
  # Specify the geom to be boxplot
  geom_boxplot() +                
  # Add a red line to the mean
  stat_summary(aes(ymax = ..y.., ymin = ..y..),
               fun.y = "mean", 
               geom = "errorbar",              # Use geom_errorbar to add line as mean
               color = "red", 
               width = 0.7, 
               position = position_dodge(width = 0.75), # Add the line to each group
               show.legend = FALSE)

在此处输入图片说明

Ok... the first step is to build up a function that can detect the season in which a given date falls. 好的,第一步是建立一个可以检测给定日期的季节的函数。 Fortunately, I already developed one long time ago that can also handle seasons in the southern hemisphere (which are reversed). 幸运的是,我很早以前就已经开发了它,它也可以处理南半球的季节(相反)。

The function doesn't implement any sanity check, because I was using it with already sanitized datasets, but it should not be hard for you to eventually implement some (unless you decide to sanitize your datasets before using it). 该函数没有实现任何完整性检查,因为我将其与已经清理过的数据集一起使用,但是最终实现它应该并不困难(除非您决定在使用它之前对数据集进行清理)。 It works in a vectorized way to maximize the performance of calculations within Matlab. 它以矢量化的方式工作,以在Matlab中最大化计算的性能。

Here is it: 就这个:

function season = GetSeason(date,southern_hemisphere)

    if (nargin == 1)
        southern_hemisphere = false;
    end

    [~,month,day] = datevec(date);
    offset = month + (day / 100);

    winter = (offset < 3.21) | (offset >= 12.22);
    spring = ~winter & (offset < 6.21);
    summer = ~winter & ~spring & (offset < 9.23);
    autumn = ~winter & ~spring & ~summer;

    offset(spring) = 0;
    offset(summer) = 1;
    offset(autumn) = 2;
    offset(winter) = 3;

    if (southern_hemisphere)
        offset = offset + 2;
    end

    season = mod(offset,4) + 1;
end

Now, the first step, within your script, is to extract your observations from a dataset file. 现在,在脚本中的第一步是从数据集文件中提取观察值。 In order to create a fully working demo for you, I created an Excel dataset. 为了为您创建一个完全正常的演示,我创建了一个Excel数据集。 But you could also use a CSV dataset with almost no changes in the code or other file formats handled by Matlab: 但是,您也可以使用CSV数据集,而Matlab处理的代码或其他文件格式几乎没有变化:

% detect the dataset columns format
opts = detectImportOptions('data.xlsx');

% impose a specific format for the dataset columns
opts = setvartype(opts,{'datetime' 'double' 'double' 'double'});

% extract data in a table variable
data = readtable('data.xlsx',opts);

% sanitize the table variable removing the rows with missing or invalid values
data = rmmissing(data);

% sort the table variable rows by date (default first rows, default ascending)
data = sortrows(data);

The second test consists in getting the corresponding season for the observation dates: 第二项测试包括获取观测日期的相应季节:

seasons = GetSeason(data.Date);

The third step, assuming we are performing all this process only for the first column of observations called Obs1 : 第三步,假设我们仅对观测的第一列Obs1执行所有这些过程:

spring_1 = data.Obs1(seasons == 1);
summer_1 = data.Obs1(seasons == 2);
autumn_1 = data.Obs1(seasons == 3);
winter_1 = data.Obs1(seasons == 4);

The fourth and final step consists in plotting one boxplot for each season in a single graph (the grouping variable groups must be passed as argument into the boxplot function in order to aknowledge the latter of how many boxes it has to draw and using which values): 第四步也是最后一步是在一张图中为每个季节绘制一个箱线图(分组变量groups必须作为参数传递到boxplot函数中,以便了解后者必须绘制多少箱以及使用哪些值)。 :

groups = [
    ones(size(spring_1));
    2 * ones(size(summer_1));
    3 * ones(size(autumn_1));
    4 * ones(size(winter_1));
];

figure();
boxplot([spring_1; summer_1; autumn_1; winter_1],groups);
set(gca,'XTickLabel',{'Spring' 'Summer' 'Autumn' 'Winter'});

And here is the result: 结果如下:

结果

UPDATE WITH FULL WORKING CODE FOR ALL OBSERVATIONS 使用所有观察的完整工作代码进行更新

opts = detectImportOptions('data.xlsx');
opts = setvartype(opts,{'datetime' 'double' 'double' 'double'});

data = readtable('data.xlsx',opts);
data = rmmissing(data);
data = sortrows(data);

seasons = GetSeason(data.Date);

spring_1 = data.Obs1(seasons == 1);
summer_1 = data.Obs1(seasons == 2);
autumn_1 = data.Obs1(seasons == 3);
winter_1 = data.Obs1(seasons == 4);
spring_2 = data.Obs2(seasons == 1);
summer_2 = data.Obs2(seasons == 2);
autumn_2 = data.Obs2(seasons == 3);
winter_2 = data.Obs2(seasons == 4);
spring_3 = data.Obs3(seasons == 1);
summer_3 = data.Obs3(seasons == 2);
autumn_3 = data.Obs3(seasons == 3);
winter_3 = data.Obs3(seasons == 4);

plot_data = [
    spring_1;
    summer_1;
    autumn_1;
    winter_1;
    spring_2; 
    summer_2; 
    autumn_2;
    winter_2;
    spring_3;
    summer_3;
    autumn_3;
    winter_3
];

plot_groups = [
    (1 * ones(size(spring_1))) (1  * ones(size(spring_1)));
    (1 * ones(size(summer_1))) (2  * ones(size(summer_1)));
    (1 * ones(size(autumn_1))) (3  * ones(size(autumn_1)));
    (1 * ones(size(winter_1))) (4  * ones(size(winter_1)));
    (2 * ones(size(spring_2))) (5  * ones(size(spring_2)));
    (2 * ones(size(summer_2))) (6  * ones(size(summer_2)));
    (2 * ones(size(autumn_2))) (7  * ones(size(autumn_2)));
    (2 * ones(size(winter_2))) (8  * ones(size(winter_2)));
    (3 * ones(size(spring_3))) (9  * ones(size(spring_3)));
    (3 * ones(size(summer_3))) (10 * ones(size(summer_3)));
    (3 * ones(size(autumn_3))) (11 * ones(size(autumn_3)));
    (3 * ones(size(winter_3))) (12 * ones(size(winter_3)))
];

labels_obs = {'' '' '' '' '' '' '' '' '' '' '' ''};
labels_season = repmat({'Spring' 'Summer' 'Autumn' 'Winter'},1,3);

figure('Units','normalized','Position',[0.05 0.1 0.9 0.8]);
boxplot(plot_data,plot_groups, ...
    'BoxStyle','outline', ...
    'FactorGap',[5 1], ...
    'Labels',{labels_obs; labels_season}, ...
    'Notch','on');

colors = repmat('wcyg',1,3);
h = findobj(gca,'Tag','Box');

for i = 1:numel(h)
    patch(get(h(i),'XData'),get(h(i),'YData'),colors(i),'FaceAlpha',0.5);
end

h = findall(allchild(findall(gca,'Type','hggroup')),'Type','text','String','');
positions = cell2mat(get(h,'pos'));
positions_new = num2cell([mean(reshape(positions(:,1),4,[]))' positions(1:4:end,2:end)],2);
set(h(1:4:end),{'Position'},positions_new,{'String'},{'Observations 3'; 'Observations 2'; 'Observations 1'})

h = findall(allchild(findall(gca,'Type','hggroup')),'Type','text','String','');
delete(h);

Result: 结果:

结果更新

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM