
Calculate the percentage of missing values by year

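I have daily minimum temperature, maximum temperature, minimum dew point, and maximum dew point data. The data contain NaN values, so I want to know what percentage of the data is missing (NaN) in a given year, and then the total percentage over all of the data, by column.

That is: the percentage of NaN in each column for each year, plus the total percentage for the whole period (1948-2018).

My data: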

    Station  Date       Month  Day  Year  MaxTemp  MinTemp  MaxDewPoint  MinDewPoint
    ORD      1/1/1948   1      1    1948  35.6     26.6     34.16        -27.4
    ORD      1/2/1948   1      2    1948  -2       -16      -16.96       -27.04
    ORD      1/3/1948   1      3    1948  -4       -26      -12          -26
    ORD      1/4/1948   1      4    1948  -5       -26      -15          -26
    ORD      1/5/1948   1      5    1948  8        -25      3            NaN
    ORD      1/6/1948   1      6    1948  -11      -25      -24          -25
    ORD      1/7/1948   1      7    1948  1        -23      NaN          -23
    ORD      1/8/1948   1      8    1948  1        -22      -9           NaN
    ORD      1/9/1948   1      9    1948  NaN      -22      -5           -22
    ORD      1/10/1948  1      10   1948  10       NaN      -2           -22
    ORD      1/11/1948  1      11   1948  -11      -21      -23          -21
    ORD      1/12/1948  1      12   1948  3        -12      -7.96        -20.92
    ORD      1/13/1948  1      13   1948  6.98     -7.6     -7.6         -20.2
    ORD      1/14/1948  1      14   1948  3.92     -9.4     -11.2        NaN
    ORD      1/15/1948  1      15   1948  6        -7       -5.98        NaN
    ORD      1/16/1948  1      16   1948  3        -11      -7.96        -20.02

My code so far:

install.packages("dplyr")
library(dplyr)
install.packages("stringr")
library(stringr)
# Set the working directory
setwd("D:/Climate Data Analysis/Asignment 1")
# Read the CSV file
data <- read.csv("chiacagost.csv", header = TRUE, sep = ",")
# Build a data frame from the variables
dframe <- data.frame(data)
# Percentage of missing data by column:
# mean(is.na(x)) is the share of missing values, times 100 for a percentage

MisMxTMP <- dframe %>% summarise(NAMisMxTMP = mean(is.na(Max.Temp)) * 100)
misMnTMP <- dframe %>% summarise(NAmisMnTMPL = mean(is.na(Min.Temp)) * 100)
MisMxDTMP <- dframe %>% summarise(NAMisMxDTMP = mean(is.na(Max.Dew.Point)) * 100)
MisMnDTMP <- dframe %>% summarise(NAMisMnDTMP = mean(is.na(Min.Dew.Point)) * 100)

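I am able to calculate the overall percentage of missing data, but I want it by year, so that I can exclude the years with the largest missing percentage from my analysis.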

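To calculate the share of missing data by year and variable (pct_na is a fraction between 0 and 1; multiply by 100 for a percentage):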

> dframe %>% 
+     tidyr::gather(var, value, MaxTemp, MinTemp, MaxDewPoint, MinDewPoint) %>% 
+     dplyr::group_by(Year, var) %>% 
+     dplyr::summarise(pct_na = sum(is.nan(value)) / n())
# A tibble: 4 x 3
# Groups:   Year [?]
   Year var         pct_na
  <int> <chr>        <dbl>
1  1948 MaxDewPoint 0.0625
2  1948 MaxTemp     0.0625
3  1948 MinDewPoint 0.25  
4  1948 MinTemp     0.0625

To get the share of missing data for each year across all variables, just change group_by(Year, var) to group_by(Year).
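As a rough sketch of that group_by(Year) variant, the snippet below computes a per-year percentage and then keeps only the years under a cutoff, which is what the question wants in the end. The `toy` data frame, the `max_pct_na` threshold of 20, and the names `by_year` and `keep_years` are made up for illustration. `is.na()` is used instead of `is.nan()` on the assumption that missing values might be read in as either NA or NaN; it returns TRUE for both.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data using the same column names as the sample above
toy <- data.frame(
  Year        = c(1948, 1948, 1948, 1948, 1949, 1949),
  MaxTemp     = c(35.6, NaN, -4, -5, 8, 1),
  MinTemp     = c(26.6, -16, NaN, -26, -25, -23),
  MaxDewPoint = c(34.16, -16.96, -12, NaN, 3, NaN),
  MinDewPoint = c(-27.4, -27.04, -26, -26, NaN, -23)
)

# Percentage of missing values per year, pooled over the four variables
by_year <- toy %>%
  gather(var, value, MaxTemp, MinTemp, MaxDewPoint, MinDewPoint) %>%
  group_by(Year) %>%
  summarise(pct_na = 100 * mean(is.na(value)))

# Assumed cutoff: keep years with at most 20% missing values
max_pct_na <- 20
keep_years <- by_year %>%
  filter(pct_na <= max_pct_na) %>%
  pull(Year)

# Rows from the retained years only
toy_clean <- toy %>% filter(Year %in% keep_years)
```

The cutoff is a judgment call. Sorting `by_year` with `arrange(desc(pct_na))` and looking at the worst years first may be a more informative starting point than a fixed threshold.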

Data

dframe <- read.table(textConnection(gsub(" ORD ", "\nORD ", "Station Date Month Day Year MaxTemp MinTemp MaxDewPoint MinDewPoint ORD 1/1/1948 1 1 1948 35.6 26.6 34.16 -27.4 ORD 1/2/1948 1 2 1948 -2 -16 -16.96 -27.04 ORD 1/3/1948 1 3 1948 -4 -26 -12 -26 ORD 1/4/1948 1 4 1948 -5 -26 -15 -26 ORD 1/5/1948 1 5 1948 8 -25 3 NaN ORD 1/6/1948 1 6 1948 -11 -25 -24 -25 ORD 1/7/1948 1 7 1948 1 -23 NaN -23 ORD 1/8/1948 1 8 1948 1 -22 -9 NaN ORD 1/9/1948 1 9 1948 NaN -22 -5 -22 ORD 1/10/1948 1 10 1948 10 NaN -2 -22 ORD 1/11/1948 1 11 1948 -11 -21 -23 -21 ORD 1/12/1948 1 12 1948 3 -12 -7.96 -20.92 ORD 1/13/1948 1 13 1948 6.98 -7.6 -7.6 -20.2 ORD 1/14/1948 1 14 1948 3.92 -9.4 -11.2 NaN ORD 1/15/1948 1 15 1948 6 -7 -5.98 NaN ORD 1/16/1948 1 16 1948 3 -11 -7.96 -20.02")), header = T)
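The question also asks for the total percentage over the whole period, by column. A minimal base R sketch of that, assuming the same column names as the data above, is shown below. The `toy` data frame and the names `vars`, `overall_pct`, and `total_pct` are illustrative only.

```r
# Hypothetical toy data using the same column names as the sample above
toy <- data.frame(
  Year        = c(1948, 1948, 1949, 1949),
  MaxTemp     = c(35.6, NaN, 8, 1),
  MinTemp     = c(26.6, -16, NaN, NaN),
  MaxDewPoint = c(34.16, -16.96, 3, -9),
  MinDewPoint = c(NaN, NaN, NaN, -23)
)

vars <- c("MaxTemp", "MinTemp", "MaxDewPoint", "MinDewPoint")

# is.na() on a data frame gives a logical matrix; the column means of that
# matrix are the shares of missing values per variable over all years
overall_pct <- colMeans(is.na(toy[vars])) * 100

# Single figure over every value in the four columns
total_pct <- mean(is.na(as.matrix(toy[vars]))) * 100
```

Replacing `toy` with `dframe` should give the whole-period figures for the real data, provided its columns carry these names.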
