
How can I summarize character columns in my dataframe in R?

With numeric and factor columns, summary() provides information that is useful for understanding the data. For example, this is the output for the iris dataset: [Image: output of summary(iris)]

Here, we see min, 1st quartile, median, mean, 3rd quartile, and max for the numeric columns, which is helpful for a quick spot-check. We also see counts on the factor column.

If I run the following code to create a data frame in which every column is character and then call summary() on it, the result isn't very helpful as a summary of the values in my data (at least for my purposes).

  library(dplyr)

  iris2 <- iris %>%
    mutate_all(as.character)

  summary(iris2)

[Image: output of summary(iris2)]

In general, when I use summary() on character columns, I'd like something more like the results I get for factor columns.

I realize that I can convert my character columns to factor and then run summary() with something like the below:

  iris3 <- iris2 %>%
    mutate_all(as.factor)

  summary(iris3)

[Image: output of summary(iris3)]

Is there a way to avoid this extra step when I spot-check my data? I ultimately want to keep working with the data as character columns rather than factors, and would prefer not to switch back and forth between the two types. It wouldn't matter to me if the conversion happened "behind the scenes". For what it's worth, an expanded summary() for numeric columns that included some of the high-frequency values would be interesting as well. Thank you in advance for any help.

If the goal is an overall summary of the dataset, skim() from the skimr package may be useful:

skimr::skim(iris)

Output:

── Data Summary ────────────────────────
                           Values
Name                       iris  
Number of rows             150   
Number of columns          5     
_______________________          
Column type frequency:           
  factor                   1     
  numeric                  4     
________________________         
Group variables            None  

── Variable type: factor ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate ordered n_unique top_counts               
1 Species               0             1 FALSE          3 set: 50, ver: 50, vir: 50

── Variable type: numeric ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist 
1 Sepal.Length          0             1  5.84 0.828   4.3   5.1  5.8    6.4   7.9 ▆▇▇▅▂
2 Sepal.Width           0             1  3.06 0.436   2     2.8  3      3.3   4.4 ▁▆▇▂▁
3 Petal.Length          0             1  3.76 1.77    1     1.6  4.35   5.1   6.9 ▇▁▆▇▂
4 Petal.Width           0             1  1.20 0.762   0.1   0.3  1.3    1.8   2.5 ▇▁▇▅▃
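
If adding a package is not an option, the "behind the scenes" conversion and the high-frequency values mentioned in the question can be approximated in base R. The following is only a minimal sketch: summary_chr() and top_values() are hypothetical helper names that do not come from the question, the answer, or any package, and iris2 is rebuilt here with base R so that the block runs on its own.

```r
# Hypothetical helper: summarise a data frame, treating character columns
# as factors only inside the call. The caller's data frame is not modified,
# because R copies the argument when it is changed within the function.
summary_chr <- function(df, ...) {
  df[] <- lapply(df, function(x) if (is.character(x)) factor(x) else x)
  summary(df, ...)
}

# Hypothetical helper: the n most frequent values of each column.
# Works for any column type, numeric columns included.
top_values <- function(df, n = 3) {
  lapply(df, function(x) head(sort(table(x), decreasing = TRUE), n))
}

# An all-character copy of iris, built without dplyr.
iris2 <- as.data.frame(lapply(iris, as.character), stringsAsFactors = FALSE)

summary_chr(iris2)    # per-value counts, as summary() gives for factors
top_values(iris)      # most common values, numeric columns included
class(iris2$Species)  # still "character"
```

Because summary() shows only the most frequent levels of a factor and lumps the rest into "(Other)", columns with many distinct values stay readable. The maxsum argument of summary() can be passed through summary_chr() to change how many levels are listed.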
