
Spark Scala DataFrame: describe() for non-numeric columns

Is there a function similar to describe() for non-numeric columns?

I'd like to gather stats about the 'data completeness' of my table, e.g.:

  • total number of records
  • total number of null values
  • total number of special values (e.g. 0s, empty strings, etc.)
  • total number of distinct values
  • other stuff like this...

data.describe() produces interesting values (count, mean, stddev, min, max) for numeric columns only. Is there anything that works well with Strings or other types?

There isn't. The problem is that basic statistics on numerical data are cheap, while on categorical data some of them may require multiple data scans and unbounded (linear in the number of records) memory.
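For instance, an exact distinct count has to keep every value seen in memory, whereas Spark's built-in approx_count_distinct uses a bounded-memory HyperLogLog sketch. A minimal illustration (assuming a DataFrame df with a column named c):

  import org.apache.spark.sql.functions.{approx_count_distinct, countDistinct}

  // Exact distinct count: aggregation state grows with the number of distinct values.
  df.agg(countDistinct("c")).show()

  // Approximate count: bounded memory, here with a 5% maximum relative error.
  df.agg(approx_count_distinct("c", 0.05)).show()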

Some are very cheap, though. For example, counting NULL or empty values: Count number of non-NaN entries in each column of Spark dataframe with Pyspark
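In Scala, that single-pass trick looks like this (a minimal sketch; countNulls is a name introduced here for illustration):

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.{col, count, when}

  // Count NULLs in every column in one scan; count() ignores the NULLs
  // that when() produces for non-matching rows.
  def countNulls(df: DataFrame): DataFrame =
    df.select(df.columns.map(c => count(when(col(c).isNull, c)).alias(c)): _*)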

Here is an example of computing the string column statistics described in the question:

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions._
  import org.apache.spark.sql.types.StringType

  def getStringColumnProfile(df: DataFrame, columnName: String): DataFrame = {
    df.select(columnName)
      // Flag empty and NULL values; count() below tallies only the non-null flags.
      .withColumn("isEmpty", when(col(columnName) === "", true).otherwise(null))
      .withColumn("isNull", when(col(columnName).isNull, true).otherwise(null))
      .withColumn("fieldLen", length(col(columnName)))
      .agg(
        max(col("fieldLen")).as("max_length"),
        countDistinct(columnName).as("unique"),
        count("isEmpty").as("is_empty"),
        count("isNull").as("is_null")
      )
      .withColumn("col_name", lit(columnName))
  }

  def profileStringColumns(df: DataFrame): DataFrame = {
    // Profile every string column and stack the one-row results.
    df.columns.filter(df.schema(_).dataType == StringType)
      .map(getStringColumnProfile(df, _))
      .reduce(_ union _)
      .select("col_name", "unique", "is_empty", "is_null", "max_length")
  }
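A quick usage sketch (assuming a SparkSession named spark and a made-up DataFrame):

  import spark.implicits._

  val sample = Seq(("a", ""), (null, "b"), ("a", "c")).toDF("x", "y")
  profileStringColumns(sample).show()
  // One row per string column: col_name, unique, is_empty, is_null, max_length.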

And this is the same for numeric columns:

  def getNumericColumnProfile(df: DataFrame, columnName: String): DataFrame = {
    df.select(columnName)
      // Flag zero and NULL values; count() below tallies only the non-null flags.
      .withColumn("isZero", when(col(columnName) === 0, true).otherwise(null))
      .withColumn("isNull", when(col(columnName).isNull, true).otherwise(null))
      .agg(
        max(col(columnName)).as("max"),
        count("isZero").as("is_zero"),
        count("isNull").as("is_null"),
        min(col(columnName)).as("min"),
        avg(col(columnName)).as("avg"),
        stddev(col(columnName)).as("std_dev")
      )
      .withColumn("col_name", lit(columnName))
      // Record the column's type so profileNumericColumns can select col_type.
      .withColumn("col_type", lit(df.schema(columnName).dataType.toString))
  }

  def profileNumericColumns(df: DataFrame): DataFrame = {
    // Profile every column whose type name is in the supported set.
    // Note: DecimalType.toString includes precision and scale (e.g. "DecimalType(10,2)"),
    // so decimal columns will not match this string comparison as written.
    df.columns.filter(
      Set("DecimalType", "IntegerType", "LongType", "DoubleType", "FloatType", "ShortType")
        contains df.schema(_).dataType.toString
    )
      .map(getNumericColumnProfile(df, _))
      .reduce(_ union _)
      .select("col_name", "col_type", "is_null", "is_zero", "min", "max", "avg", "std_dev")
  }
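Usage mirrors the string version (again just a sketch with a made-up DataFrame):

  val nums = Seq((0, 1.5), (3, 0.0), (7, 2.5)).toDF("a", "b")
  profileNumericColumns(nums).show()
  // One row per numeric column: col_name, col_type, is_null, is_zero, min, max, avg, std_dev.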

Here is a bit of code to help solve the problem of profiling non-numeric data. Please see:
https://github.com/jasonsatran/spark-meta/

To help with performance, we can sample the data or select only the columns that we want to explicitly profile.
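For example, profiling a 10% row sample restricted to two columns (the fraction and the column names here are arbitrary):

  val sampled = df.sample(withReplacement = false, fraction = 0.1, seed = 42)
    .select("x", "y") // hypothetical columns of interest
  profileStringColumns(sampled).show()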
