
Spark Scala Dataframe describe non numeric columns

Is there a function similar to describe() for non-numeric columns?

I'd like to gather stats about the 'data completeness' of my table, e.g.:

  • total number of records
  • total number of null values
  • total number of special values (e.g. 0s, empty strings, etc.)
  • total number of distinct values
  • other stats like these...

data.describe() produces useful statistics (count, mean, stddev, min, max) for numeric columns only. Is there anything that works well with Strings or other types?

There isn't. The problem is that basic statistics on numerical data are cheap to compute. On categorical data, some of these may require multiple data scans and unbounded (linear in the number of records) memory.

Some are very cheap. For example, counting NULLs or empty strings: Count number of non-NaN entries in each column of Spark dataframe with Pyspark
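As a sketch of the "cheap" case, null and empty-string counts for every column can be gathered in a single scan by building one aggregate expression per column (the function name is illustrative; this assumes a DataFrame `df` and a Spark runtime):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, sum, when}

// One pass over the data: for each column, count nulls and empty strings.
def nullAndEmptyCounts(df: DataFrame): DataFrame = {
  val aggs = df.columns.flatMap { c =>
    Seq(
      sum(when(col(c).isNull, 1).otherwise(0)).as(s"${c}_nulls"),
      sum(when(col(c) === "", 1).otherwise(0)).as(s"${c}_empty")
    )
  }
  df.agg(aggs.head, aggs.tail: _*)
}
```

Because all counts are folded into one `agg` call, Spark computes them in a single job rather than one scan per column.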

Here is an example of getting the string-column statistics described in the question:

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.{col, count, countDistinct, length, lit, max, when}
  import org.apache.spark.sql.types.StringType

  def getStringColumnProfile(df: DataFrame, columnName: String): DataFrame = {
    df.select(columnName)
      .withColumn("isEmpty", when(col(columnName) === "", true).otherwise(null))
      .withColumn("isNull", when(col(columnName).isNull, true).otherwise(null))
      .withColumn("fieldLen", length(col(columnName)))
      .agg(
        max(col("fieldLen")).as("max_length"),
        countDistinct(columnName).as("unique"),
        count("isEmpty").as("is_empty"), // count() ignores nulls, so this counts the flagged rows
        count("isNull").as("is_null")
      )
      .withColumn("col_name", lit(columnName))
  }

  def profileStringColumns(df: DataFrame): DataFrame = {
    df.columns.filter(df.schema(_).dataType == StringType)
      .map(getStringColumnProfile(df, _))
      .reduce(_ union _)
      .select("col_name",
        "unique",
        "is_empty",
        "is_null",
        "max_length")
  }
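One caveat with the approach above: countDistinct is exact, so it can be expensive on high-cardinality columns (this is the unbounded-memory cost mentioned earlier). If approximate figures are acceptable, Spark's HyperLogLog-based approx_count_distinct is much cheaper. A hedged variant (function name illustrative):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{approx_count_distinct, col}

// Approximate distinct count with a 5% target relative standard deviation;
// bounded memory regardless of cardinality.
def approxUnique(df: DataFrame, columnName: String): DataFrame =
  df.agg(approx_count_distinct(col(columnName), rsd = 0.05).as("unique_approx"))
```

Swapping this into getStringColumnProfile trades exactness for a fixed memory footprint per column.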

and this is the same for numeric columns:

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.{avg, col, count, lit, max, min, stddev, when}
  import org.apache.spark.sql.types.NumericType

  def getNumericColumnProfile(df: DataFrame, columnName: String): DataFrame = {
    df.select(columnName)
      .withColumn("isZero", when(col(columnName) === 0, true).otherwise(null))
      .withColumn("isNull", when(col(columnName).isNull, true).otherwise(null))
      .agg(
        max(col(columnName)).as("max"),
        count("isZero").as("is_zero"),
        count("isNull").as("is_null"),
        min(col(columnName)).as("min"),
        avg(col(columnName)).as("avg"),
        stddev(col(columnName)).as("std_dev")
      )
      .withColumn("col_name", lit(columnName))
      // record the column's type so it can be selected in the summary below
      .withColumn("col_type", lit(df.schema(columnName).dataType.toString))
  }

  def profileNumericColumns(df: DataFrame): DataFrame = {
    // Match on NumericType rather than comparing dataType.toString against a
    // set of names: DecimalType prints with its precision and scale (e.g.
    // "DecimalType(10,2)"), so a string comparison would never match it.
    df.columns.filter(df.schema(_).dataType.isInstanceOf[NumericType])
      .map(getNumericColumnProfile(df, _))
      .reduce(_ union _)
      .select("col_name",
        "col_type",
        "is_null",
        "is_zero",
        "min",
        "max",
        "avg",
        "std_dev")
  }
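A minimal usage sketch, assuming both profiling functions are in scope and a local Spark runtime is available (the sample data is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("profile")
  .getOrCreate()
import spark.implicits._

// A small mixed-type table: one string column, one numeric column.
val df = Seq(("a", 1), ("", 0), (null, 3)).toDF("name", "score")

profileStringColumns(df).show()  // one row per string column
profileNumericColumns(df).show() // one row per numeric column
```

Note that each profiled column triggers its own aggregation before the `union`, so on wide tables this costs roughly one scan per column.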

Here is a bit of code to help solve the problem of profiling non-numeric data. Please see:
https://github.com/jasonsatran/spark-meta/

To help with performance, we can sample the data or select only the columns that we want to explicitly profile.
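For example, a sketch of both tactics combined (the column names and sampling fraction are placeholders):

```scala
// Profile a ~10% random sample of just the columns of interest.
// "name" and "city" are illustrative column names; tune the fraction
// to trade profiling accuracy against scan cost.
val sampled = df
  .select("name", "city")
  .sample(withReplacement = false, fraction = 0.1, seed = 42)
```

The resulting `sampled` DataFrame can then be fed to any of the profiling functions above; the counts become estimates that should be scaled by the inverse of the sampling fraction.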
