Spark Scala DataFrame: describe non-numeric columns

Question

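Is there a function similar to describe() for non-numeric columns?

I would like to collect statistics about the "data completeness" of my table. For example:

total number of records
total number of null values
total number of special values (e.g. 0, empty strings, etc.)
total number of distinct values
other things like this...

data.describe() produces interesting values (count, mean, standard deviation, min, max) only for numeric columns. Is there something that works well with strings or other types?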

Answer 1

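There isn't. The problem is that basic statistics on numeric data are cheap. On categorical data, some of them may require multiple scans of the data and unbounded memory (linear in the number of records).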
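To illustrate the memory point: an exact distinct count has to remember every value it has seen, while an approximate one keeps a fixed-size sketch per column. The following is a minimal sketch, assuming an approximate answer is acceptable and that the columns hold simple (non-nested) types. The helper name approxDistinctCounts and its parameter are hypothetical; approx_count_distinct is the Spark SQL function it relies on.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{approx_count_distinct, col}

// Hypothetical helper (not part of the original answers): approximate number
// of distinct values for every column, computed in a single pass.
// maxRelativeError is passed as the rsd argument of approx_count_distinct.
def approxDistinctCounts(df: DataFrame, maxRelativeError: Double = 0.05): DataFrame =
  df.select(df.columns.toSeq.map { name =>
    // Backticks guard against column names that contain dots or spaces.
    approx_count_distinct(col(s"`$name`"), maxRelativeError).as(s"${name}_approx_distinct")
  }: _*)
```

A smaller maxRelativeError gives tighter estimates at the cost of a larger sketch per column, so the trade-off between accuracy and memory stays explicit.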

Some are very cheap, for example counting NULL or empty values: see "Count number of non-NaN entries in each column of Spark dataframe with Pyspark".
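The referenced answer is written for PySpark. A rough Scala equivalent of the cheap case might look like the sketch below, which counts NULLs for every column and empty strings for string columns in one pass. The helper name countMissing and the output column suffixes are assumptions made for this illustration.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, count, lit, when}
import org.apache.spark.sql.types.StringType

// Hypothetical helper (not from the original answers): returns one row with,
// for every column, the number of NULLs, plus the number of empty strings
// for string columns.
def countMissing(df: DataFrame): DataFrame = {
  val aggs: Seq[Column] = df.schema.fields.toSeq.flatMap { f =>
    // Backticks guard against column names that contain dots or spaces.
    val c = col(s"`${f.name}`")
    // when(...) without otherwise yields NULL when the condition is false,
    // and count skips NULLs, so only matching rows are counted.
    val nulls = count(when(c.isNull, lit(1))).as(s"${f.name}_nulls")
    val empties =
      if (f.dataType == StringType) Seq(count(when(c === "", lit(1))).as(s"${f.name}_empty"))
      else Seq.empty[Column]
    nulls +: empties
  }
  // Selecting only aggregate expressions performs one global aggregation.
  df.select(aggs: _*)
}
```

Because all counters are part of the same aggregation, the data is scanned once regardless of how many columns the table has.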

Answer 2

Here is an example that collects the string column statistics described in the question:

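  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions._
  import org.apache.spark.sql.types.StringType

  // Builds a one-row profile of a single string column.
  def getStringColumnProfile(df: DataFrame, columnName: String): DataFrame = {
    df.select(columnName)
      // The flag columns are either true or NULL, and count() skips NULLs,
      // so count("isEmpty") is the number of rows where the flag is set.
      .withColumn("isEmpty", when(col(columnName) === "", true).otherwise(null))
      .withColumn("isNull", when(col(columnName).isNull, true).otherwise(null))
      .withColumn("fieldLen", length(col(columnName)))
      .agg(
        max(col("fieldLen")).as("max_length"),
        countDistinct(columnName).as("unique"), // NULLs are not counted as a value
        count("isEmpty").as("is_empty"),
        count("isNull").as("is_null")
      )
      .withColumn("col_name", lit(columnName))
  }

  // Profiles every string column and stacks the one-row results.
  // reduce throws on an empty collection, so df needs at least one string column.
  def profileStringColumns(df: DataFrame): DataFrame = {
    df.columns.filter(df.schema(_).dataType == StringType)
      .map(getStringColumnProfile(df, _))
      .reduce(_ union _)
      .select("col_name", "unique", "is_empty", "is_null", "max_length")
  }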

The same can be done for numeric columns:

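  import org.apache.spark.sql.types.NumericType

  // Builds a one-row profile of a single numeric column.
  def getNumericColumnProfile(df: DataFrame, columnName: String): DataFrame = {
    df.select(columnName)
      .withColumn("isZero", when(col(columnName) === 0, true).otherwise(null))
      .withColumn("isNull", when(col(columnName).isNull, true).otherwise(null))
      .agg(
        max(col(columnName)).as("max"),
        count("isZero").as("is_zero"),
        count("isNull").as("is_null"),
        min(col(columnName)).as("min"),
        avg(col(columnName)).as("avg"),
        stddev(col(columnName)).as("std_dev")
      )
      .withColumn("col_name", lit(columnName))
      // col_type is selected in profileNumericColumns, so it has to be created here.
      .withColumn("col_type", lit(df.schema(columnName).dataType.simpleString))
  }

  // Profiles every numeric column and stacks the one-row results.
  def profileNumericColumns(df: DataFrame): DataFrame = {
    // Match on the type itself. Comparing dataType.toString with "DecimalType"
    // never matches, because a decimal prints as e.g. DecimalType(10,0).
    // NumericType also covers ByteType, which the original name list left out.
    df.columns.filter(df.schema(_).dataType.isInstanceOf[NumericType])
      .map(getNumericColumnProfile(df, _))
      .reduce(_ union _)
      .select("col_name", "col_type", "is_null", "is_zero", "min", "max", "avg", "std_dev")
  }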

Answer 3

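Here is some code that helps with profiling non-numeric data. See:
https://github.com/jasonsatran/spark-meta/

To improve performance, we can sample the data or select only the columns we explicitly want to analyze.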
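Both performance ideas can be sketched with the standard DataFrame API. The helper below is hypothetical (its name, parameters, and default seed are assumptions and are not taken from the linked repository); it only relies on select and sample.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: shrink the input before profiling it.
// columns:  only these columns are kept.
// fraction: expected share of rows to keep, between 0.0 and 1.0.
def reduceForProfiling(
    df: DataFrame,
    columns: Seq[String],
    fraction: Double,
    seed: Long = 42L): DataFrame =
  df.select(columns.map(name => col(s"`$name`")): _*)
    // Sampling without replacement; the number of rows kept is approximate.
    .sample(withReplacement = false, fraction = fraction, seed = seed)
```

Statistics computed on a sample are estimates. Null or empty counts can be scaled by 1 / fraction, but distinct counts cannot be scaled that way, so they are better computed on the full column.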

3 answers

Answer 1
1 2017-03-19 12:46:55

Answer 2
1 2020-09-14 14:47:50

Answer 3
0 2017-08-18 19:09:16
