Making a histogram with a Spark DataFrame column

Question

I am trying to make a histogram from a column of a DataFrame that looks like this:

DataFrame[C0: int, C1: int, ...]

If I want to make a histogram of column C1, what should I do?

Some things I have tried are:

df.groupBy("C1").count().histogram()
df.C1.countByValue()

Neither works, because of a mismatch in data types.

Answer 1

You can use the histogram_numeric Hive UDAF:

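import random

from pyspark.sql import HiveContext

random.seed(323)

# sc is an existing SparkContext (for example, the one the pyspark shell provides)
sqlContext = HiveContext(sc)
n = 3  # Number of buckets
df = sqlContext.createDataFrame(
    sc.parallelize(enumerate(random.random() for _ in range(1000))),
    ["id", "v"]
)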

hists = df.selectExpr("histogram_numeric({0}, {1})".format("v", n))

hists.show(1, False)
## +------------------------------------------------------------------------------------+
## |histogram_numeric(v,3)                                                              |
## +------------------------------------------------------------------------------------+
## |[[0.2124888140177466,415.0], [0.5918851340384337,330.0], [0.8890271451209697,255.0]]|
## +------------------------------------------------------------------------------------+
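According to the Hive documentation, histogram_numeric returns (x, y) pairs that hold bin centers and heights, not bin edges, and the bins are not uniformly spaced. As a minimal sketch, the pairs printed above are copied into a plain Python list (in a real job they would come from the collected row) and converted to relative frequencies. The helper name to_shares is made up for this illustration.

```python
# (x, y) pairs copied from the histogram_numeric output shown above:
# x is the bin center, y is the number of values assigned to that bin.
pairs = [
    (0.2124888140177466, 415.0),
    (0.5918851340384337, 330.0),
    (0.8890271451209697, 255.0),
]


def to_shares(pairs):
    """Convert (center, height) pairs into (center, share of total) pairs."""
    total = sum(height for _, height in pairs)
    return [(center, height / total) for center, height in pairs]


shares = to_shares(pairs)
for center, share in shares:
    print("center={:.3f} share={:.3f}".format(center, share))
```

The heights add up to the 1000 rows in the example DataFrame, so each share is simply the height divided by 1000.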

You can also extract the column of interest and use the histogram method on the RDD:

df.select("v").rdd.flatMap(lambda x: x).histogram(n)
## ([0.002028109534323752,
##  0.33410233677189705,
##  0.6661765640094703,
##  0.9982507912470436],
## [327, 326, 347])
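The tuple returned by RDD.histogram holds n + 1 bucket boundaries followed by n counts. A small sketch of pairing them up into (lower, upper, count) rows, using the numbers printed above. bucket_rows is a hypothetical helper name, not part of PySpark.

```python
# Result copied from the RDD.histogram(3) output shown above.
buckets = [
    0.002028109534323752,
    0.33410233677189705,
    0.6661765640094703,
    0.9982507912470436,
]
counts = [327, 326, 347]


def bucket_rows(buckets, counts):
    """Pair each count with the lower and upper boundary of its bucket."""
    if len(buckets) != len(counts) + 1:
        raise ValueError("expected one more boundary than counts")
    return list(zip(buckets, buckets[1:], counts))


rows = bucket_rows(buckets, counts)
for lower, upper, count in rows:
    # Buckets are open on the right, except the last one, which is closed.
    print("[{:.3f}, {:.3f}): {}".format(lower, upper, count))
```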

Answer 2

What worked for me is:

# RDD.histogram needs a bucket count (or a list of boundaries); 10 is an arbitrary choice
df.groupBy("C1").count().rdd.values().histogram(10)

I had to convert to an RDD because I found the histogram method in the pyspark.RDD class, but not in the pyspark.sql module.

Answer 3

The pyspark_dist_explore package that @Chris van den Berg mentioned is quite nice. If you prefer not to add an additional dependency, you can use this bit of code to plot a simple histogram.

import matplotlib.pyplot as plt
# Show histogram of the 'C1' column
bins, counts = df.select('C1').rdd.flatMap(lambda x: x).histogram(20)

# This is a bit awkward but I believe this is the correct way to do it 
plt.hist(bins[:-1], bins=bins, weights=counts)
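The call above works because the data is already aggregated: passing one point per bin (its left edge) with the bin's count as its weight makes the weighted histogram reproduce the counts. A pure-Python sketch of that idea follows. The edges and counts are made up, and weighted_hist is a hypothetical helper standing in for the plotting library's binning.

```python
from bisect import bisect_right


def weighted_hist(points, edges, weights):
    """Sum the weights of the points falling into each bin.

    Bins are half-open [lower, upper), except the last one, which also
    includes its upper edge. Points outside the edges are ignored.
    """
    totals = [0] * (len(edges) - 1)
    for point, weight in zip(points, weights):
        if point < edges[0] or point > edges[-1]:
            continue
        index = min(bisect_right(edges, point) - 1, len(totals) - 1)
        totals[index] += weight
    return totals


edges = [0.0, 0.25, 0.5, 0.75, 1.0]  # made-up bin boundaries
counts = [12, 7, 30, 5]              # made-up pre-aggregated counts

# One point per bin, placed on the left edge and weighted by the bin's count.
recovered = weighted_hist(edges[:-1], edges, counts)
print(recovered)
```

Because each left edge falls inside its own bin, the recovered totals match the original counts, which is why no raw data needs to be collected to the driver for plotting.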

Answer 4

Let's say your values in C1 are between 1 and 1000 and you want a histogram with 10 bins. You can do something like:

from pyspark.sql.functions import floor
df.withColumn("bins", floor((df.C1 - 1) / 100)).groupBy("bins").count()

The floor call matters: dividing a Spark column with / yields a double, so df.C1/100 on its own would produce a separate group for each distinct value of C1. If your binning is more complex you can write a UDF for it (and at worst, you might need to analyze the column first, e.g. by using describe or some other method).
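The bin-index arithmetic behind this kind of fixed-width binning can be checked in plain Python before moving it to Spark. The sketch below assumes integer values from 1 to 1000 and a bin width of 100, as in the example. bin_index is an illustrative name, and the range object stands in for the C1 column.

```python
from collections import Counter


def bin_index(value, lowest=1, width=100):
    """Return the zero-based index of the fixed-width bin containing value."""
    return (value - lowest) // width


values = range(1, 1001)  # stand-in for the C1 column
hist = Counter(bin_index(v) for v in values)

for index in sorted(hist):
    low = 1 + index * 100
    print("bin {} [{}-{}]: {}".format(index, low, low + 99, hist[index]))
```

Subtracting the lowest value before dividing keeps 1000 in the last bin, so the result has exactly 10 bins rather than 11.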

Answer 5

If you want to plot the histogram, you could use the pyspark_dist_explore package:

import matplotlib.pyplot as plt
from pyspark_dist_explore import hist

fig, ax = plt.subplots()
hist(ax, df.groupBy("C1").count().select("count"))

If you would like the data in a pandas DataFrame, you could use:

from pyspark_dist_explore import pandas_histogram
pandas_df = pandas_histogram(df.groupBy("C1").count().select("count"))

Answer 6

One easy way could be:

import pandas as pd
x = df.select('symboling').toPandas()  # symboling is the column for histogram
x.plot(kind='hist')

6 answers

Answer 1
14 2016-03-16 18:55:34

Answer 2
14 2016-03-17 12:05:05

Answer 3
14 2017-08-22 19:49:08

Answer 4
2 2016-03-16 18:30:39

Answer 5
2 2017-07-18 15:05:09

Answer 6
-2 2020-05-26 03:52:20
