How to sort only one column within a spark dataframe using pyspark?

I have a Spark DataFrame that looks like this:

|  time  | col1 | col2 |
|----------------------|
| 123456 |   2  |  A   |
| 123457 |   4  |  B   |
| 123458 |   7  |  C   |
| 123459 |   5  |  D   |
| 123460 |   3  |  E   |
| 123461 |   1  |  F   |
| 123462 |   9  |  G   |
| 123463 |   8  |  H   |
| 123464 |   6  |  I   |

Now I need to sort the "col1" column, but the other columns have to remain in their original order (using pyspark):

|  time  | col1 | col2 | col1_sorted |
|-----------------------------------|
|  same  | same | same |   sorted   |
|-----------------------------------|
| 123456 |   2  |  A   |     1      |
| 123457 |   4  |  B   |     2      |
| 123458 |   7  |  C   |     3      |
| 123459 |   5  |  D   |     4      |
| 123460 |   3  |  E   |     5      |
| 123461 |   1  |  F   |     6      |
| 123462 |   9  |  G   |     7      |
| 123463 |   8  |  H   |     8      |
| 123464 |   6  |  I   |     9      |

Thanks in advance for any help!
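For reference, a minimal sketch that reproduces the sample dataframe above (it assumes an active SparkSession bound to the name spark, which is not part of the original question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table in the question
df = spark.createDataFrame(
    [(123456, 2, "A"), (123457, 4, "B"), (123458, 7, "C"),
     (123459, 5, "D"), (123460, 3, "E"), (123461, 1, "F"),
     (123462, 9, "G"), (123463, 8, "H"), (123464, 6, "I")],
    ["time", "col1", "col2"])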

For Spark 2.3.1, you can try pandas_udf, see below (this assumes the original dataframe is sorted by the time column):

from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType

# Copy the original schema and append the new integer column
schema = StructType.fromJson(df.schema.jsonValue()).add('col1_sorted', 'integer')

# Grouped-map UDF: receives each group as a pandas DataFrame and returns it
# re-sorted by time, with the sorted values of col1 attached as a new column
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def get_col1_sorted(pdf):
    return pdf.sort_values(['time']).assign(col1_sorted=sorted(pdf["col1"]))

# groupby() with no keys treats the whole dataframe as a single group
df.groupby().apply(get_col1_sorted).show()
+------+----+----+-----------+
|  time|col1|col2|col1_sorted|
+------+----+----+-----------+
|123456|   2|   A|          1|
|123457|   4|   B|          2|
|123458|   7|   C|          3|
|123459|   5|   D|          4|
|123460|   3|   E|          5|
|123461|   1|   F|          6|
|123462|   9|   G|          7|
|123463|   8|   H|          8|
|123464|   6|   I|          9|
+------+----+----+-----------+
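Note that groupby() with no grouping keys funnels the entire dataframe into a single pandas UDF call on one executor, so this approach is only practical when the data fits in a single worker's memory.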

Assuming df is the dataframe holding the actual values:

# Index the original rows; zipWithIndex numbers them in their current order
indexed = df.rdd.zipWithIndex().map(lambda p: p[0] + (p[1],)).toDF(df.columns + ["index"])
# Index a sorted copy of col1 the same way, then join the two on the index
sorted_col1 = (df.select("col1").orderBy("col1").rdd.zipWithIndex()
               .map(lambda p: (p[0][0], p[1])).toDF(["col1_sorted", "index"]))
df = indexed.join(sorted_col1, "index").drop("index").orderBy("time")
df.show()
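This works because zipWithIndex assigns indices in the RDD's existing row order, so as long as both sides are ordered deterministically (here by time and by col1 respectively), the indices line up one-to-one for the join.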

My own solution is the following:

First, make a copy of df with only col1 selected, ordered by col1:

df_copy = df.select("col1").orderBy("col1")

Second, index both dataframes (the same code applies to df_copy, just with the window ordered by col1; see the sketch after the snippet below):

import sys
from pyspark.sql import functions as F
from pyspark.sql.functions import lit
from pyspark.sql.window import Window

# Unbounded running sum up to the current row yields a 1-based row index
w = Window.orderBy("time").rowsBetween(-sys.maxsize, 0)

df = df\
    .withColumn("helper", lit(1))\
    .withColumn("index", lit(0))\
    .withColumn("index", F.col("index")+F.sum(F.col("helper")).over(w))

As the last step, rename col1 to col1_sorted and join the two dataframes on the index:

df_copy = df_copy.withColumnRenamed("col1", "col1_sorted")

# Join on matching row indices; the helper and index columns can be dropped afterwards
df = df.join(df_copy, df.index == df_copy.index, how="inner")
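As an aside, the helper-column running sum can be expressed more directly with the row_number window function; a minimal sketch of that variant (not the code from the answers above):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# row_number() produces the same 1-based index without helper columns
df_idx = df.withColumn("index", F.row_number().over(Window.orderBy("time")))
df_copy_idx = df.select(F.col("col1").alias("col1_sorted")) \
    .withColumn("index", F.row_number().over(Window.orderBy("col1_sorted")))

result = df_idx.join(df_copy_idx, "index").drop("index")
result.show()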
