
How to get this kind of subset from a DataFrame in Pyspark?

For example, I have the following DataFrame:

+-----+----+------+
| idx | id | type |
+-----+----+------+
|   0 | 10 | a    |
|   1 | 10 | b    |
|   2 | 20 | b    |
|   3 | 30 | a    |
+-----+----+------+
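
For reference, a minimal sketch that builds this example DataFrame (assuming a local SparkSession; the name dataDF matches the answer below):

from pyspark.sql import SparkSession

# Assumed setup: a local SparkSession and the example rows shown above
spark = SparkSession.builder.master("local[*]").getOrCreate()
dataDF = spark.createDataFrame(
    [(0, 10, "a"), (1, 10, "b"), (2, 20, "b"), (3, 30, "a")],
    ["idx", "id", "type"],
)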

I want to obtain such a subset via the following sequential steps:

  1. get all the ids of type a
    • the filtered ids are 10 and 30
  2. get all the rows whose id is among those ids
    • rows 0 , 1 and 3 are selected

The resulting subset DataFrame is:

+-----+----+------+
| idx | id | type |
+-----+----+------+
|   0 | 10 | a    |
|   1 | 10 | b    |
|   3 | 30 | a    |
+-----+----+------+

How can I implement this in pyspark? Thanks in advance.


Another follow-up question: how to implement the following.

If the step is changed to:

  1. get all the rows whose id is different from those above
    • row 2 is selected, because only this row's id is neither 10 nor 30

The resulting DataFrame should be:

+-----+----+------+
| idx | id | type |
+-----+----+------+
|   2 | 20 | b    |
+-----+----+------+

You can use filter and join operations. For point number 1:

# Rows of type "a" provide the ids to keep
filterDF = dataDF.filter(dataDF.type == "a")
# left_semi keeps dataDF rows whose id matches, without duplicate columns
joinedDS = dataDF.join(filterDF, on="id", how="left_semi")
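
With the example data above, this keeps rows 0, 1 and 3, matching the desired subset (a usage sketch; the show() output follows from the expected result stated in the question):

joinedDS.orderBy("idx").show()
# +---+---+----+
# |idx| id|type|
# +---+---+----+
# |  0| 10|   a|
# |  1| 10|   b|
# |  3| 30|   a|
# +---+---+----+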

For point number 2 you can use a left_anti join:

# Anti join: keep dataDF rows whose id has no match in joinedDS
joinedDS1 = dataDF.join(joinedDS, on="id", how="left_anti")
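
Since only the ids matter for the match, anti-joining dataDF against filterDF directly gives the same result; a sketch under the same assumptions (otherDS is a hypothetical name):

# Rows whose id never appears among the type "a" ids
otherDS = dataDF.join(filterDF, on="id", how="left_anti")
otherDS.show()
# +---+---+----+
# |idx| id|type|
# +---+---+----+
# |  2| 20|   b|
# +---+---+----+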
