
Splitting dataframe by conditions in pySpark

I have a dataframe whose values are false, true, or null. I want to create two dataframes: 1) with just the True column names and 2) with just the False column names. My initial thought is to create two dataframes (since later on they will be appended to a larger dataset), but I also thought about converting the appropriate column names to a list and then turning the list entries into column names.

I'm new to pySpark and trying to figure out how to do this without hardcoding any column names (I have a couple hundred columns). I know that I cannot iterate through rows, since that would defeat the purpose of pySpark.

Each column will only have one boolean, either a T or an F, hence the multiple nulls per column. I tried using .filter, but it only filtered on one column and still printed all the other columns, as opposed to just the F columns.

df.filter(df.col1 == 'F').show() 
df:
+----+----+----+----+-----+
|Name|col1|col2|col3|col4 |
+----+----+----+----+-----+
|   A|null|  F | T  |null |
|   A| F  |null|null|null |
|   E|null|null|null|  T  |
+----+----+----+----+-----+


EXPECTED OUTCOME

Dataframe w/ True Column Names:
+------+----+
|col3  |col4|
+------+----+

Dataframe w/ False Column Names (empty dataframe)
+------+----+
|col1  |col2|
+------+----+

You can take the first non-null value of each column by using first with ignorenulls=True, then convert the resulting row to a dictionary:

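import pyspark.sql.functions as F

# First non-null value of every column, collected as a single Row
r = df.select(*[F.first(c, ignorenulls=True).alias(c) for c in df.columns]).collect()
first_values = r[0].asDict()

# Do not name a list F: that would shadow the functions module imported above
true_cols = [k for k, v in first_values.items() if v == 'T']
false_cols = [k for k, v in first_values.items() if v == 'F']

print(true_cols)
print(false_cols)

#['col3', 'col4']
#['col1', 'col2']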
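The partitioning step above can be pulled into a small helper, sketched here in plain Python so it runs without a Spark session. The function name `split_columns_by_flag`, its parameters, and the hard-coded `first_values` dict (a stand-in for `r[0].asDict()` on the question's sample data) are illustrative assumptions, not part of the original answer.

```python
def split_columns_by_flag(first_values, true_flag="T", false_flag="F", skip=("Name",)):
    """Partition column names by their single non-null flag value.

    first_values: dict mapping column name -> first non-null value,
    e.g. the result of r[0].asDict() in the answer above.
    """
    true_cols, false_cols = [], []
    for name, value in first_values.items():
        if name in skip:
            # Identifier columns are not flags, so leave them out
            continue
        if value == true_flag:
            true_cols.append(name)
        elif value == false_flag:
            false_cols.append(name)
        # Columns that are entirely null (value is None) go in neither list
    return true_cols, false_cols


# Stand-in for r[0].asDict() on the question's sample dataframe (assumed values)
first_values = {"Name": "A", "col1": "F", "col2": "F", "col3": "T", "col4": "T"}
true_cols, false_cols = split_columns_by_flag(first_values)
print(true_cols)
print(false_cols)

# With a live Spark dataframe df, the two dataframes would then be:
# true_df = df.select(*true_cols)
# false_df = df.select(*false_cols)
```

If only the column headers are wanted, as in the expected outcome, adding `.limit(0)` after each `select` should give an empty dataframe with the same columns.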

This should do the trick if df is a pandas DataFrame rather than a Spark one (the code below relies on the pandas API):

import pandas as pd

#get list of columns (df is assumed to be a pandas DataFrame)
dfListCols = df.columns.tolist()
#remove first column 'Name'
dfListCols.pop(0)
#create lists for T/F
truesList = list()
falseList = list()
#loop over columns
for col in dfListCols:
    #values of the current column
    colValues = df[col].values
    #check if contains T
    if 'T' in colValues:
        #if yes add to truesList
        truesList.append(col)
    #otherwise check for F, so all-null columns land in neither list
    elif 'F' in colValues:
        falseList.append(col)

#get subDFrames
trueDF = df[truesList]
falseDF = df[falseList]
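For reference, here is a self-contained sketch of the same idea on the question's sample data, written with list comprehensions instead of an explicit loop. The frame `pdf` is reconstructed from the table in the question (with None standing in for null), and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Sample data reconstructed from the question; None stands in for null
pdf = pd.DataFrame({
    "Name": ["A", "A", "E"],
    "col1": [None, "F", None],
    "col2": ["F", None, None],
    "col3": ["T", None, None],
    "col4": [None, None, "T"],
})

# Only the flag columns are inspected; 'Name' is an identifier
flags = pdf.drop(columns="Name")

# A column counts as True/False if any of its values equals 'T'/'F'
true_cols = [c for c in flags.columns if flags[c].eq("T").any()]
false_cols = [c for c in flags.columns if flags[c].eq("F").any()]

# Sub-frames holding only the matching columns
true_df = pdf[true_cols]
false_df = pdf[false_cols]

print(list(true_df.columns))
print(list(false_df.columns))
```

Columns that contain neither value end up in neither frame, which seems closer to the question's description than treating every non-T column as False.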
