I have a dataframe whose values are false, true, or null. I want to create two dataframes: 1) one with just the True column names and 2) one with just the False column names. My initial thought is to create two dataframes (since later on they will be appended to a larger dataset); I also thought about converting the appropriate column names to a list and then using the list entries as column names.
I'm new to PySpark and trying to figure out how to do this without hardcoding any column names (I have a couple hundred columns). I know that I should not iterate through rows, since that would defeat the purpose of PySpark.
Each column will only have one boolean, either a T or an F, hence the multiple nulls per column. I tried using .filter, but it only filtered on one column and it printed all the other columns as opposed to just the F columns.
df.filter(df.col1 == 'F').show()
df:
+----+----+----+----+----+
|Name|col1|col2|col3|col4|
+----+----+----+----+----+
|   A|null|   F|   T|null|
|   A|   F|null|null|null|
|   E|null|null|null|   T|
+----+----+----+----+----+
EXPECTED OUTCOME
Dataframe w/ True Column Names:
+------+----+
|col3 |col4|
+------+----+
Dataframe w/ False Column Names (empty dataframe)
+------+----+
|col1 |col2|
+------+----+
You can take the first non-null value of each column using first with ignorenulls=True, then convert the resulting row to a dictionary:
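import pyspark.sql.functions as F
# first non-null value of every column, collected as a single row
r = df.select(*[F.first(i, ignorenulls=True).alias(i) for i in df.columns]).collect()
first_values = r[0].asDict()
# the lists are not named T and F, so the functions alias F is not overwritten
true_cols = [k for k, v in first_values.items() if v == 'T']
false_cols = [k for k, v in first_values.items() if v == 'F']
print(true_cols)
print(false_cols)
#['col3', 'col4']
#['col1', 'col2']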
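The dictionary step above can be pulled into a small helper, which makes it easy to check without a Spark session. This is only a sketch: the name split_columns_by_flag and its parameters are hypothetical, the input dict stands in for r[0].asDict(), and the whitespace stripping is a precaution (an assumption, since the question does not say whether the stored values are padded).

```python
from typing import Dict, List, Optional, Tuple


def split_columns_by_flag(
    first_values: Dict[str, Optional[str]],
    true_flag: str = "T",
    false_flag: str = "F",
) -> Tuple[List[str], List[str]]:
    """Return (true_cols, false_cols) from a {column: first non-null value} dict."""
    # Assumption: values may carry stray whitespace, so strip strings first.
    cleaned = {
        name: value.strip() if isinstance(value, str) else value
        for name, value in first_values.items()
    }
    # Columns whose value is neither flag (e.g. Name, or all-null) land in neither list.
    true_cols = [name for name, value in cleaned.items() if value == true_flag]
    false_cols = [name for name, value in cleaned.items() if value == false_flag]
    return true_cols, false_cols


# Stand-in for r[0].asDict(), built by hand from the sample data in the question.
sample = {"Name": "A", "col1": "F", "col2": "F", "col3": "T", "col4": "T"}
true_cols, false_cols = split_columns_by_flag(sample)
print(true_cols)
print(false_cols)
```

With the two lists in hand, df.select(*true_cols) and df.select(*false_cols) should give the two dataframes from the question. If only the headers are wanted, as in the expected outcome, adding .limit(0) would keep the schema and drop the rows.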
This should do the trick if df is a pandas DataFrame (a PySpark DataFrame would first need to be converted with df.toPandas()):
import pandas as pd
# get list of columns
dfListCols = df.columns.tolist()
# remove first column 'Name'
dfListCols.pop(0)
# create lists for T/F
truesList = list()
falseList = list()
# loop over columns
for col in dfListCols:
    # values of the current column
    tempDf = df[col]
    # check if contains T
    if 'T' in tempDf.values:
        # if yes add to truesList
        truesList.append(col)
    else:
        # if no add to falseList (this also catches all-null columns)
        falseList.append(col)
# get subDFrames
trueDF = df[truesList]
falseDF = df[falseList]