How to use multiple columns in filter and lambda functions pyspark

Question

I have a DataFrame from which I want to drop every column whose name starts with "test", "id_1", "vehicle", and so on.

I use the code below to drop the columns for a single pattern:

df1.drop(*filter(lambda col: 'test' in col, df.columns))

How can I specify all the patterns at once in this line? This doesn't work:

df1.drop(*filter(lambda col: 'test','id_1' in col, df.columns))

Answer 1

You can do something like the following:

# True if the column name starts with any of the prefixes
expression = lambda col: any(col.startswith(i) for i in ['test', 'id_1', 'vehicle'])
df1.drop(*filter(expression, df1.columns))
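Since df.columns is an ordinary Python list of strings, the prefix test can be tried without a Spark session. The sketch below is only an illustration with made-up column names. It relies on str.startswith accepting a tuple of prefixes, which avoids writing out each test separately.

```python
# Hypothetical column names standing in for df1.columns
columns = ["test_a", "id_1_x", "vehicle_type", "price", "id_2", "my_test"]

# str.startswith accepts a tuple and is True if any prefix matches
prefixes = ("test", "id_1", "vehicle")
cols_to_drop = [c for c in columns if c.startswith(prefixes)]

print(cols_to_drop)
```

The resulting list can then be passed on as df1.drop(*cols_to_drop). Note that 'test' in col, as used in the question, matches the text anywhere in the name, so it would also catch a column such as my_test, whereas startswith does not.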

Answer 2

In PySpark version 2.1.0, it is possible to drop multiple columns with drop by passing the names of the columns you want to drop as string arguments (see the documentation: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html?highlight=drop#pyspark.sql.DataFrame.drop ).

In your case, you may create a list containing the names of the columns you want to drop. For example:

cols_to_drop = [x for x in df1.columns if (x.startswith('test') or x.startswith('id_1') or x.startswith('vehicle'))]

Then call drop, unpacking the list:

df1.drop(*cols_to_drop)

Alternatively, you can achieve a similar result with select. For example:

# Define the columns you want to keep
cols_to_keep = [x for x in df1.columns if x not in cols_to_drop]

# Create a new DataFrame, df2, that keeps only the desired columns from df1
df2 = df1.select(cols_to_keep)

Note that with select you don't need to unpack the list.
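The drop and select routes are complements of each other, and this can be checked on plain lists before touching a DataFrame. The helper below, split_columns, is a hypothetical name introduced for this sketch, and the column names are invented.

```python
def split_columns(columns, prefixes):
    """Split column names into (to_drop, to_keep), preserving order."""
    prefixes = tuple(prefixes)  # str.startswith needs a tuple, not a list
    to_drop = [c for c in columns if c.startswith(prefixes)]
    to_keep = [c for c in columns if not c.startswith(prefixes)]
    return to_drop, to_keep


# Made-up names standing in for df1.columns
columns = ["id_1_a", "price", "test_flag", "vehicle_id", "id_2"]
to_drop, to_keep = split_columns(columns, ["test", "id_1", "vehicle"])

print(to_drop)
print(to_keep)
```

With a real DataFrame, df1.drop(*to_drop) and df1.select(to_keep) should then leave the same set of columns.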

Please note that this question also addresses a similar issue.

I hope this helps.

Answer 3

You can also use an ordinary column filter, as follows:

# Columns matching any prefix, plus one extra column dropped by exact name
for_columns = [x for x in df.columns if x.startswith(("test", "id_1", "vehicle"))] + ["c_007"]

df.drop(*for_columns)
