
In a pyspark dataframe, when I rename a column, the previous name can still be used for filtering. Bug or feature?

I work on Databricks with a PySpark DataFrame containing string-type columns. I use .withColumnRenamed() to rename one of them. Later in the process I use a .filter() to select rows that contain a certain substring. I accidentally used the old column name, and the filter still ran and produced the 'correct' results, as if I had used the new column name. My question is: is this a bug or a feature?

I reproduced the problem in a simple situation:

_test = sqlContext.createDataFrame([("abcd", "efgh"), ("kalp", "quarto"), ("aceg", "egik")], ['x1', 'x2'])
_test.show()

+----+------+
|  x1|    x2|
+----+------+
|abcd|  efgh|
|kalp|quarto|
|aceg|  egik|
+----+------+

_test2 = _test.withColumnRenamed('x1', 'new')

_test2.filter("x1 == 'aceg'").show()

+----+----+
| new|  x2|
+----+----+
|aceg|egik|
+----+----+
_test2.filter("substring(x1,1,2) == 'ka'").show()
+----+------+
| new|    x2|
+----+------+
|kalp|quarto|
+----+------+

I would have expected the filter commands to raise an error, since the column x1 no longer exists in "_test2". The weird thing is that the output shows the new name ('new').

Another example:

_test2.filter("substring(x1,1,1) == 'a'").show()

gives

+----+----+
| new|  x2|
+----+----+
|abcd|efgh|
|aceg|egik|
+----+----+

and _test2.filter("substring(x1,1,1) == 'a'").filter(F.col('x1') == 'abcd').show() gives

+----+----+
| new|  x2|
+----+----+
|abcd|efgh|
+----+----+

However, _test2.select(['x1', 'x2']).show() throws an error that 'x1' does not exist.

This is a known issue in Spark: string expressions passed to filter() can still resolve against column names from the DataFrame's lineage, even after a rename. The community decided not to fix it. See the related JIRA ticket for more information.
