
In a pyspark dataframe, when I rename a column, the previous name can still be used for filtering. Bug or feature?

I work on Databricks with a PySpark DataFrame containing string-type columns. I use .withColumnRenamed() to rename one of them. Later in the process I use a .filter() to select rows that contain a certain substring. I accidentally used the old column name, and the filter still ran and produced the 'correct' results, as if I had used the new column name. My question is: is this a bug or a feature?

I reproduced the problem in a simple situation:

_test = sqlContext.createDataFrame([("abcd", "efgh"), ("kalp", "quarto"), ("aceg", "egik")], ['x1', 'x2'])
_test.show()

+----+------+
|  x1|    x2|
+----+------+
|abcd|  efgh|
|kalp|quarto|
|aceg|  egik|
+----+------+

_test2 = _test.withColumnRenamed('x1', 'new')

_test2.filter("x1 == 'aceg'").show()

+----+----+
| new|  x2|
+----+----+
|aceg|egik|
+----+----+
_test2.filter("substring(x1,1,2) == 'ka'").show()
+----+------+
| new|    x2|
+----+------+
|kalp|quarto|
+----+------+

I would have expected the filter commands to raise an error, since the column x1 no longer exists in "_test2". The weird thing is that the output shows the new name ('new').

Another example:

_test2.filter("substring(x1,1,1) == 'a'").show()

gives

+----+----+
| new|  x2|
+----+----+
|abcd|efgh|
|aceg|egik|
+----+----+

and _test2.filter("substring(x1,1,1) == 'a'").filter(F.col('x1') == 'abcd').show() gives

+----+----+
| new|  x2|
+----+----+
|abcd|efgh|
+----+----+

However, _test2.select(['x1', 'x2']).show() throws an error that 'x1' does not exist.

This is a known issue in Spark: string expressions passed to filter() can still resolve against column names from the DataFrame's lineage, even after a rename. The community decided not to fix it. See the related JIRA ticket for more information.
