I have a PySpark DataFrame, say df, that looks like this:
df.show()
+--------------------+-------------------+
| series| value |
+--------------------+-------------------+
| XXXX-AAAA | 1 |
| XXXX-BB | 2 |
| XXXX-CCCCC | 3 |
+--------------------+-------------------+
In the series column, I would like to get rid of the XXXX- prefix (i.e. the first 5 characters), which is the same for every entry of that column, thus obtaining the following DataFrame, say df2:
df2.show()
+--------------------+-------------------+
| series| value |
+--------------------+-------------------+
| AAAA | 1 |
| BB | 2 |
| CCCCC | 3 |
+--------------------+-------------------+
I have tried to use the substr method but, as you can imagine, the challenge lies in the variability of the length of each entry in the series column. The code I have tried is the following:
from pyspark.sql.functions import length
df2 = df.withColumn('series_truncated', df.series.substr(6, length(df.series)))
df2 = df2.drop('series')
df2 = df2.withColumnRenamed('series_truncated', 'series')
df2.show()
Together with the associated error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-14-6585c018cd92> in <module>
----> 1 df2 = df.withColumn('series_truncated', df.series.substr(6, length(df.series)))
/usr/local/spark/python/pyspark/sql/column.py in substr(self, startPos, length)
417 .format(
418 startPos_t=type(startPos),
--> 419 length_t=type(length),
420 ))
421 if isinstance(startPos, int):
TypeError: startPos and length must be the same type. Got <class 'int'> and <class 'pyspark.sql.column.Column'>, respectively.
Does anyone know how this can be solved?
Many thanks in advance,
Marioanzas
Two ways:
import pyspark.sql.functions as F
df.withColumn('series', F.regexp_extract('series', 'XXXX-(.*)$', 1)).show()
+------+-----+
|series|value|
+------+-----+
| AAAA| 1|
| BB| 2|
| CCCCC| 3|
+------+-----+
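The capture-group logic behind that regexp_extract call can be sanity-checked in plain Python with the stdlib re module (this is just an illustration of the pattern, not PySpark itself):

```python
import re

# The same pattern passed to F.regexp_extract: capture everything
# after the literal "XXXX-" prefix in group 1.
pattern = re.compile(r'XXXX-(.*)$')

for s in ['XXXX-AAAA', 'XXXX-BB', 'XXXX-CCCCC']:
    m = pattern.search(s)
    print(m.group(1))   # AAAA, BB, CCCCC
```

Like regexp_extract, the capture group keeps only the part after the prefix; if the pattern does not match a row, regexp_extract returns an empty string rather than raising.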
Or
import pyspark.sql.functions as F
df.withColumn('series', F.expr('substring(series, 6, length(series) - 5)')).show()
+------+-----+
|series|value|
+------+-----+
| AAAA| 1|
| BB| 2|
| CCCCC| 3|
+------+-----+
Note that the substring function in the PySpark API does not accept column objects as arguments, but the Spark SQL API does, so you need to use F.expr in the second method.
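The index arithmetic is easy to get wrong because Spark SQL's substring is 1-based. A quick plain-Python sketch of the equivalent 0-based slice:

```python
# Spark SQL's substring(str, pos, len) is 1-based: substring(series, 6, ...)
# starts at the 6th character, i.e. right after the 5-character "XXXX-"
# prefix. In 0-based Python slicing that is simply s[5:].
def strip_prefix(s: str, prefix_len: int = 5) -> str:
    return s[prefix_len:]

print(strip_prefix('XXXX-AAAA'))   # AAAA
print(strip_prefix('XXXX-BB'))     # BB
```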
from pyspark.sql.functions import substring, length, col, expr
df.withColumn('series',expr('substring(series, 6, length(series))'))
You could also use the right function along with length to get the rightmost length(series)-5 characters from the string:
df.withColumn("series", expr("right(series, length(series)-5)")).show()
#+------+-----+
#|series|value|
#+------+-----+
#| AAAA| 1|
#| BB| 2|
#| CCCCC| 3|
#+------+-----+
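To see why right(series, length(series)-5) works, here is a small stdlib-only model of the right function (a sketch, not Spark's implementation):

```python
# Spark SQL's right(str, n) returns the last n characters of str.
# With n = length(series) - 5, that is everything after the
# 5-character "XXXX-" prefix, i.e. the same result as s[5:].
def right(s: str, n: int) -> str:
    return s[-n:] if n > 0 else ''

s = 'XXXX-CCCCC'
print(right(s, len(s) - 5))   # CCCCC
```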
Or simply replace XXXX- with an empty string '':
df.withColumn("series", expr("replace(series, 'XXXX-', '')")).show()
Note however that replace will remove all occurrences of XXXX-, not only at the beginning of the string.
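The caveat is easy to demonstrate with plain Python string methods on a hypothetical value that happens to contain the prefix twice; anchoring a regex at the start of the string avoids the problem:

```python
import re

s = 'XXXX-AAAA-XXXX-B'   # hypothetical value containing "XXXX-" twice

# str.replace (like Spark SQL's replace) removes every occurrence:
print(s.replace('XXXX-', ''))     # AAAA-B

# Anchoring the pattern with ^ strips only a leading prefix:
print(re.sub(r'^XXXX-', '', s))   # AAAA-XXXX-B
```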