
How to eliminate the first characters of entries in a PySpark DataFrame column?

So I have a given PySpark DataFrame, say df, that looks like below:

df.show()
+--------------------+-------------------+
|              series|          value    |
+--------------------+-------------------+
|     XXXX-AAAA      |          1        |
|     XXXX-BB        |          2        |
|     XXXX-CCCCC     |          3        |
+--------------------+-------------------+

In the series column, I would like to get rid of the XXXX- substring (i.e. the first 5 characters), which is the same for every single entry of that column, thus obtaining the following DataFrame, say df2:

df2.show()
+--------------------+-------------------+
|              series|          value    |
+--------------------+-------------------+
|        AAAA        |          1        |
|        BB          |          2        |
|        CCCCC       |          3        |
+--------------------+-------------------+

I have tried to use the substr method but, as you can imagine, the challenge lies in the variable length of each entry in the series column. The code that I have tried is the following:

from pyspark.sql.functions import length

df2 = df.withColumn('series_truncated', df.series.substr(5, length(df.series)))
df2 = df2.drop('series')
df2 = df2.withColumnRenamed("series_truncated", "series")
df2.show()

Together with the associated error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-6585c018cd92> in <module>
      3 #print(length(df.series))
      4 #df_test.show()
----> 5 df = df.withColumn('series_truncated', df.timestamp.substr(15, length(df.series)))  # Eliminate the first 14 digits of the timestamp column
      6 
      7 #df_test = df.withColumn('length_series', length(df.series))

/usr/local/spark/python/pyspark/sql/column.py in substr(self, startPos, length)
    417                 .format(
    418                     startPos_t=type(startPos),
--> 419                     length_t=type(length),
    420                 ))
    421         if isinstance(startPos, int):

TypeError: startPos and length must be the same type. Got <class 'int'> and <class 'pyspark.sql.column.Column'>, respectively.

Does anyone know how this can be solved?

Many thanks in advance,

Marioanzas

Two ways:

import pyspark.sql.functions as F

df.withColumn('series', F.regexp_extract('series', 'XXXX-(.*)$', 1)).show()
+------+-----+
|series|value|
+------+-----+
|  AAAA|    1|
|    BB|    2|
| CCCCC|    3|
+------+-----+

Or

import pyspark.sql.functions as F

df.withColumn('series', F.expr('substring(series, 6, length(series) - 5)')).show()
+------+-----+
|series|value|
+------+-----+
|  AAAA|    1|
|    BB|    2|
| CCCCC|    3|
+------+-----+

Note that the substring function in the PySpark column API does not accept Column objects as arguments, but the Spark SQL substring function does, so you need to wrap it in F.expr in the second method.

from pyspark.sql.functions import expr
df.withColumn('series', expr('substring(series, 6, length(series))'))

Note that passing length(series) as the length argument works even though it overshoots: substring simply stops at the end of the string.

You could also use the right function along with length to get the rightmost length(series)-5 characters from the string:

df.withColumn("series", expr("right(series, length(series)-5)")).show()

#+------+-----+
#|series|value|
#+------+-----+
#|  AAAA|    1|
#|    BB|    2|
#| CCCCC|    3|
#+------+-----+

Or simply replace XXXX- with the empty string '':

df.withColumn("series", expr("replace(series, 'XXXX-', '')")).show()

Note however that replace will remove all XXXX- occurrences, not only the one at the beginning of the string.
