How to eliminate the first characters of entries in a PySpark DataFrame column?

Question

So I have a given PySpark DataFrame, say df, that looks like this:

df.show()
+--------------------+-------------------+
|              series|          value    |
+--------------------+-------------------+
|     XXXX-AAAA      |          1        |
|     XXXX-BB        |          2        |
|     XXXX-CCCCC     |          3        |
+--------------------+-------------------+

In the series column, I would like to get rid of the XXXX- substring (i.e. the first 5 characters), which is the same for every entry of that column, thus obtaining the following DataFrame, say df2:

df2.show()
+--------------------+-------------------+
|              series|          value    |
+--------------------+-------------------+
|        AAAA        |          1        |
|        BB          |          2        |
|        CCCCC       |          3        |
+--------------------+-------------------+

I have tried to use the substr method but, as you can imagine, the challenge lies in the variability of the length of each entry in the series column. The code that I have tried is the following:

from pyspark.sql.functions import length

df2 = df.withColumn('series_truncated', df.timestamp.substr(5, length(df.series)))
df2 = df2.drop('series')
df2 = df2.withColumnRenamed("series_truncated", "series")
df.show()

Together with the associated error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-6585c018cd92> in <module>
      3 #print(length(df.series))
      4 #df_test.show()
----> 5 df = df.withColumn('series_truncated', df.timestamp.substr(15, length(df.series)))  # Eliminate the first 14 digits of the timestamp column
      6 
      7 #df_test = df.withColumn('length_series', length(df.series))

/usr/local/spark/python/pyspark/sql/column.py in substr(self, startPos, length)
    417                 .format(
    418                     startPos_t=type(startPos),
--> 419                     length_t=type(length),
    420                 ))
    421         if isinstance(startPos, int):

TypeError: startPos and length must be the same type. Got <class 'int'> and <class 'pyspark.sql.column.Column'>, respectively.

Does anyone know how this can be solved?

Many thanks in advance,

Marioanzas

Answer 1

Two ways:

import pyspark.sql.functions as F

df.withColumn('series', F.regexp_extract('series', 'XXXX-(.*)$', 1)).show()
+------+-----+
|series|value|
+------+-----+
|  AAAA|    1|
|    BB|    2|
| CCCCC|    3|
+------+-----+

Or

import pyspark.sql.functions as F

df.withColumn('series', F.expr('substring(series, 6, length(series) - 5)')).show()
+------+-----+
|series|value|
+------+-----+
|  AAAA|    1|
|    BB|    2|
| CCCCC|    3|
+------+-----+

Note that the substring function in the PySpark API does not accept Column objects as arguments, but the Spark SQL API does, so you need to use F.expr in the second method.
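One caveat about the first method: as far as I know, regexp_extract returns an empty string for rows where the pattern does not match at all. The sketch below is a plain-Python stand-in for that behaviour, not Spark itself. The helper name regexp_extract_model and the YYYY- row are made up for illustration, and Python's re module is assumed to behave like Spark's Java regex engine for this simple pattern.

```python
import re


def regexp_extract_model(value, pattern, group):
    """Plain-Python stand-in (hypothetical helper): '' when nothing matches."""
    match = re.search(pattern, value)
    if match is None:
        return ""
    # A group that did not take part in the match is also mapped to ''.
    return match.group(group) or ""


# The third row is invented: it does not carry the XXXX- prefix.
rows = ["XXXX-AAAA", "XXXX-BB", "YYYY-CCCCC"]
extracted = [regexp_extract_model(r, r"XXXX-(.*)$", 1) for r in rows]
print(extracted)
```

So if some entries might lack the prefix, the regex method would blank them out, whereas the substring method would still cut off their first five characters.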

Answer 2

from pyspark.sql.functions import expr

# substring positions are 1-based, so start at 6 to skip the 5-character prefix
df.withColumn('series', expr('substring(series, 6, length(series))'))
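Why 6 rather than 5, and why passing length(series) is harmless even though it overshoots: the sketch below models the SQL substring semantics in plain Python, under the assumption that positions are 1-based and that a length running past the end of the string is simply truncated. The helper sql_substring is hypothetical and only covers pos >= 1 and length >= 0.

```python
def sql_substring(value, pos, length):
    """Plain-Python model of substring(value, pos, length) for pos >= 1."""
    if pos < 1 or length < 0:
        raise ValueError("this sketch only covers pos >= 1 and length >= 0")
    start = pos - 1  # SQL positions are 1-based, Python slices are 0-based
    return value[start:start + length]  # slicing past the end just truncates


series = "XXXX-AAAA"
from_five = sql_substring(series, 5, len(series))        # keeps the dash
from_six = sql_substring(series, 6, len(series))         # prefix fully removed
exact_len = sql_substring(series, 6, len(series) - 5)    # same result, exact length
print(from_five, from_six, exact_len)
```

Starting at position 5, as in the question's attempt, would leave the dash in place; starting at 6 drops the whole XXXX- prefix.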

Answer 3

You could also use the right function along with length to get the rightmost length(series)-5 characters from the string:

from pyspark.sql.functions import expr

df.withColumn("series", expr("right(series, length(series)-5)")).show()

#+------+-----+
#|series|value|
#+------+-----+
#|  AAAA|    1|
#|    BB|    2|
#| CCCCC|    3|
#+------+-----+

Or simply replace XXXX- with the empty string '':

df.withColumn("series", expr("replace(series, 'XXXX-', '')")).show()

Note however that replace will get rid of all occurrences of XXXX-, not only the one at the beginning of the string.
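To make that difference concrete, here is a small plain-Python sketch (not Spark code) contrasting "remove every occurrence" with "remove only a leading occurrence". The sample string and both helper names are invented for illustration. In Spark, I would expect regexp_replace with the anchored pattern '^XXXX-' to play the role of the second helper, but treat that as an assumption to check against your Spark version.

```python
import re


def remove_all(value, prefix):
    """Mimics a plain replace: every occurrence of prefix is dropped."""
    return value.replace(prefix, "")


def remove_leading(value, prefix):
    """Drops prefix only when it sits at the very start of the string."""
    return re.sub("^" + re.escape(prefix), "", value)


# Invented sample where the prefix also appears in the middle.
sample = "XXXX-AB-XXXX-CD"
everywhere = remove_all(sample, "XXXX-")
leading_only = remove_leading(sample, "XXXX-")
print(everywhere, leading_only)
```

For the data in the question both approaches give the same result, since XXXX- only ever appears at the start.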

Answer 1
1 Accepted 2021-01-15 10:06:39

Answer 2
1 2021-01-15 10:07:47

Answer 3
1 2021-01-15 10:26:59
