How to eliminate the first characters of entries in a PySpark DataFrame column?
So I have a given PySpark DataFrame, say df, looking like below:
df.show()
+--------------------+-------------------+
|              series|              value|
+--------------------+-------------------+
|           XXXX-AAAA|                  1|
|             XXXX-BB|                  2|
|          XXXX-CCCCC|                  3|
+--------------------+-------------------+
In the series column, I would like to get rid of the XXXX- substring (i.e. its 5 characters), which is the same for every entry of that column, thus obtaining the following DataFrame, say df2:
df2.show()
+--------------------+-------------------+
|              series|              value|
+--------------------+-------------------+
|                AAAA|                  1|
|                  BB|                  2|
|               CCCCC|                  3|
+--------------------+-------------------+
I have tried to use the substr method but, as you can imagine, the challenge lies in the variability of the length of each entry in the series column. The code that I have tried is the following:
from pyspark.sql.functions import length
df2 = df.withColumn('series_truncated', df.timestamp.substr(5, length(df.series)))
df2 = df2.drop('series')
df2 = df2.withColumnRenamed("series_truncated", "series")
df.show()
Together with the associated error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-14-6585c018cd92> in <module>
3 #print(length(df.series))
4 #df_test.show()
----> 5 df = df.withColumn('series_truncated', df.timestamp.substr(15, length(df.series))) # Eliminate the first 14 digits of the timestamp column
6
7 #df_test = df.withColumn('length_series', length(df.series))
/usr/local/spark/python/pyspark/sql/column.py in substr(self, startPos, length)
417 .format(
418 startPos_t=type(startPos),
--> 419 length_t=type(length),
420 ))
421 if isinstance(startPos, int):
TypeError: startPos and length must be the same type. Got <class 'int'> and <class 'pyspark.sql.column.Column'>, respectively.
Does anyone know how this can be solved?
Many thanks in advance,
Marioanzas
Two ways:
import pyspark.sql.functions as F
df.withColumn('series', F.regexp_extract('series', 'XXXX-(.*)$', 1)).show()
+------+-----+
|series|value|
+------+-----+
| AAAA| 1|
| BB| 2|
| CCCCC| 3|
+------+-----+
Or
import pyspark.sql.functions as F
df.withColumn('series', F.expr('substring(series, 6, length(series) - 5)')).show()
+------+-----+
|series|value|
+------+-----+
| AAAA| 1|
| BB| 2|
| CCCCC| 3|
+------+-----+
Note that the substring function in the PySpark API does not accept Column objects as arguments, but the Spark SQL API does, so you need to use F.expr in the second method.
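The start position 6 in the substring call above follows from SQL-style substring positions being 1-based: the prefix XXXX- occupies positions 1 to 5, so the part to keep starts at 6. The following is only a plain-Python sketch of that indexing, not Spark code; sql_substring is a hypothetical helper written for this illustration, and it models positive start positions only.

```python
def sql_substring(s: str, pos: int, length: int) -> str:
    """Plain-Python model of a 1-based substring (hypothetical helper, not a Spark API)."""
    if pos < 1:
        raise ValueError("this sketch only models positions >= 1")
    start = pos - 1                  # convert the 1-based position to a 0-based index
    return s[start:start + max(length, 0)]


series = ["XXXX-AAAA", "XXXX-BB", "XXXX-CCCCC"]
prefix_len = len("XXXX-")            # 5 characters to drop

# Start one past the prefix and keep exactly the remaining characters.
exact = [sql_substring(s, prefix_len + 1, len(s) - prefix_len) for s in series]

# Asking for more characters than remain simply stops at the end of the string.
generous = [sql_substring(s, prefix_len + 1, len(s)) for s in series]

# Starting at 5 instead of 6 keeps the trailing "-" of the prefix.
off_by_one = [sql_substring(s, prefix_len, len(s)) for s in series]

print(exact)
print(generous)
print(off_by_one)
```

Because an over-long length is harmless, passing length(series) instead of length(series) - 5 as the third argument gives the same result for this data.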
from pyspark.sql.functions import substring, length, col, expr
df.withColumn('series',expr('substring(series, 6, length(series))'))
You could also use the right function along with length to get the rightmost length(series)-5 characters from the string:
df.withColumn("series", expr("right(series, length(series)-5)")).show()
#+------+-----+
#|series|value|
#+------+-----+
#| AAAA| 1|
#| BB| 2|
#| CCCCC| 3|
#+------+-----+
Or simply replace XXXX- by an empty string '':
df.withColumn("series", expr("replace(series, 'XXXX-', '')")).show()
Note however that replace will get rid of all XXXX- occurrences, not only the one at the beginning of the string.
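To make that caveat concrete, here is a small plain-Python comparison (not Spark code) of replacing every occurrence versus stripping only a leading one with an anchored regular expression. The sample strings and the helper names strip_everywhere and strip_leading are made up for illustration. In Spark, the anchored variant would presumably be written with regexp_replace and a pattern such as '^XXXX-', but that is an assumption not tested here.

```python
import re

PREFIX = "XXXX-"


def strip_everywhere(s: str) -> str:
    """Remove every occurrence of the prefix, wherever it appears."""
    return s.replace(PREFIX, "")


def strip_leading(s: str) -> str:
    """Remove the prefix only when it sits at the very start of the string."""
    return re.sub("^" + re.escape(PREFIX), "", s)


# Hypothetical values: the last two also contain the prefix text in the middle.
samples = ["XXXX-AAAA", "XXXX-BBXXXX-B", "CCXXXX-C"]

everywhere = [strip_everywhere(s) for s in samples]
leading = [strip_leading(s) for s in samples]

print(everywhere)
print(leading)
```

For the data in the question both approaches give the same result, since XXXX- only ever appears at the start; the difference matters only if the prefix text can recur inside a value.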