How to eliminate the first characters of entries in a PySpark DataFrame column?

I have a PySpark DataFrame, say df, that looks like this:

df.show()
+--------------------+-------------------+
|              series|          value    |
+--------------------+-------------------+
|     XXXX-AAAA      |          1        |
|     XXXX-BB        |          2        |
|     XXXX-CCCCC     |          3        |
+--------------------+-------------------+

In the series column, I would like to get rid of the XXXX- substring (i.e. the first 5 characters), which is the same for every entry of that column, thus obtaining the following DataFrame, say df2:

df2.show()
+--------------------+-------------------+
|              series|          value    |
+--------------------+-------------------+
|        AAAA        |          1        |
|        BB          |          2        |
|        CCCCC       |          3        |
+--------------------+-------------------+

I have tried to use the substr method but, as you can imagine, the challenge lies in the variable length of the entries in the series column. The code that I have tried is the following:

from pyspark.sql.functions import length

df2 = df.withColumn('series_truncated', df.timestamp.substr(5, length(df.series)))
df2 = df2.drop('series')
df2 = df2.withColumnRenamed("series_truncated", "series")
df.show()

Together with the associated error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-6585c018cd92> in <module>
      3 #print(length(df.series))
      4 #df_test.show()
----> 5 df = df.withColumn('series_truncated', df.timestamp.substr(15, length(df.series)))  # Eliminate the first 14 digits of the timestamp column
      6 
      7 #df_test = df.withColumn('length_series', length(df.series))

/usr/local/spark/python/pyspark/sql/column.py in substr(self, startPos, length)
    417                 .format(
    418                     startPos_t=type(startPos),
--> 419                     length_t=type(length),
    420                 ))
    421         if isinstance(startPos, int):

TypeError: startPos and length must be the same type. Got <class 'int'> and <class 'pyspark.sql.column.Column'>, respectively.

Does anyone know how this can be solved?

Many thanks in advance,

Marioanzas

Two ways:

import pyspark.sql.functions as F

df.withColumn('series', F.regexp_extract('series', 'XXXX-(.*)$', 1)).show()
+------+-----+
|series|value|
+------+-----+
|  AAAA|    1|
|    BB|    2|
| CCCCC|    3|
+------+-----+

Or

import pyspark.sql.functions as F

df.withColumn('series', F.expr('substring(series, 6, length(series) - 5)')).show()
+------+-----+
|series|value|
+------+-----+
|  AAAA|    1|
|    BB|    2|
| CCCCC|    3|
+------+-----+

Note that the substring function in the PySpark API does not accept Column objects as arguments, but the Spark SQL API does, so you need to use F.expr in the second method.

from pyspark.sql.functions import expr

# substring is 1-based; a length longer than the remainder just returns the rest
df.withColumn('series', expr('substring(series, 6, length(series))')).show()
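
The 6 in the expressions above is the prefix length plus one, because Spark SQL's substring counts positions from 1. As a small sketch of how to avoid hard-coding those numbers, the hypothetical helper below (strip_prefix_expr is not a PySpark function) builds the SQL expression string from the prefix itself. It is plain Python, so it runs without Spark.

```python
def strip_prefix_expr(column: str, prefix: str) -> str:
    """Build a Spark SQL expression string that drops the first len(prefix) characters.

    Hypothetical helper for illustration; it only formats a string and
    does not check that the values actually start with the prefix.
    """
    n = len(prefix)
    # Start at position n + 1 (1-based) and keep the remaining length - n characters.
    return f"substring({column}, {n + 1}, length({column}) - {n})"


expr_string = strip_prefix_expr("series", "XXXX-")
print(expr_string)
```

The resulting string can then be passed to expr, for example df.withColumn('series', expr(expr_string)).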

You could also use the right function along with length to get the rightmost length(series)-5 characters from the string:

df.withColumn("series", expr("right(series, length(series)-5)")).show()

#+------+-----+
#|series|value|
#+------+-----+
#|  AAAA|    1|
#|    BB|    2|
#| CCCCC|    3|
#+------+-----+

Or simply replace XXXX- with an empty string '':

df.withColumn("series", expr("replace(series, 'XXXX-', '')")).show()

Note, however, that replace will remove every occurrence of XXXX-, not only the one at the beginning of the string.
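
To make that caveat concrete, here is a plain-Python sketch (no Spark required) comparing a replace-everywhere with a replacement anchored to the start of the string. The sample value is hypothetical and chosen so that the prefix also appears in the middle.

```python
import re

PREFIX = "XXXX-"
sample = "XXXX-AB-XXXX-CD"  # hypothetical value with the prefix repeated mid-string

# Plain replace removes every occurrence of the prefix.
replaced_everywhere = sample.replace(PREFIX, "")

# Anchoring the pattern with ^ removes only the leading occurrence.
replaced_at_start = re.sub("^" + re.escape(PREFIX), "", sample)

print(replaced_everywhere)
print(replaced_at_start)
```

In PySpark, the anchored variant would presumably be written with regexp_replace from pyspark.sql.functions, for example regexp_replace('series', '^XXXX-', ''), which leaves later occurrences untouched.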
