
Replace a substring of values in a DataFrame in PySpark

I have a DataFrame with some attributes, and it looks like this:

+-------+-------+
| Atr1  | Atr2  |
+-------+-------+
|  3,06 |  4,08 |
|  3,03 |  4,08 |
|  3,06 |  4,08 |
|  3,06 |  4,08 |
|  3,06 |  4,08 |
|  ...  |  ...  |
+-------+-------+

As you can see, the values of Atr1 and Atr2 in the DataFrame are numbers that contain a ',' character. This is because I loaded the data from a CSV file in which the decimal separator of the DoubleType numbers was ','.

When I load the data into a DataFrame, the values are read as String, so I applied a cast from String to DoubleType to those attributes like this:

df = df.withColumn("Atr1", df["Atr1"].cast(DoubleType()))
df = df.withColumn("Atr2", df["Atr2"].cast(DoubleType()))

But when I do it, the values are converted to null:

+-------+-------+
| Atr1  | Atr2  |
+-------+-------+
|  null |  null |
|  null |  null |
|  null |  null |
|  null |  null |
|  null |  null |
|  ...  |  ...  |
+-------+-------+

I guess the reason is that DoubleType decimals must be separated by '.' instead of ','. But I don't have the option of editing the CSV file, so I want to replace the ',' characters in the DataFrame with '.' and then apply the cast to DoubleType.

How could I do it?
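For reference, the replacement can also be done without a UDF, using Spark's built-in regexp_replace column function before casting; a minimal sketch, assuming the DataFrame df from the question:

from pyspark.sql.functions import regexp_replace
from pyspark.sql.types import DoubleType

# Replace the decimal ',' with '.' and cast the result to double in one pass
df = df.withColumn("Atr1", regexp_replace("Atr1", ",", ".").cast(DoubleType()))
df = df.withColumn("Atr2", regexp_replace("Atr2", ",", ".").cast(DoubleType()))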

You can solve this problem by using a user-defined function (UDF).

from pyspark.sql import Row
from pyspark.sql.functions import UserDefinedFunction, col
from pyspark.sql.types import StringType, DoubleType

data = [Row(Atr1="3,06", Atr2="4,08"),
        Row(Atr1="3,06", Atr2="4,08"),
        Row(Atr1="3,06", Atr2="4,08")]

df = sqlContext.createDataFrame(data)

# Create a user-defined function to replace ',' with '.'
udf = UserDefinedFunction(lambda x: x.replace(",","."), StringType())

out = df \
    .withColumn("Atr1", udf(col("Atr1")).cast(DoubleType())) \
    .withColumn("Atr2", udf(col("Atr2")).cast(DoubleType()))

##############################################################
out.show()

+----+----+
|Atr1|Atr2|
+----+----+
|3.06|4.08|
|3.06|4.08|
|3.06|4.08|
+----+----+

##############################################################

out.printSchema()

root
 |-- Atr1: double (nullable = true)
 |-- Atr2: double (nullable = true)

EDIT: A more compact solution, following a suggestion from the comments.

from pyspark.sql.functions import UserDefinedFunction, col
from pyspark.sql.types import DoubleType

# Do the replacement and the conversion to float inside the UDF itself
udf = UserDefinedFunction(lambda x: float(x.replace(",",".")), DoubleType())

out = df \
    .withColumn("Atr1", udf(col("Atr1"))) \
    .withColumn("Atr2", udf(col("Atr2")))

Let's assume you have:

sdf.show()
+-------+-------+
|   Atr1|   Atr2|
+-------+-------+
|  3,06 |  4,08 |
|  3,03 |  4,08 |
|  3,06 |  4,08 |
|  3,06 |  4,08 |
|  3,06 |  4,08 |
+-------+-------+

Then the following code will produce the desired result:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

strToDouble = udf(lambda x: float(x.replace(",",".")), DoubleType())

sdf = sdf.withColumn("Atr1", strToDouble(sdf['Atr1']))
sdf = sdf.withColumn("Atr2", strToDouble(sdf['Atr2']))

sdf.show()
+----+----+
|Atr1|Atr2|
+----+----+
|3.06|4.08|
|3.03|4.08|
|3.06|4.08|
|3.06|4.08|
|3.06|4.08|
+----+----+

You can also do it with plain SQL (example in Scala):

val df = sc.parallelize(Array(
      ("3,06", "4,08"),
      ("3,06", "4,08"),
      ("3,06", "4,08"),
      ("3,06", "4,08"),
      ("3,06", "4,08"),
      ("3,06", "4,08"),
      ("3,06", "4,08"),
      ("3,06", "4,08")
      )).toDF("a", "b")

df.registerTempTable("test")

val doubleDF = sqlContext.sql("select cast(trim(regexp_replace( a , ',' , '.')) as double) as a from test ")

doubleDF.show
+----+
|   a|
+----+
|3.06|
|3.06|
|3.06|
|3.06|
|3.06|
|3.06|
|3.06|
|3.06|
+----+

doubleDF.printSchema
root
 |-- a: double (nullable = true)
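
The same SQL approach works from PySpark too; a minimal sketch, assuming Spark 2.0+ with a SparkSession named spark and a DataFrame df with the same string column a as above (on older versions, use df.registerTempTable and sqlContext.sql):

# Register the DataFrame as a temporary view and fix the column in SQL
df.createOrReplaceTempView("test")
doubleDF = spark.sql(
    "select cast(trim(regexp_replace(a, ',', '.')) as double) as a from test")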

Is it possible to pass the column name as a parameter to the col() function in your sample code? Something like this:

# Create a user-defined function to replace ',' with '.'
udf = UserDefinedFunction(lambda x: x.replace(",","."), StringType())

col_name1 = "Atr1"
col_name2 = "Atr2"

out = df \
    .withColumn(col_name1, udf(col(col_name1)).cast(DoubleType())) \
    .withColumn(col_name2, udf(col(col_name2)).cast(DoubleType()))
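
Yes, col() accepts any string, so the snippet above works as written. Because column names are plain strings, the same pattern also generalizes to a loop; a sketch, assuming df and the udf defined above:

for col_name in ["Atr1", "Atr2"]:
    df = df.withColumn(col_name, udf(col(col_name)).cast(DoubleType()))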
