
Replace SubString of values in a dataframe in Pyspark

I have a DataFrame with some attributes, and it looks like this:

+-------+-------+
| Atr1  | Atr2  |
+-------+-------+
|  3,06 |  4,08 |
|  3,03 |  4,08 |
|  3,06 |  4,08 |
|  3,06 |  4,08 |
|  3,06 |  4,08 |
|  ...  |  ...  |
+-------+-------+

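As you can see, the values of Atr1 and Atr2 in the DataFrame are numbers that contain a ',' character. This is because I loaded the data from a CSV file in which the decimal separator of the DoubleType numbers is ','.

When I load the data into the DataFrame, the values are read as strings, so I cast these attributes from String to DoubleType: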

df = df.withColumn("Atr1", df["Atr1"].cast(DoubleType()))
df = df.withColumn("Atr2", df["Atr2"].cast(DoubleType()))

But when I do this, the values are converted to null:

+-------+-------+
| Atr1  | Atr2  |
+-------+-------+
|  null |  null |
|  null |  null |
|  null |  null |
|  null |  null |
|  null |  null |
|  ...  |  ...  |
+-------+-------+

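I guess the reason is that DoubleType decimals must be separated by '.' rather than ','. However, I am not able to edit the CSV file, so I want to replace the ',' characters in the DataFrame with '.' and then apply the cast to DoubleType.

How can I do this?

You can solve this simply with a user-defined function.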

from pyspark.sql import Row
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType, StringType

data = [Row(Atr1="3,06", Atr2="4,08"),
        Row(Atr1="3,06", Atr2="4,08"),
        Row(Atr1="3,06", Atr2="4,08")]

df = sqlContext.createDataFrame(data)

# Create a user-defined function that replaces ',' with '.'
comma_to_dot = udf(lambda x: x.replace(",", "."), StringType())

# The outer parentheses let the chained calls span several lines
out = (df
   .withColumn("Atr1", comma_to_dot(col("Atr1")).cast(DoubleType()))
   .withColumn("Atr2", comma_to_dot(col("Atr2")).cast(DoubleType())))

##############################################################
out.show()

+----+----+
|Atr1|Atr2|
+----+----+
|3.06|4.08|
|3.06|4.08|
|3.06|4.08|
+----+----+

##############################################################

out.printSchema()

root
 |-- Atr1: double (nullable = true)
 |-- Atr2: double (nullable = true)

Edit: a more compact solution, as suggested in the comments.

from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType

# Replace ',' with '.' and convert to float in a single step
to_double = udf(lambda x: float(x.replace(",", ".")), DoubleType())

# The outer parentheses let the chained calls span several lines
out = (df
    .withColumn("Atr1", to_double(col("Atr1")))
    .withColumn("Atr2", to_double(col("Atr2"))))
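One caveat with the lambda passed to the UDF above: Spark hands null values to a Python UDF as None, and None has no replace method, so a single missing value would make the job fail. Below is a minimal sketch of a null-safe variant. The names parse_comma_decimal and make_str_to_double_udf are hypothetical, and returning None for unparseable text is an assumption about the desired behaviour, not something the original answer specifies.

```python
def parse_comma_decimal(value):
    """Turn a string such as '3,06' into 3.06.

    Returns None for missing or unparseable input (an assumed policy).
    """
    if value is None:
        return None
    try:
        # strip() drops padding; replace() swaps the decimal comma for a dot
        return float(value.strip().replace(",", "."))
    except ValueError:
        return None


def make_str_to_double_udf():
    """Wrap parse_comma_decimal as a Spark UDF that returns a double."""
    # pyspark is imported here so parse_comma_decimal stays usable on its own
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    return udf(parse_comma_decimal, DoubleType())


# Usage, assuming df is the DataFrame from the question:
# str_to_double = make_str_to_double_udf()
# out = df.withColumn("Atr1", str_to_double(df["Atr1"]))
```

Note that a value with a thousands separator, such as "1.234,56", becomes "1.234.56" after the replacement and is therefore returned as None by this sketch.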

Suppose you have:

sdf.show()
+-------+-------+
|   Atr1|   Atr2|
+-------+-------+
|  3,06 |  4,08 |
|  3,03 |  4,08 |
|  3,06 |  4,08 |
|  3,06 |  4,08 |
|  3,06 |  4,08 |
+-------+-------+

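Then the following code will produce the desired result:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

strToDouble = udf(lambda x: float(x.replace(",", ".")), DoubleType())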

sdf = sdf.withColumn("Atr1", strToDouble(sdf['Atr1']))
sdf = sdf.withColumn("Atr2", strToDouble(sdf['Atr2']))

sdf.show()
+----+----+
|Atr1|Atr2|
+----+----+
|3.06|4.08|
|3.03|4.08|
|3.06|4.08|
|3.06|4.08|
|3.06|4.08|
+----+----+

You can also do it using only SQL. (The example below is written in Scala.)

val df = sc.parallelize(Array(
      ("3,06", "4,08"),
      ("3,06", "4,08"),
      ("3,06", "4,08"),
      ("3,06", "4,08"),
      ("3,06", "4,08"),
      ("3,06", "4,08"),
      ("3,06", "4,08"),
      ("3,06", "4,08")
      )).toDF("a", "b")

df.registerTempTable("test")

val doubleDF = sqlContext.sql("select cast(trim(regexp_replace( a , ',' , '.')) as double) as a from test ")

doubleDF.show
+----+
|   a|
+----+
|3.06|
|3.06|
|3.06|
|3.06|
|3.06|
|3.06|
|3.06|
|3.06|
+----+

doubleDF.printSchema
root
 |-- a: double (nullable = true)
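For PySpark users, the same SQL expression can probably be applied without registering a temporary table, because DataFrame.selectExpr accepts SQL expression strings. The following is a minimal sketch under that assumption. The helper names comma_to_double_expr and select_as_doubles are hypothetical, and the backticks around the column name are added here to tolerate unusual column names.

```python
def comma_to_double_expr(name):
    """Build the cast/trim/regexp_replace expression for one column name."""
    return (
        "cast(trim(regexp_replace(`{0}`, ',', '.')) as double) as `{0}`"
        .format(name)
    )


def select_as_doubles(df, names):
    """Select the given columns, each converted to double.

    df is expected to be a Spark DataFrame; only its selectExpr method is used.
    """
    return df.selectExpr(*[comma_to_double_expr(n) for n in names])


# Usage, assuming df has string columns "a" and "b":
# doubleDF = select_as_doubles(df, ["a", "b"])
```

As with the SQL query above, the result contains only the selected columns; any other columns of the DataFrame are dropped.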

Is it possible to pass the column name as a parameter to the col() function in the example code? Something like this:

# Create a user-defined function that replaces ',' with '.'
comma_to_dot = udf(lambda x: x.replace(",", "."), StringType())

col_name1 = "Atr1"
col_name2 = "Atr2"

# The outer parentheses let the chained calls span several lines
out = (df
   .withColumn(col_name1, comma_to_dot(col(col_name1)).cast(DoubleType()))
   .withColumn(col_name2, comma_to_dot(col(col_name2)).cast(DoubleType())))
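This should work: col() takes the column name as an ordinary Python string, so a variable can be passed just as well as a literal. That also makes it possible to loop over a list of names instead of repeating withColumn by hand. Here is a minimal sketch of that idea. The helpers with_columns_converted and comma_to_double are hypothetical names, and the sketch uses the built-in regexp_replace function in place of the UDF.

```python
from functools import reduce


def with_columns_converted(df, names, convert):
    """Replace each column in names with convert(name), one withColumn per name."""
    return reduce(
        lambda acc, name: acc.withColumn(name, convert(name)),
        names,
        df,
    )


def comma_to_double(name):
    """Column expression that swaps ',' for '.' and casts the result to double."""
    # pyspark is imported here so with_columns_converted has no Spark dependency
    from pyspark.sql.functions import col, regexp_replace

    return regexp_replace(col(name), ",", ".").cast("double")


# Usage, assuming df is the DataFrame from the question:
# out = with_columns_converted(df, ["Atr1", "Atr2"], comma_to_double)
```

Keeping the conversion as a separate function means the list of column names can come from anywhere, for example from df.columns.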
