Removing spaces from DataFrame column values in Spark

Question

I have a DataFrame (business_df) with the following schema:

|-- business_id: string (nullable = true)
|-- categories: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- city: string (nullable = true)
|-- full_address: string (nullable = true)
|-- hours: struct (nullable = true)
|-- name: string (nullable = true)

I want to create a new DataFrame (new_df) in which the values of the 'name' column contain no spaces.

My code is:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

udf = UserDefinedFunction(lambda x: x.replace(' ', ''), StringType())
new_df = business_df.select(*[udf(column).alias(name) if column == name else column for column in business_df.columns])
new_df.registerTempTable("vegas")
new_df.printSchema()
vegas_business = sqlContext.sql("SELECT stars, name from vegas limit 10").collect()

I keep getting this error:

NameError: global name 'replace' is not defined

What is wrong with this code?

Answer 1

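Although the problem you describe cannot be reproduced with the code provided, using Python UDFs for a simple task like this is inefficient. If you just want to remove spaces from text, use regexp_replace: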

from pyspark.sql.functions import regexp_replace, col

df = sc.parallelize([
    (1, "foo bar"), (2, "foobar "), (3, "   ")
]).toDF(["k", "v"])

df.select(regexp_replace(col("v"), " ", ""))

If you want to normalize empty (whitespace-only) values, use trim:

from pyspark.sql.functions import trim

df.select(trim(col("v")))

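If you want to keep leading/trailing spaces and only blank out values that consist entirely of whitespace, you can adjust regexp_replace:

df.select(regexp_replace(col("v"), r"^\s+$", ""))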
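To see how the three variants above differ, here is a plain-Python sketch that mimics them with the standard `re` module on the same sample values. It is only an analogy, not Spark code: `remove_spaces`, `trim_spaces`, and `blank_if_only_whitespace` are hypothetical helper names, and it assumes Java and Python regular expressions behave the same for these simple patterns.

```python
import re

# Same sample values as the "v" column in the answer.
samples = ["foo bar", "foobar ", "   "]

def remove_spaces(s):
    # Analog of regexp_replace(col("v"), " ", ""): drop every space character.
    return re.sub(" ", "", s)

def trim_spaces(s):
    # Analog of trim(col("v")): strip leading and trailing spaces only.
    return s.strip(" ")

def blank_if_only_whitespace(s):
    # Analog of regexp_replace(col("v"), r"^\s+$", ""): only strings made up
    # entirely of whitespace are emptied; everything else is left untouched.
    return re.sub(r"^\s+$", "", s)

removed = [remove_spaces(s) for s in samples]
trimmed = [trim_spaces(s) for s in samples]
blanked = [blank_if_only_whitespace(s) for s in samples]
```

Only the first variant removes the space inside "foo bar"; the other two leave inner spaces alone.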

Answer 2

As @zero323 said, you have probably shadowed replace somewhere in your code. I tested your code and it works fine.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

df = sqlContext.createDataFrame([("aaa 111",), ("bbb 222",), ("ccc 333",)], ["names"])
spaceDeleteUDF = udf(lambda s: s.replace(" ", ""), StringType())
df.withColumn("names", spaceDeleteUDF("names")).show()

#+------+
#| names|
#+------+
#|aaa111|
#|bbb222|
#|ccc333|
#+------+
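One caveat with the UDF route: the `name` column in the question is nullable, and a Python UDF receives `None` for null values, so calling `s.replace` on such a row raises an AttributeError. A minimal null-safe sketch of the function body follows; `remove_spaces_or_none` is a hypothetical name, and the sample list is made up for illustration.

```python
def remove_spaces_or_none(s):
    # Pass nulls through unchanged instead of calling .replace on None.
    if s is None:
        return None
    return s.replace(" ", "")

# Plain-Python check of the behaviour, including a null value.
cleaned = [remove_spaces_or_none(v) for v in ["aaa 111", None, "ccc 333"]]
```

With PySpark available, this function could presumably be wrapped as `udf(remove_spaces_or_none, StringType())` in place of the lambda. The built-in `regexp_replace` already returns null for null input, so the guard is only needed when a UDF is used.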

Answer 3

Here is a function that removes all whitespace from a string:

import pyspark.sql.functions as F

def remove_all_whitespace(col):
    return F.regexp_replace(col, "\\s+", "")

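You can use the function like this:

import quinn
from pyspark.sql.functions import col

# source_df is assumed to be an existing DataFrame with a "words" column
actual_df = source_df.withColumn(
    "words_without_whitespace",
    quinn.remove_all_whitespace(col("words"))
)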

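The remove_all_whitespace function is defined in the quinn library. quinn also defines the single_space and anti_trim methods for managing whitespace. PySpark itself defines the ltrim, rtrim, and trim methods for managing whitespace.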
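Note that the pattern `\\s+` used in remove_all_whitespace matches any whitespace, not just the space character, so it also strips tabs and newlines. A plain-Python sketch of the difference, using `re` as a stand-in for `regexp_replace` (an assumption that holds for these simple patterns); the sample string is made up for illustration.

```python
import re

text = "foo \tbar\nbaz "

# Pattern " " removes only the space character; the tab and newline survive.
only_spaces_removed = re.sub(" ", "", text)

# Pattern r"\s+" removes every run of whitespace, whatever its kind.
all_whitespace_removed = re.sub(r"\s+", "", text)
```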

Answer 4

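I think the regexp_replace solution is too slow, even for a small amount of data! So I tried to find another way, and I think I found one!

It's not pretty and a bit naive, but it's fast! What do you think?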

from pyspark.sql import Row
from pyspark.sql.functions import ltrim, rtrim
from pyspark.sql.types import StructType, StructField, StringType

def normalizeSpace(df, colName):

    # Left and right trim
    df = df.withColumn(colName, ltrim(df[colName]))
    df = df.withColumn(colName, rtrim(df[colName]))

    # This is faster than regexp_replace function!
    def normalize(row, colName):
        data = row.asDict()
        text = data[colName]
        Words = []
        word = ''

        for char in text:
            if char != ' ':
                word += char
            elif word == '':
                # Skip repeated spaces
                continue
            else:
                Words.append(word)
                word = ''

        # Keep the last word: after trimming, no space follows it
        if word != '':
            Words.append(word)

        if len(Words) > 0:
            data[colName] = ' '.join(Words)

        return Row(**data)

    # Rebuild the DataFrame from the normalized rows
    df = df.rdd.map(lambda row: normalize(row, colName)).toDF()
    return df

schema = StructType([StructField('name', StringType())])
rows = [Row(name='  dvd player samsung   hdmi hdmi 160W reais    de potencia bivolt   ')]
df = spark.createDataFrame(rows, schema)
df = normalizeSpace(df, 'name')
df.show(df.count(), False)

This prints:

+----------------------------------------------------------+
|name                                                      |
+----------------------------------------------------------+
|dvd player samsung hdmi hdmi 160W reais de potencia bivolt|
+----------------------------------------------------------+
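For comparison, the same kind of normalization (trim both ends, collapse inner runs to a single space) can be sketched on a plain Python string with `str.split` and `str.join`. This is not the answer author's code: `normalize_space` is a hypothetical helper, and unlike the character loop above it treats tabs and newlines as separators too.

```python
def normalize_space(text):
    # split() with no argument splits on any run of whitespace and drops
    # empty pieces, so joining with one space trims and collapses in one step.
    return " ".join(text.split())

sample = "  dvd player samsung   hdmi hdmi 160W reais    de potencia bivolt   "
normalized = normalize_space(sample)
```

In Spark itself, a built-in expression along the lines of trim(regexp_replace(col, " +", " ")) expresses the same idea without shipping rows to Python; whether it is slower than the RDD approach would need to be measured on real data.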

Answer 5

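As @Powers showed, there is a nice, easy-to-read function for removing whitespace, provided by a package called quinn. You can find it here: https://github.com/MrPowers/quinn. If you are working in a Databricks workspace, these instructions explain how to install it: https://docs.databricks.com/libraries.html

Here is another illustration of how it works: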

# import libraries
import quinn
from pyspark.sql.functions import col

#create an example dataframe
df = sc.parallelize([
    (1, "foo bar"), (2, "foobar "), (3, "   ")
]).toDF(["k", "v"])

#function call to remove whitespace. Note, withColumn will replace column v if it already exists
df = df.withColumn(
    "v",
    quinn.remove_all_whitespace(col("v"))
)


Answer 1: score 25, accepted, 2016-02-22 00:53:55
Answer 2: score 4, 2016-02-21 21:21:24
Answer 3: score 3, 2017-11-24 02:53:42
Answer 4: score 1, 2018-10-05 16:26:25
Answer 5: score 1, 2020-01-24 19:37:08