
Remove blank space from data frame column values in Spark

I have a data frame (business_df) with the following schema:

|-- business_id: string (nullable = true)
|-- categories: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- city: string (nullable = true)
|-- full_address: string (nullable = true)
|-- hours: struct (nullable = true)
|-- name: string (nullable = true)

I want to make a new data frame (new_df) so that the values in the 'name' column do not contain any blank spaces.

My code is:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

udf = UserDefinedFunction(lambda x: x.replace(' ', ''), StringType())
new_df = business_df.select(*[udf(column).alias(name) if column == name else column for column in business_df.columns])
new_df.registerTempTable("vegas")
new_df.printSchema()
vegas_business = sqlContext.sql("SELECT stars, name from vegas limit 10").collect()

I keep receiving this error:

NameError: global name 'replace' is not defined

What's wrong with this code?

While the problem you've described is not reproducible with the provided code, using Python UDFs to handle simple tasks like this is rather inefficient. If you want to simply remove spaces from the text, use regexp_replace:

from pyspark.sql.functions import regexp_replace, col

df = sc.parallelize([
    (1, "foo bar"), (2, "foobar "), (3, "   ")
]).toDF(["k", "v"])

df.select(regexp_replace(col("v"), " ", ""))
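
For example, applied to the sample column and displayed (the alias is only for readability, and the output below is a sketch of what to expect):

df.select(regexp_replace(col("v"), " ", "").alias("v")).show()

# +------+
# |     v|
# +------+
# |foobar|
# |foobar|
# |      |
# +------+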

If you want to normalize empty lines, use trim:

from pyspark.sql.functions import trim

df.select(trim(col("v")))

If you want to keep leading / trailing spaces, you can adjust regexp_replace:

df.select(regexp_replace(col("v"), r"^\s+$", ""))

As @zero323 said, it's probably that you shadowed the replace function somewhere. I tested your code and it works perfectly.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

df = sqlContext.createDataFrame([("aaa 111",), ("bbb 222",), ("ccc 333",)], ["names"])
spaceDeleteUDF = udf(lambda s: s.replace(" ", ""), StringType())
df.withColumn("names", spaceDeleteUDF("names")).show()

#+------+
#| names|
#+------+
#|aaa111|
#|bbb222|
#|ccc333|
#+------+
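
For reference, one hypothetical way to produce exactly that message under Python 2 is to call replace as a free function instead of as a string method:

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType

# 'replace' is a bare (undefined) name here, so evaluating the UDF raises
# NameError: global name 'replace' is not defined
udf = UserDefinedFunction(lambda x: replace(x, ' ', ''), StringType())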

Here's a function that removes all whitespace in a string:

import pyspark.sql.functions as F

def remove_all_whitespace(col):
    return F.regexp_replace(col, "\\s+", "")

You can use the function like this:

import quinn
from pyspark.sql.functions import col

actual_df = source_df.withColumn(
    "words_without_whitespace",
    quinn.remove_all_whitespace(col("words"))
)

The remove_all_whitespace function is defined in the quinn library. quinn also defines single_space and anti_trim methods to manage whitespace. PySpark defines ltrim, rtrim, and trim methods to manage whitespace.
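
For comparison, the built-in trimming functions can be applied side by side, assuming a string column v as in the earlier example:

from pyspark.sql.functions import col, ltrim, rtrim, trim

df.select(
    ltrim(col("v")).alias("ltrim_v"),   # strips leading spaces only
    rtrim(col("v")).alias("rtrim_v"),   # strips trailing spaces only
    trim(col("v")).alias("trim_v")      # strips both leading and trailing spaces
).show()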

I think the solution using regexp_replace is too slow, even for small amounts of data! So I tried to find another way, and I think I found it!

Not beautiful, a little naive, but it's fast! What do you think?

from pyspark.sql import Row
from pyspark.sql.functions import ltrim, rtrim
from pyspark.sql.types import StructType, StructField, StringType

def normalizeSpace(df, colName):

    # Left and right trim
    df = df.withColumn(colName, ltrim(df[colName]))
    df = df.withColumn(colName, rtrim(df[colName]))

    # This is faster than the regexp_replace function!
    def normalize(row, colName):
        data = row.asDict()
        text = data[colName]
        words = []
        word = ''

        for char in text:
            if char != ' ':
                word += char
            elif word == '':
                # skip runs of consecutive spaces
                continue
            else:
                words.append(word)
                word = ''

        # flush the last word, which is not followed by a space
        if word != '':
            words.append(word)

        if len(words) > 0:
            data[colName] = ' '.join(words)

        return Row(**data)

    df = df.rdd.map(lambda row: normalize(row, colName)).toDF()
    return df

schema = StructType([StructField('name', StringType())])
rows = [Row(name='  dvd player samsung   hdmi hdmi 160W reais    de potencia bivolt   ')]
df = spark.createDataFrame(rows, schema)
df = normalizeSpace(df, 'name')
df.show(df.count(), False)

That prints:

+----------------------------------------------------------+
|name                                                      |
+----------------------------------------------------------+
|dvd player samsung hdmi hdmi 160W reais de potencia bivolt|
+----------------------------------------------------------+
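
For what it's worth, the same trim-and-collapse behavior can also be sketched with a plain Python UDF built on str.split, which splits on any run of whitespace (normalize_udf is a hypothetical name):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# str.split() with no arguments splits on any run of whitespace and drops
# empty strings, so rejoining the pieces trims and collapses the spaces
normalize_udf = udf(lambda s: ' '.join(s.split()), StringType())
df = df.withColumn('name', normalize_udf(df['name']))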

As shown by @Powers, there is a very nice and easy-to-read function to remove white spaces, provided by a package called quinn. You can find it here: https://github.com/MrPowers/quinn. Here are the instructions on how to install it if working on a Databricks workspace: https://docs.databricks.com/libraries.html
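
In a Databricks notebook, a typical way to do this (assuming quinn is published on PyPI under that name) is:

# install quinn into the notebook environment (assumed PyPI package name)
%pip install quinn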

Here, again, is an illustration of how it works:

# import libraries
import quinn
from pyspark.sql.functions import col

# create an example dataframe
df = sc.parallelize([
    (1, "foo bar"), (2, "foobar "), (3, "   ")
]).toDF(["k", "v"])

#function call to remove whitespace. Note, withColumn will replace column v if it already exists
df = df.withColumn(
    "v",
    quinn.remove_all_whitespace(col("v"))
)

The output:
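
Given the example rows above, df.show() should print roughly:

df.show()

# +---+------+
# |  k|     v|
# +---+------+
# |  1|foobar|
# |  2|foobar|
# |  3|      |
# +---+------+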
