Remove blank space from data frame column values in Spark
I have a data frame (business_df) with the following schema:
|-- business_id: string (nullable = true)
|-- categories: array (nullable = true)
| |-- element: string (containsNull = true)
|-- city: string (nullable = true)
|-- full_address: string (nullable = true)
|-- hours: struct (nullable = true)
|-- name: string (nullable = true)
I want to make a new data frame (new_df) in which the values in the 'name' column do not contain any blank spaces.

My code is:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import StringType
udf = UserDefinedFunction(lambda x: x.replace(' ', ''), StringType())
new_df = business_df.select(*[udf(column).alias(name) if column == name else column for column in business_df.columns])
new_df.registerTempTable("vegas")
new_df.printSchema()
vegas_business = sqlContext.sql("SELECT stars, name from vegas limit 10").collect()
I keep receiving this error:
NameError: global name 'replace' is not defined
What's wrong with this code?
While the problem you've described is not reproducible with the provided code, using Python UDFs to handle simple tasks like this is rather inefficient. If you simply want to remove spaces from the text, use regexp_replace:
from pyspark.sql.functions import regexp_replace, col
df = sc.parallelize([
    (1, "foo bar"), (2, "foobar "), (3, " ")
]).toDF(["k", "v"])
df.select(regexp_replace(col("v"), " ", ""))
If you want to normalize empty lines, use trim:
from pyspark.sql.functions import trim
df.select(trim(col("v")))
If you want to keep leading / trailing spaces, you can adjust the regexp_replace pattern:
df.select(regexp_replace(col("v"), r"^\s+$", ""))
As @zero323 said, it's probably that you shadowed the replace function somewhere. I tested your code and it works perfectly:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
df = sqlContext.createDataFrame([("aaa 111",), ("bbb 222",), ("ccc 333",)], ["names"])
spaceDeleteUDF = udf(lambda s: s.replace(" ", ""), StringType())
df.withColumn("names", spaceDeleteUDF("names")).show()
#+------+
#| names|
#+------+
#|aaa111|
#|bbb222|
#|ccc333|
#+------+
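As a sanity check, the only way I can reproduce that exact NameError is by calling replace as a bare function name instead of as a string method. A minimal sketch (plain Python, no Spark needed, and my own toy string):

```python
text = "foo bar"

# Works: replace is a method on the string object.
print(text.replace(" ", ""))  # foobar

# Fails: replace is not a built-in top-level function, so calling it
# as a bare name raises the same NameError the question reports.
try:
    replace(text, " ", "")
except NameError as e:
    print(e)
```

So if replace appears anywhere in the session as a free name, check for a stray definition or a missing object in front of the call.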
Here's a function that removes all whitespace in a string:
import pyspark.sql.functions as F
def remove_all_whitespace(col):
    return F.regexp_replace(col, "\\s+", "")
You can use the function like this:
actual_df = source_df.withColumn(
    "words_without_whitespace",
    quinn.remove_all_whitespace(col("words"))
)
The remove_all_whitespace function is defined in the quinn library. quinn also defines single_space and anti_trim methods to manage whitespace. PySpark itself defines ltrim, rtrim, and trim methods to manage whitespace.
I think the solution using regexp_replace is too slow, even for a small amount of data! So I tried to find another way, and I think I found it!

Not beautiful, a little naive, but it's fast! What do you think?
from pyspark.sql import Row
from pyspark.sql.functions import ltrim, rtrim
from pyspark.sql.types import StringType, StructField, StructType

def normalizeSpace(df, colName):
    # Left and right trim
    df = df.withColumn(colName, ltrim(df[colName]))
    df = df.withColumn(colName, rtrim(df[colName]))

    # This is faster than the regexp_replace function!
    def normalize(row, colName):
        data = row.asDict()
        text = data[colName]
        words = []
        word = ''
        for char in text:
            if char != ' ':
                word += char
            elif word == '' and char == ' ':
                continue
            else:
                words.append(word)
                word = ''
        if word != '':  # don't drop the final word
            words.append(word)
        if len(words) > 0:
            data[colName] = ' '.join(words)
        return Row(**data)

    df = df.rdd.map(lambda row: normalize(row, colName)).toDF()
    return df

schema = StructType([StructField('name', StringType())])
rows = [Row(name=' dvd player samsung hdmi hdmi 160W reais de potencia bivolt ')]
df = spark.createDataFrame(rows, schema)
df = normalizeSpace(df, 'name')
df.show(df.count(), False)

That prints:

+----------------------------------------------------------+
|name                                                      |
+----------------------------------------------------------+
|dvd player samsung hdmi hdmi 160W reais de potencia bivolt|
+----------------------------------------------------------+
As shown by @Powers, there is a very nice and easy-to-read function to remove white spaces, provided by a package called quinn. You can find it here: https://github.com/MrPowers/quinn. Here are the instructions on how to install it if working on a Databricks workspace: https://docs.databricks.com/libraries.html
Here again is an illustration of how it works:
#import libraries
import quinn
from pyspark.sql.functions import col

#create an example dataframe
df = sc.parallelize([
    (1, "foo bar"), (2, "foobar "), (3, " ")
]).toDF(["k", "v"])

#function call to remove whitespace. Note, withColumn will replace column v if it already exists
df = df.withColumn(
    "v",
    quinn.remove_all_whitespace(col("v"))
)