
Apply a text-preprocessing function to a dataframe column in Scala Spark

I want to create a function to handle the text preprocessing for a problem I am facing with text data. I am familiar with Python and pandas dataframes, and my usual approach is to write a function and then use the pandas apply method to apply it to every element in a column. However, I don't know where to begin to accomplish this in Scala Spark.

So far I have created two functions to handle the replacements. The problem is that I don't know how to put more than one replacement inside a single function. I need to make about 20 replacements on each of three separate dataframes, so solving it this way would take about 60 lines of code. Is there a way to do all the replacements inside a single function and then apply it to all the elements in a dataframe column in Scala?

import org.apache.spark.sql.functions.udf
// Assumes a SparkSession named spark is in scope; $"..." needs import spark.implicits._

// replaceAll takes a regex, so the metacharacters $ and ? must be escaped
def removeSpecials: String => String = _.replaceAll("\\$", " ")
def removeSpecials2: String => String = _.replaceAll("\\?", " ")
val udf_removeSpecials = udf(removeSpecials)
val udf_removeSpecials2 = udf(removeSpecials2)
val consolidated2 = consolidated.withColumn("product_description", udf_removeSpecials($"product_description"))
val consolidated3 = consolidated2.withColumn("product_description", udf_removeSpecials2($"product_description"))
consolidated3.show()

Well, you can simply chain each replacement after the previous one, like this (note that replaceAll takes a regular expression, so $ and ? have to be escaped):

def removeSpecials: String => String = _.replaceAll("\\$", " ").replaceAll("\\?", " ")
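With around 20 replacements, chaining by hand gets long. One possible alternative, shown here only as a sketch, is to keep the replacements in a sequence and fold over it. The object name ReplaceChain, the function name cleanText, and the pairs in the list are hypothetical and only illustrate the idea.

```scala
object ReplaceChain {
  // Hypothetical (target, replacement) pairs; extend this list as needed.
  val replacements: Seq[(String, String)] = Seq(
    "$" -> " ",
    "?" -> " ",
    "&" -> " and "
  )

  // String.replace treats the target as a literal, so no regex escaping is needed.
  // Option(...) guards against null input, which can occur in nullable Spark columns.
  val cleanText: String => String = text =>
    Option(text)
      .map(t => replacements.foldLeft(t) { case (acc, (target, repl)) => acc.replace(target, repl) })
      .orNull
}
```

A function like this could then be wrapped once with udf(ReplaceChain.cleanText) and reused on each of the three dataframes, so adding a replacement means adding one pair to the list.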

But in this case, where the replacement character is the same for every match, it is better to use a single regular expression and avoid multiple replaceAll calls.

def removeSpecials: String => String = _.replaceAll("\\$|\\?", " ")

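Note that \\ is used as the escape character: inside a Scala string literal, \\ produces a single backslash, which in turn escapes $ and ? in the regular expression.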
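If escaping each character by hand feels error-prone, the pattern can also be built from a list of literals. The following is a minimal sketch, assuming every listed character should become a single space; the object name SpecialsPattern and the list contents are hypothetical.

```scala
import java.util.regex.Pattern

object SpecialsPattern {
  // Hypothetical list of characters to strip; extend as needed.
  val specials: Seq[String] = Seq("$", "?", "*", "+")

  // Pattern.quote turns each literal into a regex that matches it verbatim,
  // so metacharacters such as $ and ? need no manual escaping.
  val pattern: String = specials.map(s => Pattern.quote(s)).mkString("|")

  // Same shape as removeSpecials above, but driven by the list.
  val removeSpecials: String => String = _.replaceAll(pattern, " ")
}
```

Since the result is an ordinary Java regex string, it should also be usable with Spark's built-in regexp_replace(column, pattern, " ") from org.apache.spark.sql.functions, which would avoid the UDF altogether. That variant is untested here.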
