How to replace null values with a specific value in a DataFrame using Spark in Java?

I am trying to improve the accuracy of a logistic regression model implemented in Spark using Java. To do this, I want to replace the null or invalid values in a column with that column's most frequent value. For example:

Name|Place
a   |a1
a   |a2
a   |a2
    |d1
b   |a2
c   |a2
c   |
    |
d   |c1

In this case I'll replace all the NULL values in column "Name" with 'a' and those in column "Place" with 'a2'. So far I have only been able to extract the most frequent value of a given column. Can you please help me with the second step: how to replace the null or invalid values with the most frequent value of that column?

You can use the .na.fill function (it is defined in org.apache.spark.sql.DataFrameNaFunctions).

Basically, the function you need is: def fill(value: String, cols: Seq[String]): DataFrame

You choose the columns to fill and the value that should replace the null or NaN values in them.

In your case it will be something like:

val df2 = df.na.fill("a", Seq("Name"))
            .na.fill("a2", Seq("Place"))

You'll want to use the fill(String value, String[] columns) method, reached through your dataframe's na() functions, which replaces null values in the given list of columns with the value you specify.

So if you already know the value that you want to replace null with:

String[] colNames = {"Name"};
dataframe = dataframe.na().fill("a", colNames);

You can do the same for the rest of your columns.

You can use DataFrame.na.fill() to replace nulls with a value. To update several columns at once, pass a map from column name to replacement value:

val map = Map("Name" -> "a", "Place" -> "a2")

df.na.fill(map).show()

But if you want to replace bad records too, you need to identify them first. You can do this by matching the values against a regular expression, for example with the rlike function.
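
The validation step is not shown in the answer, so here is a minimal sketch of the overall logic in plain Java, without Spark: treat values that fail a pattern as missing, find the most frequent remaining value, and fill the gaps with it. The class ModeFill, its method names, and the "letters or digits only" validity rule are all assumptions made for illustration; in Spark the filling itself would still go through na().fill as described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Spark: illustrates the logic on plain lists.
class ModeFill {
    // Assumed rule for a "valid" value: one or more letters or digits.
    static final Pattern VALID = Pattern.compile("[A-Za-z0-9]+");

    // Returns the trimmed value, or null when it is missing or fails the pattern.
    static String clean(String value) {
        if (value == null) {
            return null;
        }
        String trimmed = value.trim();
        return VALID.matcher(trimmed).matches() ? trimmed : null;
    }

    // Most frequent non-null value; on a tie, the value that reached
    // the top count first wins. Returns null if every value is null.
    static String mostFrequent(List<String> column) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String v : column) {
            if (v == null) {
                continue;
            }
            int c = counts.merge(v, 1, Integer::sum);
            if (c > bestCount) {
                best = v;
                bestCount = c;
            }
        }
        return best;
    }

    // Marks invalid values as null, then replaces every null with the mode.
    static List<String> fillWithMode(List<String> column) {
        List<String> cleaned = new ArrayList<>();
        for (String v : column) {
            cleaned.add(clean(v));
        }
        String mode = mostFrequent(cleaned);
        List<String> filled = new ArrayList<>();
        for (String v : cleaned) {
            filled.add(v == null ? mode : v);
        }
        return filled;
    }

    public static void main(String[] args) {
        // The two columns from the question; null and blank mark the gaps.
        List<String> name = Arrays.asList("a", "a", "a", null, "b", "c", "c", "", "d");
        List<String> place = Arrays.asList("a1", "a2", "a2", "d1", "a2", "a2", null, " ", "c1");
        System.out.println(fillWithMode(name));  // [a, a, a, a, b, c, c, a, d]
        System.out.println(fillWithMode(place)); // [a1, a2, a2, d1, a2, a2, a2, a2, c1]
    }
}
```

Cleaning before counting matters here: otherwise an invalid value that happens to repeat often could itself be picked as the most frequent one.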

To replace the NULL values with a given string, I used the fill function in Spark's Java API. It accepts the replacement value and a sequence of column names. Here is how I implemented it:

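// cols[i] is the name of the column being filled; word is the replacement value.
List<String> colList = new ArrayList<String>();
colList.add(cols[i]);
// Convert the Java list to a Scala Seq, which this fill overload expects.
Seq<String> colSeq = scala.collection.JavaConverters.asScalaIteratorConverter(colList.iterator()).asScala().toSeq();
data = data.na().fill(word, colSeq);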
