简体   繁体   中英

How to replace null values with a specific value in Dataframe using spark in Java?

I am trying improve the accuracy of Logistic regression algorithm implemented in Spark using Java. For this I'm trying to replace Null or invalid values present in a column with the most frequent value of that column. For Example:-

Name|Place
a   |a1
a   |a2
a   |a2
    |d1
b   |a2
c   |a2
c   |
    |
d   |c1

In this case I'll replace all the NULL values in column "Name" with 'a' and in column "Place" with 'a2'. Till now I am able to extract only the most frequent columns in a particular column. Can you please help me with the second step on how to replace the null or invalid values with the most frequent values of that column.

You can use .na.fill function (it is a function in org.apache.spark.sql.DataFrameNaFunctions ).

Basically the function you need is: def fill(value: String, cols: Seq[String]): DataFrame

You can choose the columns, and you choose the value you want to replace the null or NaN.

In your case it will be something like:

val df2 = df.na.fill("a", Seq("Name"))
            .na.fill("a2", Seq("Place"))

You'll want to use the fill(String value, String[] columns) method of your dataframe, which automatically replaces Null values in a given list of columns with the value you specified.

So if you already know the value that you want to replace Null with...:

String[] colNames = {"Name"}
dataframe = dataframe.na.fill("a", colNames)

You can do the same for the rest of your columns.

You can use DataFrame.na.fill() to replace the null with some value To update at once you can do as

val map = Map("Name" -> "a", "Place" -> "a2")

df.na.fill(map).show()

But if you want to replace a bad record too then you need to validate the bad records first. You can do this by using regular expression with like function.

In order to replace the NULL values with a given string I've used fill function present in Spark for Java. It accepts the word to be replaced with and a sequence of column names. Here is how I have implemented that:-

List<String> colList = new ArrayList<String>();
colList.add(cols[i]);
Seq<String> colSeq = scala.collection.JavaConverters.asScalaIteratorConverter(colList.iterator()).asScala().toSeq();
data=data.na().fill(word, colSeq);

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM