
Replace new line (\n) character in csv file - spark scala

Just to illustrate the problem, I have taken a small test CSV file. In the real scenario, the job has to handle more than a terabyte of data.

I have a CSV file whose columns are enclosed in quotes ("col1"). When the data was imported, one column turned out to contain newline characters (\n). This causes a lot of problems when I want to save the data as Hive tables.

My idea was to replace the \n character with a "|" pipe in Spark.

What I have achieved so far:

    // 1. read the CSV file
    val test = sqlContext.load(
      "com.databricks.spark.csv",
      Map("path" -> "test_set.csv", "header" -> "true", "inferSchema" -> "true",
          "delimiter" -> ",", "quote" -> "\"", "escape" -> "\\", "parserLib" -> "univocity"))

    // 2. convert to a DataFrame
    val dataframe = test.toDF()

    // 3. print
    dataframe.foreach(println)

    // 4. replace -- not working for me
    dataframe.map(row => {
      val row4 = row.getAs[String](4)
      val make = row4.replaceAll("[\r\n]", "|")
      (make)
    }).collect().foreach(println)

Sample set:

(17 , D73 ,525, 1  ,testing\n    ,  90 ,20.07.2011 ,null ,F10 , R)
 (17 , D73 ,526, 1  ,null         ,  89 ,20.07.2011 ,null ,F10 , R)
 (17 , D73 ,529, 1  ,once \n again,  10 ,20.07.2011 ,null ,F10 , R)
 (17 , D73 ,531, 1  ,test3\n      ,  10 ,20.07.2011 ,null ,F10 , R)

Expected result set:

(17 , D73 ,525, 1  ,testing|    ,  90 ,20.07.2011 ,null ,F10 , R)
 (17 , D73 ,526, 1  ,null         ,  89 ,20.07.2011 ,null ,F10 , R)
 (17 , D73 ,529, 1  ,once | again,  10 ,20.07.2011 ,null ,F10 , R)
 (17 , D73 ,531, 1  ,test3|      ,  10 ,20.07.2011 ,null ,F10 , R)

What worked for me:

    val rep = "\n123\n Main Street\n".replaceAll("[\\r\\n]", "|")
    rep: String = |123| Main Street|

But why am I not able to do the same on a tuple basis?

    val dataRDD = lines_wo_header.map(line => line.split(";")).map(row =>
      (row(0).toLong, row(1).toString,
       row(2).toLong, row(3).toLong,
       row(4).toString, row(5).toLong,
       row(6).toString, row(7).toString, row(8).toString, row(9).toString))

    dataRDD.map(row => {
      val wert = row._5.replaceAll("[\\r\\n]", "|")
      (row._1, row._2, row._3, row._4, wert, row._6, row._7, row._8, row._9, row._10)
    }).collect().foreach(println)

Spark version: 1.3.1

If you can use Spark SQL 1.5 or higher, you may consider using the functions available for columns. Assuming you don't know (or don't have) the names for the columns, you can do as in the following snippet:

val df = test.toDF()

import org.apache.spark.sql.functions._
val newDF = df.withColumn(df.columns(4), regexp_replace(col(df.columns(4)), "[\\r\\n]", "|"))

If you know the name of the column, you can replace df.columns(4) with that name in both occurrences, for example as in the sketch below.
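A minimal sketch, assuming the affected column is named comment (a hypothetical name used only for illustration):

    import org.apache.spark.sql.functions._

    // replace carriage returns and newlines in the "comment" column with a pipe,
    // addressing the column by name instead of by position
    val cleanedDF = df.withColumn("comment", regexp_replace(col("comment"), "[\\r\\n]", "|"))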

I hope that helps. Cheers.

My idea was to replace the \n character with a "|" pipe in Spark.

I tried the replaceAll method but it did not work. Here is an alternative that achieves the same result:

    // sq is the SQLContext
    val test = sq.load(
      "com.databricks.spark.csv",
      Map("path" -> "file:///home/veda/sample.csv", "header" -> "false", "inferSchema" -> "true",
          "delimiter" -> ",", "quote" -> "\"", "escape" -> "\\", "parserLib" -> "univocity"))

    val dataframe = test.toDF()

    val mapped = dataframe.map { row =>
      val str = row.get(0).toString()
      val fnal = new StringBuilder(str)

      // replace the literal two-character "\n" sequences with a pipe
      var newLineIndex = fnal.indexOf("\\n")
      while (newLineIndex != -1) {
        fnal.replace(newLineIndex, newLineIndex + 2, "|")
        newLineIndex = fnal.indexOf("\\n")
      }

      // replace the literal two-character "\r" (carriage return) sequences with a pipe
      var cgIndex = fnal.indexOf("\\r")
      while (cgIndex != -1) {
        fnal.replace(cgIndex, cgIndex + 2, "|")
        cgIndex = fnal.indexOf("\\r")
      }

      fnal.toString() // modified value
    }

    mapped.collect().foreach(println)

Note: you might want to move the duplicated replace loop into a separate helper function, for example as sketched below.
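A minimal sketch of such a helper, meant to sit alongside the map above (the name replaceAllLiteral is purely illustrative):

    // replace every occurrence of `target` in `sb` with `replacement`
    def replaceAllLiteral(sb: StringBuilder, target: String, replacement: String): StringBuilder = {
      var idx = sb.indexOf(target)
      while (idx != -1) {
        sb.replace(idx, idx + target.length, replacement)
        idx = sb.indexOf(target)
      }
      sb
    }

    // usage inside the map:
    // replaceAllLiteral(replaceAllLiteral(fnal, "\\n", "|"), "\\r", "|")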

Multi-line support for CSV was added in Spark 2.2 (see the JIRA ticket), and Spark 2.2 has not been released yet.
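For reference, once Spark 2.2 is out, the built-in CSV reader is expected to handle quoted fields containing embedded newlines via a multiLine option; a rough, forward-looking sketch (the option name and the spark session variable are assumed from the 2.2 API):

    // Spark 2.2+ (expected): read a CSV whose quoted fields may contain newlines
    val df = spark.read
      .option("header", "true")
      .option("multiLine", "true")
      .option("quote", "\"")
      .option("escape", "\\")
      .csv("test_set.csv")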

I faced the same issue and resolved it with the help of a Hadoop InputFormat and record reader.

Copy the InputFormat and reader classes from Git and use them like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// implementation

    JavaPairRDD<LongWritable, Text> rdd =
        context.newAPIHadoopFile(path, FileCleaningInputFormat.class, null, null, new Configuration());
    JavaRDD<String> inputWithMultiline = rdd.map(s -> s._2().toString());

Another solution: use CSVInputFormat from Apache Crunch to read the CSV file, then parse each CSV line using opencsv:

sparkContext.newAPIHadoopFile(path, CSVInputFormat.class, null, null, new Configuration()).map(s -> s._2().toString());
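A rough sketch of the opencsv parsing step on top of the lines produced above, written in Scala to match the rest of the thread; the variable lines, the com.opencsv.CSVParser import (for a recent opencsv version), and the column index 4 from the sample data are all assumptions:

    import com.opencsv.CSVParser

    // `lines` is assumed to be the RDD[String] returned by the newAPIHadoopFile call above
    val cleaned = lines.mapPartitions { iter =>
      val parser = new CSVParser()            // one parser per partition (CSVParser may not be serializable)
      iter.map { line =>
        val fields = parser.parseLine(line)   // split the line while respecting quotes
        fields(4) = fields(4).replaceAll("[\\r\\n]", "|")
        fields.mkString(",")
      }
    }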

Apache Crunch Maven dependency:

 <dependency>
      <groupId>org.apache.crunch</groupId>
      <artifactId>crunch-core</artifactId>
      <version>0.15.0</version>
  </dependency>
