
Replace new line (\n) character in csv file - spark scala

Just to illustrate the problem, I have taken a test CSV file. But in the real-world scenario, the problem has to handle more than a terabyte of data.

I have a CSV file where the columns are enclosed by quotes ("col1"). But when the data import was done, one column turned out to contain newline characters (\n). This leads to a lot of problems when I want to save the data as Hive tables.

My idea was to replace the \n character with a "|" pipe in Spark.

What I have achieved so far:

// 1. Read the CSV file
val test = sqlContext.load(
  "com.databricks.spark.csv",
  Map("path" -> "test_set.csv", "header" -> "true", "inferSchema" -> "true",
      "delimiter" -> ",", "quote" -> "\"", "escape" -> "\\", "parserLib" -> "univocity"))

// 2. Convert to a DataFrame
val dataframe = test.toDF()

// 3. Print
dataframe.foreach(println)

// 4. Replace - not working for me
dataframe.map(row => {
  val row4 = row.getAs[String](4)
  val make = row4.replaceAll("[\r\n]", "|")
  (make)
}).collect().foreach(println)

Sample set:

(17 , D73 ,525, 1  ,testing\n    ,  90 ,20.07.2011 ,null ,F10 , R)
(17 , D73 ,526, 1  ,null         ,  89 ,20.07.2011 ,null ,F10 , R)
(17 , D73 ,529, 1  ,once \n again,  10 ,20.07.2011 ,null ,F10 , R)
(17 , D73 ,531, 1  ,test3\n      ,  10 ,20.07.2011 ,null ,F10 , R)

Expected result set:

(17 , D73 ,525, 1  ,testing|    ,  90 ,20.07.2011 ,null ,F10 , R)
(17 , D73 ,526, 1  ,null         ,  89 ,20.07.2011 ,null ,F10 , R)
(17 , D73 ,529, 1  ,once | again,  10 ,20.07.2011 ,null ,F10 , R)
(17 , D73 ,531, 1  ,test3|      ,  10 ,20.07.2011 ,null ,F10 , R)

What worked for me:

scala> val rep = "\n123\n Main Street\n".replaceAll("[\\r\\n]", "|")
rep: String = |123| Main Street|

But why am I not able to do the same on a tuple basis?

val dataRDD = lines_wo_header.map(line => line.split(";")).map(row =>
  (row(0).toLong, row(1).toString,
   row(2).toLong, row(3).toLong,
   row(4).toString, row(5).toLong,
   row(6).toString, row(7).toString, row(8).toString, row(9).toString))

dataRDD.map(row => {
  val wert = row._5.replaceAll("[\\r\\n]", "|")
  (row._1, row._2, row._3, row._4, wert, row._6, row._7, row._8, row._9, row._10)
}).collect().foreach(println)

Spark version: 1.3.1

If you can use Spark SQL 1.5 or higher, you may consider using the functions available for columns. Assuming you don't know (or don't have) the names of the columns, you can do it as in the following snippet:

val df = test.toDF()

import org.apache.spark.sql.functions._
val newDF = df.withColumn(df.columns(4), regexp_replace(col(df.columns(4)), "[\\r\\n]", "|"))

If you know the name of the column, you can replace df.columns(4) with its name in both occurrences.
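For example, if that column were named, say, comment (a hypothetical name used purely for illustration), the replacement could be written as:

import org.apache.spark.sql.functions._

// "comment" is a placeholder; substitute the real column name
val newDF = df.withColumn("comment", regexp_replace(col("comment"), "[\\r\\n]", "|"))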

I hope that helps. Cheers.

My idea was to replace the \n character with a "|" pipe in Spark.

I tried the replaceAll method but it is not working. Here is an alternative to achieve the same:

val test = sq.load(
        "com.databricks.spark.csv",
        Map("path" -> "file:///home/veda/sample.csv", "header" -> "false", "inferSchema" -> "true", "delimiter" -> "," , "quote" -> "\"", "escape" -> "\\" ,"parserLib" -> "univocity" ))

val dataframe = test.toDF()

val mapped = dataframe.map { row =>
  val str = row.get(0).toString()
  val fnal = new StringBuilder(str)

  // replace newlines
  var newLineIndex = fnal.indexOf("\\n")
  while (newLineIndex != -1) {
    fnal.replace(newLineIndex, newLineIndex + 2, "|")
    newLineIndex = fnal.indexOf("\\n")
  }

  // replace carriage returns
  var cgIndex = fnal.indexOf("\\r")
  while (cgIndex != -1) {
    fnal.replace(cgIndex, cgIndex + 2, "|")
    cgIndex = fnal.indexOf("\\r")
  }

  fnal.toString() // modified value
}

mapped.collect().foreach(println)

Note: You might want to move the duplicated code into a separate function.
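A minimal sketch of what such a helper could look like (the name replaceAllTokens is made up for illustration):

// Replace every occurrence of `token` in the builder with `replacement`
def replaceAllTokens(fnal: StringBuilder, token: String, replacement: String): StringBuilder = {
  var idx = fnal.indexOf(token)
  while (idx != -1) {
    fnal.replace(idx, idx + token.length, replacement)
    idx = fnal.indexOf(token)
  }
  fnal
}

// usage inside the map, mirroring the two loops above:
// val fnal = replaceAllTokens(replaceAllTokens(new StringBuilder(str), "\\n", "|"), "\\r", "|")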

Multi-line support for CSV is added in Spark version 2.2 (JIRA), and Spark 2.2 has not been released yet.
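If you can wait for that release, reading such a file directly might look roughly like the following sketch (this assumes the multiLine option ships as described in the JIRA; the option name is an assumption until 2.2 is out):

val df = spark.read
  .option("header", "true")
  .option("quote", "\"")
  .option("escape", "\\")
  .option("multiLine", "true")  // keep quoted newlines inside a single record (assumed 2.2+ option)
  .csv("test_set.csv")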

I had faced the same issue and resolved it with the help of a Hadoop input format and reader.

Copy the InputFormat and reader classes from git and implement them like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// implementation

JavaPairRDD<LongWritable, Text> rdd =
        context.newAPIHadoopFile(path, FileCleaningInputFormat.class, null, null, new Configuration());
JavaRDD<String> inputWithMultiline = rdd.map(s -> s._2().toString());

Another solution: use CSVInputFormat from Apache Crunch to read the CSV file, then parse each CSV line using opencsv:

sparkContext.newAPIHadoopFile(path, CSVInputFormat.class, null, null, new Configuration()).map(s -> s._2().toString());

Apache Crunch Maven dependency:

<dependency>
    <groupId>org.apache.crunch</groupId>
    <artifactId>crunch-core</artifactId>
    <version>0.15.0</version>
</dependency>
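A rough Scala sketch of the same idea (the package of Crunch's CSVInputFormat and the opencsv constructor shown here are assumptions; verify them against the versions you actually pull in):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.crunch.io.text.csv.CSVInputFormat  // assumed package for Crunch's CSVInputFormat
import au.com.bytecode.opencsv.CSVParser              // assumed opencsv 2.x coordinates

val records = sparkContext
  .newAPIHadoopFile(path, classOf[CSVInputFormat], classOf[LongWritable], classOf[Text], new Configuration())
  .map { case (_, line) =>
    // build the parser inside the task so nothing non-serializable is captured in the closure
    new CSVParser(',', '"', '\\').parseLine(line.toString)
  }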
