
How to process Text Qualifier delimited file in scala

I have a lot of delimited files with a text qualifier (every column starts and ends with a double quote). The delimiter is not consistent, i.e. it can be any of comma (,), pipe (|), ~, or tab (\t).

I need to read this file as text (a single column) and then check the number of delimiters while taking the text qualifier into account. If any record has fewer or more columns than defined, that record should be rejected and loaded to a different path.

Below is test data with 3 columns: ID, Name and DESC. The DESC column has an extra delimiter.

"ID","Name","DESC" “ID”、“名称”、“DESC”
"1" , "ABC", "A,BC" "1" , "ABC", "A,BC"
"2" , "XYZ" , "ABC is bother" "2" , "XYZ" , "ABC 打扰了"
"3" , "YYZ" , "" "3"、"YYZ"、""
4 , "XAA" , "sf,sd 4 , "XAA" , "sf,sd
sdfsf" sdfsf"

The last record is split into two records due to the newline character in the DESC field.

Below is the code I tried, but I am not able to handle this correctly.

import org.apache.spark.sql.functions.{coalesce, col, length, lit, regexp_replace}
import spark.implicits._

val SourceFileDF = spark.read.text(InputFilePath)
  .filter("value != ''") // removing empty records while reading
// count the number of delimiters (commas here) in each line
val aCnt = coalesce(length(regexp_replace($"value", "[^,]", "")), lit(0))
val Delimitercount = SourceFileDF.withColumn("a_cnt", aCnt)
// records whose delimiter count differs from the expected count are rejected
val invalidrecords = Delimitercount
  .filter(col("a_cnt") =!= NoOfDelimiters)
val GoodRecordsDF = Delimitercount
  .filter(col("a_cnt") === NoOfDelimiters)
  .drop("a_cnt")

With the above code I am able to reject all the records that have fewer or more delimiters, but I am not able to ignore a delimiter when it is inside the text qualifier.

Thanks in advance.

You may use a closure with replaceAllIn to remove any chars you want inside a match:

var y = """4 , "XAA" , "sf,sd
sdfsf""""
// matches one double-quoted field, allowing doubled "" escapes inside
val pattern = """"[^"]*(?:""[^"]*)*"""".r
// strip commas and newlines inside each quoted field
y = pattern replaceAllIn (y, m => m.group(0).replaceAll("[,\n]", ""))
print(y) // => 4 , "XAA" , "sfsdsdfsf"

See the Scala demo.

Details

  • " - matches a " " - 匹配一个"
  • [^"]* - any 0+ chars other than " [^"]* - 除"之外的任何 0+ 个字符
  • (?:""[^"]*)* - matches 0 or more sequences of "" and then 0+ chars other than " (?:""[^"]*)* - 匹配 0 个或多个""序列,然后匹配除" 0+ 个字符
  • " - a " . " -一个"

The code finds all non-overlapping matches of the above pattern in y, and upon finding a match (m), the commas and newlines (LF) are removed from the match value with m.group(0).replaceAll("[,\n]", ""), where m.group(0) is the match value and [,\n] matches either a , or a newline.
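
As a minimal sketch of one way to plug this regex into the Spark pipeline from the question (assuming the SourceFileDF, NoOfDelimiters and value column defined there; the stripInsideQuotes UDF, clean_value column and other intermediate names are illustrative), a UDF can strip commas and newlines inside the quoted fields of each line before the delimiter count is taken, so delimiters inside the text qualifier no longer inflate a_cnt:

import org.apache.spark.sql.functions.{coalesce, col, length, lit, regexp_replace, udf}

// same pattern as above: one double-quoted field, allowing doubled "" escapes inside
val quotedField = """"[^"]*(?:""[^"]*)*"""".r

// remove commas and newlines that occur inside quoted fields only
val stripInsideQuotes = udf { (line: String) =>
  quotedField.replaceAllIn(line, m => m.group(0).replaceAll("[,\n]", ""))
}

val cleaned = SourceFileDF.withColumn("clean_value", stripInsideQuotes(col("value")))
val withCnt = cleaned.withColumn(
  "a_cnt",
  coalesce(length(regexp_replace(col("clean_value"), "[^,]", "")), lit(0)))

val GoodRecordsDF = withCnt.filter(col("a_cnt") === NoOfDelimiters).drop("a_cnt", "clean_value")
val invalidrecords = withCnt.filter(col("a_cnt") =!= NoOfDelimiters)

Note that this runs line by line, so the fourth sample record, which spark.read.text splits into two physical lines, still gets the wrong delimiter count on each half and lands in invalidrecords, which matches the requirement of routing such records to a different path.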
