Parsing a large CSV file with unusual characters, spacing, brackets, and irregular returns in bash

I have a very large (1.5 GB) malformed CSV file I need to read into R, and while the file itself is a CSV, the delimiters break after a certain number of lines due to poorly-placed line returns.

I have attached a reduced example, but a truncated visual representation of it looks like this:

SubID,Date1,date2,var1,var2,broken_column_var
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[ -0.00000000   0.00000000  -0.00000000  -0.00000000   0.00000000
   -0.00000000  -0.00000000   0.00000000   0.00000000   0.00000000
    0.00000000   0.00000000   0.00000000]
 [ -0.00000000  -0.0000000   -0.00000000  -0.00000000  -0.0000000
   -0.0000000   -0.0000000    0.00000000   0.00000000  -0.00000000
   -0.00000000   0.00000000   0.0000000 ]]"
000000000,1111-11-11,1111-11-11,1,SECOND TEXT FOR ZERO,"[[  1.11111111  -1.11111111  -1.1111111   -1.1111111    1.1111111
    1.11111111   1.11111111   1.11111111]]"
000000000,2222-22-22,2222-22-22,2,THIRD TEXT FOR ZERO,"[[-2.2222222   2.22222222 -2.22222222 -2.22222222  2.2222222  -2.22222222
  -2.22222222 -2.22222222 -2.22222222  2.22222222  2.22222222  2.22222222]
 [-2.22222222 -2.22222222  2.22222222  2.2222222   2.22222222 -2.22222222
   2.2222222  -2.2222222   2.22222222  2.2222222   2.222222   -2.22222222]
 [-2.22222222 -2.2222222   2.22222222  2.2222222   2.22222222 -2.22222222
  -2.22222222 -2.2222222  -2.22222222  2.22222222  2.2222222   2.22222222]
 [-2.22222222 -2.22222222  2.2222222   2.2222222   2.2222222  -2.22222222
  -2.222222   -2.2222222  -2.2222222  -2.22222222  2.22222222  2.2222222 ]
 [-2.22222222 -2.222222    2.22222222  2.22222222  2.22222222 -2.2222222
  -2.2222222  -2.2222222  -2.2222222  -2.22222222  2.22222222 -2.222222  ]
 [ 2.22222222 -2.22222222 -2.222222   -2.222222   -2.2222222  -2.22222222
  -2.222222   -2.22222222  2.2222222  -2.2222222   2.2222222   2.22222222]]"
111111111,0000-00-00,0000-00-00,00,FIRST TEXT FOR ONE,"[[ -0.00000000   0.00000000  -0.00000000   0.000000    -0.00000000
   -0.00000000   0.00000000   0.00000000]]"

The line breaks all appear as \n's in the CSV.

To get around loading it all into memory and parsing it as a dataframe in another environment, I have been trying to print relevant snippets from the CSV to the terminal with the line returns removed, empty spaces collapsed, and commas inserted between variables.

Like the following:

000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000],[-0.00000000,-0.0000000,-0.00000000,-0.00000000,-0.0000000,-0.0000000,-0.0000000,0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.0000000]]"

My main attempt pulls everything from the lines between the quote-and-bracket markers ("[[ and ]]") with:

awk '/\"\[\[/{found=1} found{print; if (/]]"/) exit}'  Malformed_csv_Abridged.csv | tr -d '\n\r' | tr -s ' ' | tr ' ' ','

outputting:

000000000,0000-00-00,0000-00-00,0,FIRST,TEXT,FOR,ZERO,"[[,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000],[,-0.00000000,-0.0000000,-0.00000000,-0.00000000,-0.0000000,-0.0000000,-0.0000000,0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.0000000,]]"

Gets close, but:

  1. It only prints the first instance, so I need a way to find the other instances (see the sketch after this list).
  2. It inserts commas into the blank spaces before the characters I'm searching for ( "[[]]" ), which I don't need it to do.
  3. It leaves some extra commas by the brackets; I haven't quite found the right tr call to remove them due to the necessary escape characters.
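
For the first issue by itself, one small change to the attempt above is to reset the flag instead of exiting, so every bracketed record is captured and joined onto a single physical line (a sketch; the stray commas from issues 2 and 3 still remain, and the answers below address them):

awk '/"\[\[/{found=1} found{printf "%s", $0; if (/]]"/){print ""; found=0}}' Malformed_csv_Abridged.csv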

I didn't understand your goal. The CSV file seems to me a correct CSV file. If you just want to remove line breaks, you can use Miller and the clean-whitespace verb:

mlr --csv clean-whitespace Malformed.csv >Malformed_c.csv

to get this: https://gist.githubusercontent.com/aborruso/538e964c0c84a8b27d4c3d3b61d23bb4/raw/1fa83f43238be4a6aeb9c743aaf2e4da36f6cc74/Malformed_c.csv
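
To spot-check the cleaned file, assuming mlr is on hand, it can be viewed one record at a time as JSON:

mlr --icsv --ojson head -n 1 Malformed_c.csv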


Assumptions:

  • the only field that contains double quotes is the last field ( broken_column_var )
  • within the last field we do not have to worry about embedded/escaped double quotes (ie, for each data line the last field has exactly two double quotes)
  • all broken_column_var values contain at least one embedded linefeed (ie, each broken_column_var value spans at least 2 physical lines); otherwise we need some extra code to handle both double quotes residing on the same line... doable, and a sketch of such a rule appears after the sample output below

One (verbose) awk approach to removing the embedded linefeeds from broken_column_var while also replacing spaces with commas:

awk '
NR==1              { print; next }                      # print header
!in_merge && /["]/ { split($0,a,"\"")                   # 1st double quote found; split line on double quote
                     head     = a[1]                    # save 1st part of line
                     data     = "\"" a[2]               # save double quote and 2nd part of line
                     in_merge = 1                       # set flag
                     next
                   }
 in_merge          { data = data " " $0                 # append current line to "data"
                     if ( $0 ~ /["]/ ) {                # if 2nd double quote found => process "data"
                        gsub(/[ ]+/,",",data)           # replace consecutive spaces with single comma
                        gsub(/,[]]/,"]",data)           # replace ",]" with "]"
                        gsub(/[[],/,"[",data)           # replace "[," with "["
                        print head data                 # print new line
                        in_merge = 0                    # clear flag
                     }
                   }
' Malformed.csv

This generates:

SubID,Date1,date2,var1,var2,broken_column_var
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000],[-0.00000000,-0.0000000,-0.00000000,-0.00000000,-0.0000000,-0.0000000,-0.0000000,0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.0000000]]"
000000000,1111-11-11,1111-11-11,1,SECOND TEXT FOR ZERO,"[[1.11111111,-1.11111111,-1.1111111,-1.1111111,1.1111111,1.11111111,1.11111111,1.11111111]]"
000000000,2222-22-22,2222-22-22,2,THIRD TEXT FOR ZERO,"[[-2.2222222,2.22222222,-2.22222222,-2.22222222,2.2222222,-2.22222222,-2.22222222,-2.22222222,-2.22222222,2.22222222,2.22222222,2.22222222],[-2.22222222,-2.22222222,2.22222222,2.2222222,2.22222222,-2.22222222,2.2222222,-2.2222222,2.22222222,2.2222222,2.222222,-2.22222222],[-2.22222222,-2.2222222,2.22222222,2.2222222,2.22222222,-2.22222222,-2.22222222,-2.2222222,-2.22222222,2.22222222,2.2222222,2.22222222],[-2.22222222,-2.22222222,2.2222222,2.2222222,2.2222222,-2.22222222,-2.222222,-2.2222222,-2.2222222,-2.22222222,2.22222222,2.2222222],[-2.22222222,-2.222222,2.22222222,2.22222222,2.22222222,-2.2222222,-2.2222222,-2.2222222,-2.2222222,-2.22222222,2.22222222,-2.222222],[2.22222222,-2.22222222,-2.222222,-2.222222,-2.2222222,-2.22222222,-2.222222,-2.22222222,2.2222222,-2.2222222,2.2222222,2.22222222]]"
111111111,0000-00-00,0000-00-00,00,FIRST TEXT FOR ONE,"[[-0.00000000,0.00000000,-0.00000000,0.000000,-0.00000000,-0.00000000,0.00000000,0.00000000]]"
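
If the third assumption needs to be relaxed, an extra rule can be placed just before the !in_merge && /["]/ rule above to handle records whose two double quotes sit on the same physical line. A hedged sketch of such a rule (a fragment to splice into the script, not a standalone program):

NR>1 && !in_merge && gsub(/["]/,"&")==2 {   # both quotes on one line => record already complete
                     split($0,a,"\"")       # a[1]=leading fields, a[2]=quoted data
                     gsub(/[ ]+/,",",a[2])  # replace consecutive spaces with single comma
                     gsub(/,[]]/,"]",a[2])  # replace ",]" with "]"
                     gsub(/[[],/,"[",a[2])  # replace "[," with "["
                     print a[1] "\"" a[2] "\""
                     next
                   }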

Use the double quote as the field separator. A complete record has 1 or 3 fields.

awk '
  BEGIN {FS = OFS = "\""}               # split lines on double quotes
  {$0 = prev $0; $1=$1}                 # prepend any saved partial record
  NF % 2 == 1 {print; prev = ""; next}  # odd NF => even quote count => record complete
  {prev = $0}                           # otherwise save it and read on
  END {if (prev) print prev}            # flush a trailing partial record, if any
' file.csv
SubID,Date1,date2,var1,var2,broken_column_var
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[ -0.00000000   0.00000000  -0.00000000  -0.00000000   0.00000000   -0.00000000  -0.00000000   0.00000000   0.00000000   0.00000000    0.00000000   0.00000000   0.00000000] [ -0.00000000  -0.0000000   -0.00000000  -0.00000000  -0.0000000   -0.0000000   -0.0000000    0.00000000   0.00000000  -0.00000000   -0.00000000   0.00000000   0.0000000 ]]"
000000000,1111-11-11,1111-11-11,1,SECOND TEXT FOR ZERO,"[[  1.11111111  -1.11111111  -1.1111111   -1.1111111    1.1111111    1.11111111   1.11111111   1.11111111]]"
000000000,2222-22-22,2222-22-22,2,THIRD TEXT FOR ZERO,"[[-2.2222222   2.22222222 -2.22222222 -2.22222222  2.2222222  -2.22222222  -2.22222222 -2.22222222 -2.22222222  2.22222222  2.22222222  2.22222222] [-2.22222222 -2.22222222  2.22222222  2.2222222   2.22222222 -2.22222222   2.2222222  -2.2222222   2.22222222  2.2222222   2.222222   -2.22222222] [-2.22222222 -2.2222222   2.22222222  2.2222222   2.22222222 -2.22222222  -2.22222222 -2.2222222  -2.22222222  2.22222222  2.2222222   2.22222222] [-2.22222222 -2.22222222  2.2222222   2.2222222   2.2222222  -2.22222222  -2.222222   -2.2222222  -2.2222222  -2.22222222  2.22222222  2.2222222 ] [-2.22222222 -2.222222    2.22222222  2.22222222  2.22222222 -2.2222222  -2.2222222  -2.2222222  -2.2222222  -2.22222222  2.22222222 -2.222222  ] [ 2.22222222 -2.22222222 -2.222222   -2.222222   -2.2222222  -2.22222222  -2.222222   -2.22222222  2.2222222  -2.2222222   2.2222222   2.22222222]]"
111111111,0000-00-00,0000-00-00,00,FIRST TEXT FOR ONE,"[[ -0.00000000   0.00000000  -0.00000000   0.000000    -0.00000000   -0.00000000   0.00000000   0.00000000]]"

For a language with a CSV library, I've found perl's Text::CSV useful for quoted newlines:

perl -e '
  use Text::CSV;
  my $csv = Text::CSV->new ({ binary => 1, auto_diag => 1 });
  open my $fh, "<:encoding(utf8)", "file.csv" or die "file.csv: $!";
  while (my $row = $csv->getline ($fh)) {   # getline handles quoted embedded newlines
    $row->[-1] =~ s/\n//g;                  # strip the linefeeds from the last field
    $csv->say(*STDOUT, $row);
  }
'

This might work for you (GNU sed):

sed -E '1b
        :a;N;/"$/!ba
        s/"/\n&/
        h
        s/\n/ /2g
        s/.*\n//
        s/ +/,/g
        s/,\]/]/g
        s/\[,/[/g
        H
        g
        s/\n.*\n//' file

Forget the header line.

Gather up each record.

Introduce a newline before the last field.

Make a copy of the ameliorated record.

Replace all newlines from the second one onward with spaces.

Remove up to the first introduced newline.

Replace runs of spaces with commas.

Remove any introduced commas after or before square brackets.

Append the last field to the copy.

Make the copy current.

Remove everything between (and including) the introduced newlines.

N.B. This expects only the last field of each record to be double quoted.


Alternative:

sed -E '1b;:a;N;/"$/!ba;y/\n/ /;:b;s/("\S+) +/\1,/;tb;s/,\[/[/g;s/\],/]/g' file

You can use GoCSV's replace command to easily strip out newlines:

gocsv replace          \
  -c broken_column_var \
  -regex '\s+'         \
  -repl ' '            \
  input.csv

That normalizes all contiguous whitespace ( \s+ ) to a single space.
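
In principle the same idea can be pushed further to reach the comma-separated form shown earlier by chaining extra replace passes; a sketch, assuming gocsv reads piped input (its subcommands are designed to be chained):

gocsv replace -c broken_column_var -regex '\s+' -repl ' ' input.csv \
  | gocsv replace -c broken_column_var -regex ' ' -repl ',' \
  | gocsv replace -c broken_column_var -regex '\[,' -repl '[' \
  | gocsv replace -c broken_column_var -regex ',\]' -repl ']'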

A very small Python script can also handle this:

import csv
import re

ws_re = re.compile(r"\s+")  # any run of whitespace, including newlines

with open("input.csv", newline="") as f_in, open("output.csv", "w", newline="") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)

    writer.writerow(next(reader))  # transfer header

    for row in reader:
        row[5] = ws_re.sub(" ", row[5])  # collapse whitespace in broken_column_var
        writer.writerow(row)

By using the loadable csv module:

Pure bash, from How to parse a CSV file in Bash?, with a little modification checking for double-quote parity.

#!/bin/bash

enable -f /usr/lib/bash/csv csv          # load the csv builtin

exec {FD}< "$1"                          # open the input file on a new FD
read -ru $FD line                        # grab the header line
csv -a headline "$line"                  # split it into the headline array
printf -v fieldfmt '%-8s: "%%q"\\n' "${headline[@]}"

numcols=${#headline[@]}

while read -ru $FD line; do
    # keep appending physical lines while the quotes are unbalanced
    # (odd count) or the record has fewer fields than the header
    while chk=${line//[^\"]}             # chk = just the double quotes in line
          csv -a row -- "$line"          # parse what we have so far
          [[ -n ${chk//\"\"} ]] || (( ${#row[@]} < numcols )); do
        read -ru $FD sline || break 2    # EOF: leave both loops
        line+=$'\n'"$sline"              # re-embed the linefeed and retry
    done
    printf "$fieldfmt\\n" "${row[@]}"    # dump the record, one field per line
done
exec {FD}>&-                             # close the input FD
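
A usage sketch, assuming the script above is saved as parse.sh (a hypothetical name; the csv loadable is expected at /usr/lib/bash/csv, e.g. from Debian's bash-builtins package):

bash parse.sh broken_input.csv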

With your broken_input.csv, this shows:

SubID   : "000000000"
Date1   : "0000-00-00"
date2   : "0000-00-00"
var1    : "0"
var2    : "FIRST\ TEXT\ FOR\ ZERO"
broken_column_var: "$'[[ -0.00000000   0.00000000  -0.00000000  -0.00000000   0.00000000\n-0.00000000  -0.00000000   0.00000000   0.00000000   0.00000000\n0.00000000   0.00000000   0.00000000]\n[ -0.00000000  -0.0000000   -0.00000000  -0.00000000  -0.0000000\n-0.0000000   -0.0000000    0.00000000   0.00000000  -0.00000000\n-0.00000000   0.00000000   0.0000000 ]]'"

SubID   : "000000000"
Date1   : "1111-11-11"
date2   : "1111-11-11"
var1    : "1"
var2    : "SECOND\ TEXT\ FOR\ ZERO"
broken_column_var: "$'[[  1.11111111  -1.11111111  -1.1111111   -1.1111111    1.1111111\n1.11111111   1.11111111   1.11111111]]'"

SubID   : "000000000"
Date1   : "2222-22-22"
date2   : "2222-22-22"
var1    : "2"
var2    : "THIRD\ TEXT\ FOR\ ZERO"
broken_column_var: "$'[[-2.2222222   2.22222222 -2.22222222 -2.22222222  2.2222222  -2.22222222\n-2.22222222 -2.22222222 -2.22222222  2.22222222  2.22222222  2.22222222]\n[-2.22222222 -2.22222222  2.22222222  2.2222222   2.22222222 -2.22222222\n2.2222222  -2.2222222   2.22222222  2.2222222   2.222222   -2.22222222]\n[-2.22222222 -2.2222222   2.22222222  2.2222222   2.22222222 -2.22222222\n-2.22222222 -2.2222222  -2.22222222  2.22222222  2.2222222   2.22222222]\n[-2.22222222 -2.22222222  2.2222222   2.2222222   2.2222222  -2.22222222\n-2.222222   -2.2222222  -2.2222222  -2.22222222  2.22222222  2.2222222 ]\n[-2.22222222 -2.222222    2.22222222  2.22222222  2.22222222 -2.2222222\n-2.2222222  -2.2222222  -2.2222222  -2.22222222  2.22222222 -2.222222  ]\n[ 2.22222222 -2.22222222 -2.222222   -2.222222   -2.2222222  -2.22222222\n-2.222222   -2.22222222  2.2222222  -2.2222222   2.2222222   2.22222222]]'"

SubID   : "111111111"
Date1   : "0000-00-00"
date2   : "0000-00-00"
var1    : "00"
var2    : "FIRST\ TEXT\ FOR\ ONE"
broken_column_var: "$'[[ -0.00000000   0.00000000  -0.00000000   0.000000    -0.00000000\n-0.00000000   0.00000000   0.00000000]]'"

Your CSV seems not so broken after all!!

Note:

Using bash for processing a huge file (1.5 GB) is not recommended!! You may obtain better results using perl or python with appropriate libraries!
