
Parsing a large CSV file with unusual characters, spacing, brackets, and irregular returns in bash

I have a very large (1.5 GB) malformed CSV file that I need to read into R; while the file itself is a CSV, the records break across multiple lines due to poorly placed line returns.

I have a reduced example attached, and a truncated visual representation of it looks like this:

SubID,Date1,date2,var1,var2,broken_column_var
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[ -0.00000000   0.00000000  -0.00000000  -0.00000000   0.00000000
   -0.00000000  -0.00000000   0.00000000   0.00000000   0.00000000
    0.00000000   0.00000000   0.00000000]
 [ -0.00000000  -0.0000000   -0.00000000  -0.00000000  -0.0000000
   -0.0000000   -0.0000000    0.00000000   0.00000000  -0.00000000
   -0.00000000   0.00000000   0.0000000 ]]"
000000000,1111-11-11,1111-11-11,1,SECOND TEXT FOR ZERO,"[[  1.11111111  -1.11111111  -1.1111111   -1.1111111    1.1111111
    1.11111111   1.11111111   1.11111111]]"
000000000,2222-22-22,2222-22-22,2,THIRD TEXT FOR ZERO,"[[-2.2222222   2.22222222 -2.22222222 -2.22222222  2.2222222  -2.22222222
  -2.22222222 -2.22222222 -2.22222222  2.22222222  2.22222222  2.22222222]
 [-2.22222222 -2.22222222  2.22222222  2.2222222   2.22222222 -2.22222222
   2.2222222  -2.2222222   2.22222222  2.2222222   2.222222   -2.22222222]
 [-2.22222222 -2.2222222   2.22222222  2.2222222   2.22222222 -2.22222222
  -2.22222222 -2.2222222  -2.22222222  2.22222222  2.2222222   2.22222222]
 [-2.22222222 -2.22222222  2.2222222   2.2222222   2.2222222  -2.22222222
  -2.222222   -2.2222222  -2.2222222  -2.22222222  2.22222222  2.2222222 ]
 [-2.22222222 -2.222222    2.22222222  2.22222222  2.22222222 -2.2222222
  -2.2222222  -2.2222222  -2.2222222  -2.22222222  2.22222222 -2.222222  ]
 [ 2.22222222 -2.22222222 -2.222222   -2.222222   -2.2222222  -2.22222222
  -2.222222   -2.22222222  2.2222222  -2.2222222   2.2222222   2.22222222]]"
111111111,0000-00-00,0000-00-00,00,FIRST TEXT FOR ONE,"[[ -0.00000000   0.00000000  -0.00000000   0.000000    -0.00000000
   -0.00000000   0.00000000   0.00000000]]"

The line breaks are all plain \n characters in the CSV.

To avoid loading the whole file into memory and parsing it as a dataframe in another environment, I have been trying to print the relevant snippets from the CSV to the terminal with the line returns removed, runs of blank spaces collapsed, and commas inserted between the values.

Like the following:

000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000],[-0.00000000,-0.0000000,-0.00000000,-0.00000000,-0.0000000,-0.0000000,-0.0000000,0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.0000000]]"

My main attempt, which pulls everything between the opening and closing quoted brackets onto one line, is:

awk '/\"\[\[/{found=1} found{print; if (/]]"/) exit}'  Malformed_csv_Abridged.csv | tr -d '\n\r' | tr -s ' ' | tr ' ' ','

outputting:

000000000,0000-00-00,0000-00-00,0,FIRST,TEXT,FOR,ZERO,"[[,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000],[,-0.00000000,-0.0000000,-0.00000000,-0.00000000,-0.0000000,-0.0000000,-0.0000000,0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.0000000,]]"

This gets close, but:

  1. It only prints the first instance, so I need a way to find the other instances.
  2. It also replaces the spaces in the text before the "[[...]]" block I'm searching for (producing FIRST,TEXT,FOR,ZERO), which I don't need it to do.
  3. It leaves some extra commas next to the brackets that I haven't quite found the right tr call to remove, due to the necessary escape characters.
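
For reference, all three points can be addressed at once by splitting the input on the double quote itself, so the substitutions never touch the text outside the quotes. A sketch, assuming GNU awk (it relies on gawk's RT variable to re-emit the quotes):

awk -v RS='"' '
  NR % 2 == 0 {                      # even-numbered records sit inside the quotes
      gsub(/[[:space:]]+/, ",")      # newlines and space runs -> single commas
      gsub(/\[,/, "[")               # drop the comma introduced after [
      gsub(/,\]/, "]")               # drop the comma introduced before ]
  }
  { printf "%s%s", $0, RT }          # re-emit the record with its trailing quote
' Malformed_csv_Abridged.csv

Because only the even-numbered records are edited, the prefix text keeps its spaces and every "[[...]]" instance is processed, not just the first.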

I didn't understand your goal. The CSV file looks to me like a correct CSV file. If you just want to remove the line breaks, you can use Miller and its clean-whitespace verb:

mlr --csv clean-whitespace Malformed.csv >Malformed_c.csv

to get this: https://gist.githubusercontent.com/aborruso/538e964c0c84a8b27d4c3d3b61d23bb4/raw/1fa83f43238be4a6aeb9c743aaf2e4da36f6cc74/Malformed_c.csv


Assumptions:

  • the only field that contains double quotes is the last field (broken_column_var)
  • within the last field we do not have to worry about embedded/escaped double quotes (i.e., for each data line the last field has exactly two double quotes)
  • every broken_column_var value contains at least one embedded linefeed (i.e., each value spans at least two physical lines); otherwise we would need some extra code for the case where both double quotes sit on the same line. That is doable, and a sketch covering it appears after the generated output below.

One (verbose) awk approach to removing the embedded linefeeds from broken_column_var while also replacing spaces with commas:

awk '
NR==1              { print; next }                      # print header
!in_merge && /["]/ { split($0,a,"\"")                   # 1st double quote found; split line on double quote
                     head     = a[1]                    # save 1st part of line
                     data     = "\"" a[2]               # save double quote and 2nd part of line
                     in_merge = 1                       # set flag
                     next
                   }
 in_merge          { data = data " " $0                 # append current line to "data"
                     if ( $0 ~ /["]/ ) {                # if 2nd double quote found => process "data"
                        gsub(/[ ]+/,",",data)           # replace consecutive spaces with single comma
                        gsub(/,[]]/,"]",data)           # replace ",]" with "]"
                        gsub(/[[],/,"[",data)           # replace "[," with "["
                        print head data                 # print new line
                        in_merge = 0                    # clear flag
                     }
                   }
' Malformed.csv

This generates:

SubID,Date1,date2,var1,var2,broken_column_var
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000,0.00000000],[-0.00000000,-0.0000000,-0.00000000,-0.00000000,-0.0000000,-0.0000000,-0.0000000,0.00000000,0.00000000,-0.00000000,-0.00000000,0.00000000,0.0000000]]"
000000000,1111-11-11,1111-11-11,1,SECOND TEXT FOR ZERO,"[[1.11111111,-1.11111111,-1.1111111,-1.1111111,1.1111111,1.11111111,1.11111111,1.11111111]]"
000000000,2222-22-22,2222-22-22,2,THIRD TEXT FOR ZERO,"[[-2.2222222,2.22222222,-2.22222222,-2.22222222,2.2222222,-2.22222222,-2.22222222,-2.22222222,-2.22222222,2.22222222,2.22222222,2.22222222],[-2.22222222,-2.22222222,2.22222222,2.2222222,2.22222222,-2.22222222,2.2222222,-2.2222222,2.22222222,2.2222222,2.222222,-2.22222222],[-2.22222222,-2.2222222,2.22222222,2.2222222,2.22222222,-2.22222222,-2.22222222,-2.2222222,-2.22222222,2.22222222,2.2222222,2.22222222],[-2.22222222,-2.22222222,2.2222222,2.2222222,2.2222222,-2.22222222,-2.222222,-2.2222222,-2.2222222,-2.22222222,2.22222222,2.2222222],[-2.22222222,-2.222222,2.22222222,2.22222222,2.22222222,-2.2222222,-2.2222222,-2.2222222,-2.2222222,-2.22222222,2.22222222,-2.222222],[2.22222222,-2.22222222,-2.222222,-2.222222,-2.2222222,-2.22222222,-2.222222,-2.22222222,2.2222222,-2.2222222,2.2222222,2.22222222]]"
111111111,0000-00-00,0000-00-00,00,FIRST TEXT FOR ONE,"[[-0.00000000,0.00000000,-0.00000000,0.000000,-0.00000000,-0.00000000,0.00000000,0.00000000]]"
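
For completeness, the single-line case excluded by the third assumption could be covered by one extra rule placed before the existing "1st double quote found" block. A hedged sketch reusing the same substitutions (assuming, as above, exactly two double quotes per record):

!in_merge && /".*"/ { split($0,a,"\"")              # both quotes on one physical line
                      data = "\"" a[2] "\"" a[3]    # a[1]=prefix, a[2]=field body, a[3]=trailer (usually empty)
                      gsub(/[ ]+/,",",data)         # spaces -> commas (no linefeeds to remove here)
                      gsub(/,[]]/,"]",data)
                      gsub(/[[],/,"[",data)
                      print a[1] data
                      next
                    }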

Use the double quote as the field separator. A complete record contains an even number of double quotes, so it splits into an odd number of fields: 1 for the header line, 3 for the data lines.

awk '
  BEGIN {FS = OFS = "\""}
  {$0 = prev $0; $1 = $1}                 # prepend any buffered partial record
  NF % 2 == 1 {print; prev = ""; next}    # odd NF: all quotes paired, record complete
  {prev = $0}                             # even NF: a quote is still open, keep buffering
  END {if (prev) print prev}
' file.csv

Output:
SubID,Date1,date2,var1,var2,broken_column_var
000000000,0000-00-00,0000-00-00,0,FIRST TEXT FOR ZERO,"[[ -0.00000000   0.00000000  -0.00000000  -0.00000000   0.00000000   -0.00000000  -0.00000000   0.00000000   0.00000000   0.00000000    0.00000000   0.00000000   0.00000000] [ -0.00000000  -0.0000000   -0.00000000  -0.00000000  -0.0000000   -0.0000000   -0.0000000    0.00000000   0.00000000  -0.00000000   -0.00000000   0.00000000   0.0000000 ]]"
000000000,1111-11-11,1111-11-11,1,SECOND TEXT FOR ZERO,"[[  1.11111111  -1.11111111  -1.1111111   -1.1111111    1.1111111    1.11111111   1.11111111   1.11111111]]"
000000000,2222-22-22,2222-22-22,2,THIRD TEXT FOR ZERO,"[[-2.2222222   2.22222222 -2.22222222 -2.22222222  2.2222222  -2.22222222  -2.22222222 -2.22222222 -2.22222222  2.22222222  2.22222222  2.22222222] [-2.22222222 -2.22222222  2.22222222  2.2222222   2.22222222 -2.22222222   2.2222222  -2.2222222   2.22222222  2.2222222   2.222222   -2.22222222] [-2.22222222 -2.2222222   2.22222222  2.2222222   2.22222222 -2.22222222  -2.22222222 -2.2222222  -2.22222222  2.22222222  2.2222222   2.22222222] [-2.22222222 -2.22222222  2.2222222   2.2222222   2.2222222  -2.22222222  -2.222222   -2.2222222  -2.2222222  -2.22222222  2.22222222  2.2222222 ] [-2.22222222 -2.222222    2.22222222  2.22222222  2.22222222 -2.2222222  -2.2222222  -2.2222222  -2.2222222  -2.22222222  2.22222222 -2.222222  ] [ 2.22222222 -2.22222222 -2.222222   -2.222222   -2.2222222  -2.22222222  -2.222222   -2.22222222  2.2222222  -2.2222222   2.2222222   2.22222222]]"
111111111,0000-00-00,0000-00-00,00,FIRST TEXT FOR ONE,"[[ -0.00000000   0.00000000  -0.00000000   0.000000    -0.00000000   -0.00000000   0.00000000   0.00000000]]"
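
If the comma-separated form shown in the question is also wanted, a sketch extending the same program: with FS set to the double quote, the even-numbered fields are exactly the quoted ones, so the substitutions from the earlier awk answer can be applied to those fields only:

awk '
  BEGIN {FS = OFS = "\""}
  {$0 = prev $0; $1 = $1}
  NF % 2 == 1 {
      for (i = 2; i <= NF; i += 2) {   # even-numbered fields sit inside quotes
          gsub(/ +/, ",", $i)          # space runs -> single commas
          gsub(/\[,/, "[", $i)         # tidy the brackets
          gsub(/,\]/, "]", $i)
      }
      print; prev = ""; next
  }
  {prev = $0}
  END {if (prev) print prev}
' file.csv

Assigning to the fields makes awk rebuild the record with OFS, so the surrounding quotes are preserved on output.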

For a language with a CSV library, I've found Perl's Text::CSV useful for handling quoted newlines:

perl -e '
  use strict;
  use Text::CSV;
  my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
  open my $fh, "<:encoding(utf8)", "file.csv" or die "file.csv: $!";
  while (my $row = $csv->getline($fh)) {
    $row->[-1] =~ s/\n//g;        # strip embedded newlines from the last field
    $csv->say(*STDOUT, $row);
  }
'

This might work for you (GNU sed):

sed -E '1b
        :a;N;/"$/!ba
        s/"/\n&/
        h
        s/\n/ /2g
        s/.*\n//
        s/ +/,/g
        s/,\]/]/g
        s/\[,/[/g
        H
        g
        s/\n.*\n//' file

Forget the header line.

Gather up each record.

Introduce a newline before the last field.

Make a copy of the ameliorated record.

Replace all newlines, from the second onward, with spaces.

Remove everything up to and including the introduced newline.

Replace spaces by commas.

Remove any introduced commas after or before square brackets.

Append the last field to the copy.

Make the copy current.

Remove everything between (and including) the introduced newlines.

NB: This expects only the last field of each record to be double quoted.


Alternative, doing the same in a single line: slurp each record, flatten its newlines to spaces, repeatedly replace the space runs after the opening quote with commas, then tidy the brackets:

sed -E '1b;:a;N;/"$/!ba;y/\n/ /;:b;s/("\S+) +/\1,/;tb;s/\[,/[/g;s/,\]/]/g' file

You can use GoCSV's replace command to easily strip out newlines:

gocsv replace          \
  -c broken_column_var \
  -regex '\s+'         \
  -repl ' '            \
  input.csv

That normalizes all contiguous whitespace (\s+), including the embedded newlines, to a single space.

A very small Python script can also handle this:

import csv
import re

ws_re = re.compile(r"\s+")  # any whitespace run, including newlines

with open("input.csv", newline="") as f_in, \
     open("output.csv", "w", newline="") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)

    writer.writerow(next(reader))  # transfer header

    for row in reader:
        row[5] = ws_re.sub(" ", row[5])  # collapse whitespace in broken_column_var
        writer.writerow(row)

By using the loadable csv module: pure bash, from How to parse a CSV file in Bash?, with a little modification that checks for double-quote parity.

#!/bin/bash

enable -f /usr/lib/bash/csv csv                 # load the loadable csv builtin

exec {FD}< "$1"                                 # open the input file on a fresh fd
read -ru $FD line
csv -a headline "$line"                         # parse the header into an array
printf -v fieldfmt '%-8s: "%%q"\\n' "${headline[@]}"

numcols=${#headline[@]}

while read -ru $FD line; do
    # keep appending physical lines while the record has an odd number of
    # double quotes or fewer fields than the header
    while chk=${line//[^\"]}                    # chk holds only the quotes from line
    csv -a row -- "$line"
    [[ -n ${chk//\"\"} ]] || (( ${#row[@]} < numcols )); do
        read -ru $FD sline || break 2
        line+=$'\n'"$sline"
    done
    printf "$fieldfmt\\n" "${row[@]}"
done
exec {FD}>&-                                    # close the fd

With your broken_input.csv, this shows:

SubID   : "000000000"
Date1   : "0000-00-00"
date2   : "0000-00-00"
var1    : "0"
var2    : "FIRST\ TEXT\ FOR\ ZERO"
broken_column_var: "$'[[ -0.00000000   0.00000000  -0.00000000  -0.00000000   0.00000000\n-0.00000000  -0.00000000   0.00000000   0.00000000   0.00000000\n0.00000000   0.00000000   0.00000000]\n[ -0.00000000  -0.0000000   -0.00000000  -0.00000000  -0.0000000\n-0.0000000   -0.0000000    0.00000000   0.00000000  -0.00000000\n-0.00000000   0.00000000   0.0000000 ]]'"

SubID   : "000000000"
Date1   : "1111-11-11"
date2   : "1111-11-11"
var1    : "1"
var2    : "SECOND\ TEXT\ FOR\ ZERO"
broken_column_var: "$'[[  1.11111111  -1.11111111  -1.1111111   -1.1111111    1.1111111\n1.11111111   1.11111111   1.11111111]]'"

SubID   : "000000000"
Date1   : "2222-22-22"
date2   : "2222-22-22"
var1    : "2"
var2    : "THIRD\ TEXT\ FOR\ ZERO"
broken_column_var: "$'[[-2.2222222   2.22222222 -2.22222222 -2.22222222  2.2222222  -2.22222222\n-2.22222222 -2.22222222 -2.22222222  2.22222222  2.22222222  2.22222222]\n[-2.22222222 -2.22222222  2.22222222  2.2222222   2.22222222 -2.22222222\n2.2222222  -2.2222222   2.22222222  2.2222222   2.222222   -2.22222222]\n[-2.22222222 -2.2222222   2.22222222  2.2222222   2.22222222 -2.22222222\n-2.22222222 -2.2222222  -2.22222222  2.22222222  2.2222222   2.22222222]\n[-2.22222222 -2.22222222  2.2222222   2.2222222   2.2222222  -2.22222222\n-2.222222   -2.2222222  -2.2222222  -2.22222222  2.22222222  2.2222222 ]\n[-2.22222222 -2.222222    2.22222222  2.22222222  2.22222222 -2.2222222\n-2.2222222  -2.2222222  -2.2222222  -2.22222222  2.22222222 -2.222222  ]\n[ 2.22222222 -2.22222222 -2.222222   -2.222222   -2.2222222  -2.22222222\n-2.222222   -2.22222222  2.2222222  -2.2222222   2.2222222   2.22222222]]'"

SubID   : "111111111"
Date1   : "0000-00-00"
date2   : "0000-00-00"
var1    : "00"
var2    : "FIRST\ TEXT\ FOR\ ONE"
broken_column_var: "$'[[ -0.00000000   0.00000000  -0.00000000   0.000000    -0.00000000\n-0.00000000   0.00000000   0.00000000]]'"

Your CSV is not so broken after all!

Note: using bash to process a huge (1.5 GB) file is not recommended! You may obtain better results using another language with appropriate CSV libraries!
