简体   繁体   中英

Bash - Removing empty columns from .csv file

I have a large .csv file in which I have to remove columns which are empty. By empty, I mean that they have a header, but the rest of the column contains no data.

I've written a Bash script to try and do this, but am running into a few issues. Here's the code:

#!/bin/bash

total="$(head -n 1 Reddit-cleaner.csv | grep -o ',' | wc -l)"
i=1
count=0
while [ $i -le $total ]; do
        cat Reddit-cleaner.csv | cut -d "," -f$i | while read CMD; do if [ -n CMD ]; then count=$count+1; fi; done
        if [ $count -eq 1 ]; then
                cut -d "," -f$i --complement <Reddit-cleaner.csv >Reddit-cleanerer.csv
        fi
        count=0
        i=$i+1
done

Firstly I find the number of columns, and store it in total. Then while the program has not reached the last column, I loop through the columns individually. The nested while loop checks if each row in the column is empty, and if there is more than one row that is not empty, it writes all other columns to another file.

I recognise that there are a few problems with this script. Firstly, the count modification occurs in a subshell, so count is never modified in the parent shell. Secondly, the file I am writing to will be overwritten every time the script finds an empty column.

So my question then is how can I fix this. I initially wanted to have it so that it wrote to a new file column by column, based on count, but couldn't figure out how to get that done either.

Edit: People have asked for a sample input and output.

Sample input:
User, Date, Email, Administrator, Posts, Comments
a, 20201719, a@a.com, Yes, , 3
b, 20182817, b@b.com, No, , 4
c, 20191618, , No, , 4
d, 20190126, , No, , 2

Sample output:
User, Data, Email, Administrator, Comments
a, 20201719, a@a.com, Yes, 3
b, 20182817, b@b.com, No, 4
c, 20191618, , No, 4
d, 20190126, , No, 2

In the sample output, the column which has no data in it except for the header (Posts) has been removed, while the columns which are either entirely or partially filled remain.

I may be misinterpreting the question (due to its lack of example input and expected output), but this should be as simple as:

$ x="1,2,3,,4,field 5,,,six,7"
$ echo "${x//,+(,)/,}"
1,2,3,4,field 5,six,7

This requires bash with extglob enabled. Otherwise, you can use an external call to sed :

$ echo "1,2,3,,4,field 5,,,six,7" |sed 's/,,,*/,/g'
1,2,3,4,field 5,six,7

There's a lot of redundancy in your sample code. You should really consider awk since it already tracks the current field count (as NF ) and the number of lines (as NR ), so you could add that up with a simple total+=NF on each line. With the empty fields collapsed, awk can just run on the field number you want.

$ echo "1,2,3,,4,field 5,,,six,7" |awk -F ',+' '
  { printf "line %d has %d fields, the 6th of which is <%s>\n", NR, NF, $6 }'
line 1 has 7 fields, the 6th of which is <six>

This uses printf to denote the number of records ( NR , the current line number), the number of fields ( NF ) and the value of the sixth field ( $6 , can also be as a variable, eg $NF is the value of the final field since awk is one-indexed).

It is actually job of a CSV parser but you may use this awk script to get the job done:

cat removeEmptyCellsCsv.awk

BEGIN {
   FS = OFS = ", "
}
NR == 1 {
   for (i=1; i<=NF; i++)
      e[i] = 1  # initially all cols are marked empty
   next
}
FNR == NR {
   for (i=1; i<=NF; i++)
      e[i] = e[i] && ($i == "")
   next
}
{
   s = ""
   for (i=1; i<=NF; i++)
      s = s (i==1 || e[i-1] ? "" : OFS) (e[i] ? "" : $i)
   print s
}

Then run it as:

awk -f removeEmptyCellsCsv.awk file.csv{,}

Using sample data provided in question, it will produce following output:

1, User, Date, Email, Administrator, Comments
2, a, 20201719, a@a.com, Yes, 3
3, b, 20182817, b@b.com, No, 4
4, c, 20191618, , No, 4
5, d, 20190126, , No, 2

Note that Posts columns has been removed because it is empty in every record.

$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    if ( NR > 1 ) {
        for (i=1; i<=NF; i++) {
            if ( $i ~ /[^[:space:]]/ ) {
                gotValues[i]
            }
        }
    }
    next
}
{
    c=0
    for (i=1; i<=NF; i++) {
        if (i in gotValues) {
            printf "%s%s", (c++ ? OFS : ""), $i
        }
    }
    print ""
}

$ awk -f tst.awk file file
User, Date, Email, Administrator, Comments
a, 20201719, a@a.com, Yes, 3
b, 20182817, b@b.com, No, 4
c, 20191618, , No, 4
d, 20190126, , No, 2

See also What's the most robust way to efficiently parse CSV using awk? if you need to work with any more complicated CSVs than the one in your question.

You can use Miller ( https://github.com/johnkerl/miller ) and its remove-empty-columns verb.

Starting from

+------+----------+---------+---------------+-------+----------+
| User | Date     | Email   | Administrator | Posts | Comments |
+------+----------+---------+---------------+-------+----------+
| a    | 20201719 | a@a.com | Yes           | -     | 3        |
| b    | 20182817 | b@b.com | No            | -     | 4        |
| c    | 20191618 | -       | No            | -     | 4        |
| d    | 20190126 | -       | No            | -     | 2        |
+------+----------+---------+---------------+-------+----------+

and running

mlr --csv remove-empty-columns input.csv >output.csv

you will have

+------+----------+---------+---------------+----------+
| User | Date     | Email   | Administrator | Comments |
+------+----------+---------+---------------+----------+
| a    | 20201719 | a@a.com | Yes           | 3        |
| b    | 20182817 | b@b.com | No            | 4        |
| c    | 20191618 | -       | No            | 4        |
| d    | 20190126 | -       | No            | 2        |
+------+----------+---------+---------------+----------+

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM