
Sed, awk or perl to filter incomplete data column

I am new to natural language processing. I have a CSV file containing about one million rows. I want to filter out the rows whose third column does not contain any data. For example:

user1,user2, it really is  
user3,user4, oh nothin  
user5,user9, 
user7,user8,  
user9,user10,  
user11,user12, i know im in 
user13,user14, 
user15,user16, 
user17,user18, i think that might     
user19,user20, what u 
user21,user22, hmmm you never know 
user23,user24, nicee

Expected output

user1,user2, it really is 
user3,user4, oh nothin   
user11,user12, i know im in  
user17,user18, i think that might     
user19,user20, what u  
user21,user22, hmmm you never know  
user23,user24, nicee

I have tried

awk -F',+' 'NF == 3' file > file    

However, this does not work.

You can use this awk:

awk -F ',[[:blank:]]*' '$NF!=""' file
user1,user2, it really is
user3,user4, oh nothin
user11,user12, i know im in
user17,user18, i think that might
user19,user20, what u
user21,user22, hmmm you never know
user23,user24, nicee

$NF!="" is a condition that checks whether the last field (here the 3rd) is populated.

PS: You should not redirect the output back to the input file:

awk -F ',[[:blank:]]*' '$NF!=""' file > file

The input file and the redirected output file are the same, so the shell truncates it before awk reads it and you end up with a 0-byte file.

It is better to write to a second file and then rename it:

awk -F ',[[:blank:]]*' '$NF!=""' file > file.out && mv file.out file
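
The write-then-rename pattern above can be sketched with a uniquely named temporary file. This is only an illustration: data.csv and its two rows are made-up sample input, and mktemp is assumed to be available (it is on common Linux and macOS systems but is not required by POSIX).

```shell
# Hypothetical two-row input file for the demonstration.
printf '%s\n' 'user1,user2, it really is' 'user5,user9, ' > data.csv

# Filter into a temporary file; replace the original only if every
# previous step succeeded.
tmp=$(mktemp) &&
awk -F ',[[:blank:]]*' '$NF!=""' data.csv > "$tmp" &&
mv "$tmp" data.csv

cat data.csv
# prints: user1,user2, it really is
```

Note that mktemp creates the file with restrictive permissions, so the replaced file may end up with a different mode than the original.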

In Perl, this prints every line unless it ends with a comma followed by optional whitespace.

perl -ne'/,\s*$/ or print' file

output

user1,user2, it really is
user3,user4, oh nothin
user11,user12, i know im in
user17,user18, i think that might
user19,user20, what u
user21,user22, hmmm you never know
user23,user24, nicee

You don't say whether you're against using vim, but you can load your file in vim and do:

:g/,\s\+$/d

:g is vim's global command; it runs a command on every line of the file that matches a pattern.

Syntax is :g/pattern/command

The regex pattern sits between the forward slashes. Here it matches a comma followed by one or more whitespace characters (\s\+) up to the end of the line ($). Use \s* instead of \s\+ if some lines end directly after the comma.

The command 'd' means "delete the line" when the regex matches.

Finally:

:wq

Writes the file (w) and quits (q).
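
The same delete-on-match idea can be run non-interactively with sed, which the question title mentions. This is a sketch rather than part of the original answer; rows.csv is a made-up sample file, and the pattern uses [[:space:]]* so that lines ending directly after the comma are deleted too.

```shell
# Hypothetical sample file: one populated row, two empty third columns.
printf '%s\n' 'user3,user4, oh nothin' 'user13,user14, ' 'user15,user16,' > rows.csv

# Delete every line that ends with a comma plus optional whitespace.
sed '/,[[:space:]]*$/d' rows.csv
# prints: user3,user4, oh nothin
```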

Every line of your input has 3 fields (since there are always 2 commas), so NF is always 3. You want to test whether the contents of $NF are empty, not whether the value of NF is 3. Also, NEVER do cmd file > file for any command: the shell sets up the > file redirection, truncating the file, before cmd gets to read it, so your input is zapped.

You need:

awk -F', *' '$NF!=""' file > tmp && mv tmp file
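
To see the difference between testing NF and testing $NF, here is a small sketch on made-up data (the file name sample.csv and its rows are assumptions for illustration):

```shell
# Hypothetical sample: one populated row and two rows with an empty third column.
printf '%s\n' 'user1,user2, it really is' 'user5,user9, ' 'user7,user8,' > sample.csv

# NF is 3 on every row, so a test like 'NF == 3' filters nothing.
awk -F',' '{ print NF }' sample.csv
# prints 3 three times

# Testing the contents of the last field drops the empty rows.
awk -F', *' '$NF != ""' sample.csv
# prints: user1,user2, it really is
```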

This problem/example has absolutely nothing to do with natural language processing, btw.

This is not so elegant, but it is perhaps clearer, and the field number is easier to change:

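#!/usr/bin/perl
use strict;
use warnings;

# Open the file named by the first command-line argument.
open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
while (my $row = <$in>) {
    # A limit of -1 keeps trailing empty fields; fields are numbered from 0.
    my @line = split /,/, $row, -1;
    # Print the row only if field #2 contains a non-whitespace character.
    print $row if defined $line[2] && $line[2] =~ /\S/;
}
close $in;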

$ARGV[0] is the name of the file with your table; \S matches any non-whitespace character, and it is tested against field #2 (fields are numbered from 0).
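
The "easy to change the field number" idea can also be sketched in awk by passing the column as a variable. The variable name col and the file fields.csv are assumptions for illustration; note that awk numbers fields from 1, not 0.

```shell
# Hypothetical sample: the third row has an empty second column.
printf '%s\n' 'user1,user2, it really is' 'user5,user9, ' 'user7,,text' > fields.csv

# Keep rows whose chosen column holds at least one non-whitespace character.
awk -F',' -v col=3 '$col ~ /[^[:space:]]/' fields.csv
# prints:
# user1,user2, it really is
# user7,,text

awk -F',' -v col=2 '$col ~ /[^[:space:]]/' fields.csv
# prints:
# user1,user2, it really is
# user5,user9, 
```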

Here is a Perl answer which I purposefully chose to demonstrate the usage of the -a autosplit and the -F field delimiter options:

perl -anF, -e 'print if $F[2] =~ /\S/' file > file.out

But I would probably prefer grep in this particular case (\s is not part of POSIX regular expressions; [[:space:]] is the portable spelling):

grep -E -v ',\s*$' file > file.out
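
With a million rows it can be worth checking that nothing was lost. A minimal sketch, assuming a made-up file input.csv: count the kept and dropped lines with grep -c and compare them with the total.

```shell
# Hypothetical sample: two populated rows, two rows with an empty third column.
printf '%s\n' 'user1,user2, it really is' 'user5,user9, ' 'user7,user8,' 'user23,user24, nicee' > input.csv

kept=$(grep -c -v ',[[:space:]]*$' input.csv)     # rows that survive the filter
dropped=$(grep -c ',[[:space:]]*$' input.csv)     # rows the filter removes
total=$(awk 'END { print NR }' input.csv)         # all rows

echo "kept=$kept dropped=$dropped total=$total"
# prints: kept=2 dropped=2 total=4
```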

