I'm new to natural language processing. I have a CSV file containing about one million rows. I want to filter out the rows whose third field does not contain any data. For example:
user1,user2, it really is
user3,user4, oh nothin
user5,user9,
user7,user8,
user9,user10,
user11,user12, i know im in
user13,user14,
user15,user16,
user17,user18, i think that might
user19,user20, what u
user21,user22, hmmm you never know
user23,user24, nicee
Expected output
user1,user2, it really is
user3,user4, oh nothin
user11,user12, i know im in
user17,user18, i think that might
user19,user20, what u
user21,user22, hmmm you never know
user23,user24, nicee
I have tried
awk -F',+' 'NF == 3' file > file
However, this does not work.
You can use this awk:
awk -F ',[[:blank:]]*' '$NF!=""' file
user1,user2, it really is
user3,user4, oh nothin
user11,user12, i know im in
user17,user18, i think that might
user19,user20, what u
user21,user22, hmmm you never know
user23,user24, nicee
$NF!=""
is a condition that checks whether the 3rd (last) field is populated.
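As a quick illustration (a minimal sketch using two sample lines from the question), you can print NF and the last field to see what this field separator does:

```shell
# Split on a comma plus any following blanks; the last field ($NF) of a
# line that ends in a bare comma comes out empty.
printf 'user5,user9,\nuser1,user2, it really is\n' |
  awk -F ',[[:blank:]]*' '{ print NF, "[" $NF "]" }'
# prints:
# 3 []
# 3 [it really is]
```

Both lines have NF == 3, which is why testing NF == 3 keeps everything; testing $NF is what tells them apart.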
PS: You cannot really do:
awk -F ',[[:blank:]]*' '$NF!=""' file > file
As the input file and the redirected file are the same, you will end up with a 0-byte file.
Better to do:
awk -F ',[[:blank:]]*' '$NF!=""' file > file.out && mv file.out file
In Perl, this prints a line unless it ends with a comma followed by optional whitespace.
perl -ne'/,\s*$/ or print' file
output
user1,user2, it really is
user3,user4, oh nothin
user11,user12, i know im in
user17,user18, i think that might
user19,user20, what u
user21,user22, hmmm you never know
user23,user24, nicee
You don't say whether you're against using vim, but you can load your file in vim and do:
:g/,\s*$/d
:g is vim's global command (it operates on the entire file)
The syntax is :g/pattern/command
What is between the forward slashes is the regex pattern. Here we look for a comma, followed by any amount of whitespace (\s*), until we hit the end of the line ($).
The command d means "delete the line" when the regex matches.
Finally:
:wq
Writes the file (w) and quits (q).
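If you would rather not open an editor at all, the same delete-matching-lines idea can be sketched with sed (GNU sed's -i assumed here; BSD sed needs -i ''):

```shell
# Delete every line that ends with a comma plus optional whitespace,
# editing the file in place (GNU sed syntax).
sed -i '/,[[:space:]]*$/d' file
```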
Every line of your input has 3 fields (since there are always 2 commas), so NF is always 3. You want to test whether the contents of $NF are empty, not whether NF is 3. Also, NEVER do cmd file > file for any command, as the shell could do the > file part before the cmd file part and so zap your input file before it's been read by cmd.
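You can see the truncation for yourself with a throwaway file (a minimal sketch; the temp file is arbitrary):

```shell
# The shell opens (and truncates) the redirection target before awk
# ever reads it, so the data is destroyed.
tmpfile=$(mktemp)
printf 'user1,user2, hi\nuser3,user4,\n' > "$tmpfile"
awk -F',' '$NF != ""' "$tmpfile" > "$tmpfile"
wc -c < "$tmpfile"   # reports 0 bytes: the file was emptied
rm -f "$tmpfile"
```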
You need:
awk -F', *' '$NF!=""' file > tmp && mv tmp file
This problem/example has absolutely nothing to do with natural language processing, btw.
This is not so elegant, but it is perhaps clearer, and it is easy to change the field number:
#!/usr/bin/perl
open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
while (<$in>) {
    my @line = split /,/, $_;
    # Fields are numbered from 0, so $line[2] is the third field.
    print if $line[2] =~ /\S/;
}
$ARGV[0] is the name of the file with your table; \S matches any non-blank character in field #2, which is the third field (fields are numbered from 0).
Here is a Perl answer which I purposefully chose to demonstrate the usage of the -a autosplit and the -F field delimiter options:
perl -anF, -e 'print if $F[2] =~ /\S/' file > file.out
But I would probably prefer grep in this particular case:
grep -E -v ',\s*$' file > file.out
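As a quick sanity check of the grep variant on two sample lines (note that \s in ERE is a GNU grep extension; [[:space:]] is the portable spelling used below):

```shell
# -v inverts the match: keep only lines that do NOT end with a comma
# plus optional trailing whitespace.
printf 'user1,user2, it really is\nuser5,user9,\n' |
  grep -E -v ',[[:space:]]*$'
# prints: user1,user2, it really is
```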