
How to use grep or awk to process a specific column (with keywords from a text file)

I've tried many combinations of grep and awk commands to process text from a file.

It is a list of customers, one record per line, in this format:

John,Mills,81,Crescent,New York,NY,john@mills.com,19/02/1954

I am trying to separate these records into two categories, male and female.

I have a list of some 5000 female names, all in plain text, all in one file.

How can I "grep" the first column (since I am only matching first names) while still printing the entire customer record?

I found it easy to "cut" the first column and run grep --file=female.names.txt on it, but that way the entire record is no longer printed.

I am aware of the awk option, but in that case I don't know how to read the female names from a file.

awk -F ',' ' { if($1==" ???Filename??? ") print $0} '

Many thanks!

You can do this with Awk:

awk -F, 'NR==FNR{a[$0]; next} ($1 in a)' female.names.txt file.csv 

This prints the lines of your CSV file whose first field exactly matches a name listed in female.names.txt.

awk -F, 'NR==FNR{a[$0]; next} !($1 in a)' female.names.txt file.csv 

This prints the remaining lines, those whose first field is not listed in female.names.txt.

This assumes your female.names.txt file contains one name per line, something like:

Heather
Irene
Jane
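Building on the two commands above, here is a minimal sketch that splits the records into two files in a single pass. The sample records, the temporary directory, and the output names women.csv and men.csv are assumptions made for illustration, not part of the original question.

```shell
# Hypothetical sample data; all file names here are placeholders.
workdir=$(mktemp -d)
printf '%s\n' Heather Irene Jane > "$workdir/female.names.txt"
cat > "$workdir/file.csv" <<'EOF'
John,Mills,81,Crescent,New York,NY,john@example.com,19/02/1954
Jane,Doe,12,High Street,Boston,MA,jane@example.com,03/07/1961
Janet,Roe,5,Low Road,Austin,TX,janet@example.com,11/11/1970
EOF

# First file (NR == FNR): remember every name as an array key.
# Second file: route each record by whether field 1 is a known name.
awk -F, -v f="$workdir/women.csv" -v m="$workdir/men.csv" '
    NR == FNR { names[$0]; next }
    { print > (($1 in names) ? f : m) }
' "$workdir/female.names.txt" "$workdir/file.csv"

women=$(cat "$workdir/women.csv")
men=$(cat "$workdir/men.csv")
printf 'women:\n%s\nmen:\n%s\n' "$women" "$men"
```

Because the lookup is an exact match on the whole first field, Janet is not mistaken for Jane. The comparison is case-sensitive, and trailing spaces or Windows line endings in the names file would prevent a match.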

Another alternative is Perl, which can be useful if you're not very familiar with awk.

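#!/usr/bin/perl -anF,
# -n loops over the input lines, -a splits each line into @F, -F, splits on commas
use strict;
our %names;

# Runs before the main loop: read the names file given on the command line
BEGIN {
    while (<ARGV>) {
        chomp;
        $names{$_} = 1;
    }
}

# Main loop (records arrive on standard input): print if field 1 is a listed name
print if $names{$F[0]};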

To run (assuming you named this file filter.pl):

perl filter.pl female.names.txt < records.txt

Try this:

grep --file=<(sed 's/.*/^&,/' female.names.txt) datafile.csv

This turns every name in the list of female names into the regular expression ^name, so it only matches at the beginning of a line and only when followed by a comma. It then uses process substitution (a feature of shells such as bash, zsh, and ksh) to pass the result to grep as the pattern file to match against the data file.
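To see why the anchoring matters, here is a small sketch comparing anchored and unanchored patterns. It writes the patterns to a temporary file instead of using process substitution, so it should also work in a plain POSIX shell; the shortened sample records and file names are assumptions made for illustration.

```shell
# Hypothetical sample data with shortened records.
workdir=$(mktemp -d)
printf '%s\n' Ann Jane > "$workdir/female.names.txt"
cat > "$workdir/datafile.csv" <<'EOF'
Ann,Lee,ann@example.com
Anna,Kim,anna@example.com
Bob,Annan,bob@example.com
John,Mills,john@example.com
EOF

# Turn each name into an anchored pattern such as ^Ann,
sed 's/.*/^&,/' "$workdir/female.names.txt" > "$workdir/patterns.txt"

# Anchored patterns match the first field only.
anchored=$(grep -f "$workdir/patterns.txt" "$workdir/datafile.csv")

# Raw names match anywhere on the line, including inside longer names.
plain=$(grep -f "$workdir/female.names.txt" "$workdir/datafile.csv")

printf 'anchored:\n%s\nplain:\n%s\n' "$anchored" "$plain"
```

The anchored run returns only the Ann record, while the plain run also picks up Anna and the record whose last name is Annan.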

So, I've come up with the following:

Suppose you have a file named test.txt containing the following lines:

abe 123 bdb 532

xyz 593 iau 591

Now you want to find the lines whose first field has a vowel as both its first and last letter. A simple grep for vowels would return both lines, but the following returns only the first line, which is the desired output:

egrep "^([0-z]{1,} ){0}[aeiou][0-z]+[aeiou]" test.txt

Next, you want to find the lines whose third field has a vowel as both its first and last letter. Similarly, a simple grep would return both lines, but the following returns only the second line, which is the desired output:

egrep "^([0-z]{1,} ){2}[aeiou][0-z]+[aeiou]" test.txt

The {1,} in the first pair of curly braces specifies that the preceding bracket expression, which matches any character from 0 to z in the ASCII table, must occur one or more times. After that comes the field separator, a space in this case. Change the value in the second pair of curly braces ({0} or {2} above) to the desired field number minus 1. Then use a regular expression to state your criteria. Note that the pattern as written does not anchor the end of the field, so it also accepts a field that merely contains a later vowel; append ( |$) to require the closing vowel to be the field's last character.
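The field-number rule can be wrapped in a small shell function. This is only a sketch under stated assumptions: match_field is a hypothetical helper name, grep -E is used in place of egrep, the bracket expression [^ ] (any non-space character) stands in for [0-z], and a trailing ( |$) is added so the closing vowel must end the field.

```shell
# Recreate the two-line sample file from the answer in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' 'abe 123 bdb 532' 'xyz 593 iau 591' > "$workdir/test.txt"

# match_field N FILE (hypothetical helper): print the lines whose Nth
# space-separated field starts and ends with a lowercase vowel.
match_field() {
    skip=$(( $1 - 1 ))   # number of leading fields to skip: field number minus 1
    grep -E "^([^ ]+ ){$skip}[aeiou][^ ]*[aeiou]( |\$)" "$2"
}

first=$(match_field 1 "$workdir/test.txt")
third=$(match_field 3 "$workdir/test.txt")
printf 'field 1: %s\nfield 3: %s\n' "$first" "$third"
```

With the sample file, field 1 selects the abe line and field 3 selects the iau line. A one-letter field would not match, since the pattern needs two separate vowels.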
