Find all occurrences of a string in a file and print their line numbers in Perl

I have a large file of 400000 lines; each line contains many keywords separated by tabs.

I also have a file that contains a list of keywords to be matched. This file acts as a lookup table.

For each keyword in the lookup table, I need to find all of its occurrences in the large file and print the line number of each occurrence.

I have tried this:

#!usr/bin/perl
use strict;
use warnings;

my $linenum = 0;

print "Enter the file path of lookup table:";
my $filepath1 = <>;

print "Enter the file path that contains keywords :";
my $filepath2 = <>;

open( FILE1, "< $filepath1" );
open FILE2, "< $filepath2" ;

open OUT, ">", "SampleLineNum.txt";

while( $line = <FILE1> )
{
    while( <FILE2> ) 
    {
        $linenum = $., last if(/$line/);
    }
    print OUT "$linenum ";
}

close FILE1;

This gives only the first occurrence of the keyword. I need all occurrences, and the keyword should match exactly.

The problem I am facing with exact matching is this: suppose I have the keywords "hello" and "hello world".

If I search for "hello", the script also returns the line numbers of lines that contain "hello world". It should match only "hello" and give its line number.

Here is a solution that matches every occurrence of all keywords:

#!/usr/bin/perl
use strict;
use warnings;

#Lexical variable for filehandle is preferred, and always error check opens.
open my $keywords,    '<', 'keywords.txt' or die "Can't open keywords: $!";
open my $search_file, '<', 'search.txt'   or die "Can't open search file: $!";

my $keyword_or = join '|', map {chomp;qr/\Q$_\E/} <$keywords>;
my $regex = qr|\b($keyword_or)\b|;

while (<$search_file>)
{
    while (/$regex/g)
    {
        print "$.: $1\n";
    }
}

keywords.txt:

hello
foo
bar

search.txt:

plonk
food is good
this line doesn't match anything
bar bar bar
hello world
lalalala
hello everyone

Output:

4: bar
4: bar
4: bar
5: hello
7: hello

Explanation:

This creates a single regex that matches all of the keywords in the keywords file.

<$keywords> - when this is used in list context, it returns a list of all lines of the file.

map {chomp;qr/\Q$_\E/} - this removes the newline from each line and applies the \Q...\E quote-literal regex operator to each line. This ensures that a keyword like "foo.bar" treats the dot as a literal character, not a regex metacharacter.

join '|', - join the resulting list into a single string, separated by pipe characters.

my $regex = qr|\b($keyword_or)\b|; - create a regex that looks like this:

/\b(\Qhello\E|\Qfoo\E|\Qbar\E)\b/

This regex will match any of your keywords. \b is the word boundary marker, ensuring that only whole words match: food no longer matches foo. The parentheses capture the specific keyword that matched in $1. This is how the output prints the keyword that matched.

I updated the solution to match each keyword on a given line and to match only complete words.
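To see the effect of the escaping and the word boundaries in isolation, here is a minimal sketch that builds the same kind of regex from an in-memory keyword list. The helper names build_keyword_regex and find_keywords and the sample strings are made up for illustration. Sorting longer keywords first is an extra assumption that the answer above does not make: regex alternation takes the first alternative that matches, so without the sort a keyword such as "foo" could win over "foo.bar".

```perl
use strict;
use warnings;

# Hypothetical helper: build one whole-word regex from a list of keywords.
sub build_keyword_regex {
    my @keywords = @_;

    # Assumption (not in the answer above): try longer keywords first,
    # because alternation stops at the first alternative that matches.
    my @ordered = sort { length($b) <=> length($a) } @keywords;

    # quotemeta() applies the same escaping as \Q...\E.
    my $alternation = join '|', map { quotemeta($_) } @ordered;
    return qr/\b($alternation)\b/;
}

# Hypothetical helper: return every keyword occurrence found in one line.
sub find_keywords {
    my ( $regex, $line ) = @_;
    my @found;
    while ( $line =~ /$regex/g ) {
        push @found, $1;
    }
    return @found;
}

my $regex = build_keyword_regex( 'hello', 'foo', 'foo.bar' );

for my $line ( 'food is good', 'foo and foo', 'fooxbar', 'say foo.bar' ) {
    my @found = find_keywords( $regex, $line );
    print "'$line' => ", ( @found ? join( ', ', @found ) : '(no match)' ), "\n";
}
```

With these inputs, only 'foo and foo' (two matches of foo) and 'say foo.bar' (one match of foo.bar) report anything. 'fooxbar' does not match because the escaped dot is literal, and 'food is good' does not match because of the word boundary.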

Is this part of something bigger? If not, this is a one-liner with grep:

grep -n hello filewithlotsalines.txt

grep -n "hello world" filewithlotsalines.txt

-n makes grep show the line number before each matching line. Run man grep for more options.

I am assuming here that you are on a Linux or other *nix system.
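If grep is not available, roughly the same output can be produced in Perl. The following is a minimal sketch, not a full grep replacement: the helper name grep_n is hypothetical, it searches for a literal substring (grep treats the pattern as a regular expression by default), and it reads sample data from an in-memory filehandle so that it runs without any files.

```perl
use strict;
use warnings;

# Hypothetical helper: mimic `grep -n` for a literal string,
# reading from any open filehandle.
sub grep_n {
    my ( $fh, $needle ) = @_;
    my @hits;
    while ( my $line = <$fh> ) {
        chomp $line;
        # index() is a plain substring search; $. is the current line number.
        push @hits, "$.:$line" if index( $line, $needle ) >= 0;
    }
    return @hits;
}

# In-memory sample data (made up for illustration).
my $text = "plonk\nhello world\nlalalala\nhello everyone\n";
open my $fh, '<', \$text or die "Can't open in-memory file: $!";
print "$_\n" for grep_n( $fh, 'hello' );
close $fh;
```

For the sample data this prints "2:hello world" and "4:hello everyone", the same "line number, colon, line" layout that grep -n uses.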

I have a different interpretation of your request. It seems that you want to maintain, for each entry in a lookup table, a list of the line numbers where that entry is found in a 'keyword' file. Here's a sample lookup table:

hello world
hello
perl
hash
Test
script

And a tab-delimited 'keyword' file, where multiple keywords may be found on a single line:

programming tests
hello   everyone
hello   hello world perl
scripting   scalar
test    perl    script
hello world perl    script  hash

Given the above, consider the following solution:

use strict;
use warnings;

my %lookupTable;

print "Enter the file path of lookup table: \n";
chomp( my $lookupTableFile = <> );

print "Enter the file path that contains keywords: \n";
chomp( my $keywordsFile = <> );

open my $ltFH, '<', $lookupTableFile or die $!;

while (<$ltFH>) {
    chomp;
    undef @{ $lookupTable{$_} };
}

close $ltFH;

open my $kfFH, '<', $keywordsFile or die $!;

while (<$kfFH>) {
    chomp;
    for my $keyword ( split /\t+/ ) {
        push @{ $lookupTable{$keyword} }, $. if defined $lookupTable{$keyword};
    }
}

close $kfFH;

open my $slFH, '>', 'SampleLineNum.txt' or die $!;

print $slFH "$_: @{ $lookupTable{$_} }\n"
  for sort { lc $a cmp lc $b } keys %lookupTable;

close $slFH;

print "Done!\n";

Output to SampleLineNum.txt:

hash: 6
hello: 2 3
hello world: 3 6
perl: 3 5 6
script: 5 6
Test: 

The script uses a hash of arrays (HoA), where each key is an entry from the lookup table and the associated value is a reference to a list of the line numbers where that entry was found in the 'keyword' file. Each key in %lookupTable is initialized with a reference to an empty list.

Each line of the 'keyword' file is split on the delimiting tabs, and if a corresponding entry is defined in %lookupTable, the line number is pushed onto the corresponding list. When done, the %lookupTable keys are sorted case-insensitively and written out to SampleLineNum.txt, along with the list of line numbers where each entry was found, if any.

There are no sanity checks on the file names entered, so consider adding those.

Hope this helps!

To find all of the occurrences, you need to read in the keywords and then loop through them to find matches on each line. Here is my modified version of your script, which uses an array to find keywords in each line. I also added a counter to track the line number, which is printed when there is a match. Your code prints an item for every line even if there is no match.
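One possible shape for such checks is sketched below, using Perl's built-in file test operators. The helper name check_input_file and the wording of the messages are hypothetical. The sketch only tests that a path exists, is a plain file, and is readable, so which conditions to enforce in practice is up to you.

```perl
use strict;
use warnings;

# Hypothetical helper: return an error message describing the first
# problem found, or undef if the path looks usable.
sub check_input_file {
    my ($path) = @_;
    return "no file name given"
        if !defined $path || $path eq '';
    return "'$path' does not exist"      if !-e $path;
    return "'$path' is not a plain file" if !-f $path;
    return "'$path' is not readable"     if !-r $path;
    return;    # no problem found
}

# Example use: check every path given on the command line.
for my $path (@ARGV) {
    my $problem = check_input_file($path);
    print defined $problem ? "Problem: $problem\n" : "OK: $path\n";
}
```

In the script above, the same check could be called right after each chomp, before the corresponding open.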

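#!/usr/bin/perl
use strict;
use warnings;

my $linenum = 0;

print "Enter the file path of lookup table:";
chomp( my $filepath1 = <> );

print "Enter the file path that contains keywords :";
chomp( my $filepath2 = <> );

# Three-argument open, with error checking
open( FILE1, '<', $filepath1 ) or die "Can't open $filepath1: $!";
open( FILE2, '<', $filepath2 ) or die "Can't open $filepath2: $!";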

# Read in all of the keywords
my @keywords = <FILE2>; 

# Close the file2
close(FILE2);

# Remove the line returns from the keywords
chomp @keywords;

# Sort and reverse the items to compare the maximum length items
# first (hello there before hello)
@keywords = reverse sort @keywords;

foreach my $k ( @keywords)
{
  print "$k\n";
}
open OUT, ">", "SampleLineNum.txt";
my $line;
# Counter for the lines in the file
my $count = 0;
while( $line = <FILE1> )
{
    # Increment the counter for the number of lines
    $count++;
    # loop through the keywords to find matches
    foreach my $k ( @keywords ) 
    {
        # If there is a match, print out the line number 
        # and use last to exit the loop and go to the 
        # next line
        if ( $line =~ m/\Q$k\E/ )
        {
            print "$count\n";
            last;
        }
    }
}

close FILE1;

I think there are some questions similar to this one. You can check out:

The File::Grep module is interesting.

As others have already given Perl solutions, I will suggest that you could also use awk here.

> cat temp
abc
bac
xyz

> cat temp2
abc     jbfwerf kfnm
jfjkwebfkjwe    bac     xyz
ndwjkfn abc kenmfkwe    bac     xyz

> awk 'FNR==NR{a[$1];next}{for(i=1;i<=NF;i++)if($i in a)print $i,FNR}' temp temp2
abc 1
bac 2
xyz 2
abc 3
bac 3
xyz 3
>
