Find all occurrences of strings in a file and print their line numbers in Perl

Question

I have a large file containing 400000 lines; each line contains many keywords separated by tabs.

I also have a file that contains a list of keywords to be matched. Say this file acts as a lookup table.

So for each keyword in the lookup table, I need to find all of its occurrences in the given file and print the line number of each occurrence.

I have tried this:

#!usr/bin/perl
use strict;
use warnings;

my $linenum = 0;

print "Enter the file path of lookup table:";
my $filepath1 = <>;

print "Enter the file path that contains keywords :";
my $filepath2 = <>;

open( FILE1, "< $filepath1" );
open FILE2, "< $filepath2" ;

open OUT, ">", "SampleLineNum.txt";

while( $line = <FILE1> )
{
    while( <FILE2> ) 
    {
        $linenum = $., last if(/$line/);
    }
    print OUT "$linenum ";
}

close FILE1;

This gives only the first occurrence of the keyword. But I need all the occurrences, and the keyword should match exactly.

The problem I am facing with exact matching is this: for instance, I have the keywords "hello" and "hello world".

If I need to match "hello", it also returns the line numbers of lines that contain "hello world". My script should match only "hello" and give its line number.

Answer 1

Here is a solution that matches every occurrence of all keywords:

#!/usr/bin/perl
use strict;
use warnings;

#Lexical variable for filehandle is preferred, and always error check opens.
open my $keywords,    '<', 'keywords.txt' or die "Can't open keywords: $!";
open my $search_file, '<', 'search.txt'   or die "Can't open search file: $!";

my $keyword_or = join '|', map {chomp;qr/\Q$_\E/} <$keywords>;
my $regex = qr|\b($keyword_or)\b|;

while (<$search_file>)
{
    while (/$regex/g)
    {
        print "$.: $1\n";
    }
}

keywords.txt:

hello
foo
bar

search.txt:

plonk
food is good
this line doesn't match anything
bar bar bar
hello world
lalalala
hello everyone

Output:

4: bar
4: bar
4: bar
5: hello
7: hello

Explanation:

This creates a single regex that matches all of the keywords in the keywords file.

<$keywords> - when this is used in list context, it returns a list of all lines of the file.

map {chomp;qr/\Q$_\E/} - this removes the newline from each line and applies the \Q...\E quote-literal regex operator to each line. (This ensures that if you have a keyword like "foo.bar", the dot is treated as a literal character, not as a regex metacharacter.)

join '|', - joins the resulting list into a single string, separated by pipe characters.

my $regex = qr|\b($keyword_or)\b|; - creates a regex that looks like this:

/\b(\Qhello\E|\Qfoo\E|\Qbar\E)\b/

This regex will match any of your keywords. \b is the word boundary marker, ensuring that only whole words match: food no longer matches foo. The parentheses capture the specific keyword that matched in $1. This is how the output prints the keyword that matched.

I updated the solution to match each keyword on a given line and to only match complete words.
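
One case the regex above may not handle the way the question wants is a pair of keywords such as "hello" and "hello world": Perl tries alternatives from left to right, so whichever comes first in the keywords file wins. A minimal sketch of one way around that, assuming it is acceptable to sort the keywords longest-first before joining them. The in-memory arrays and the find_matches sub are made up for illustration and stand in for the two files:

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for keywords.txt and search.txt.
my @keywords = ('hello', 'hello world', 'foo');
my @lines    = ('food is good', 'hello world', 'say hello', 'foo and hello world');

# Longest keywords first, so "hello world" is tried before "hello".
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) || $a cmp $b } @keywords;
my $regex = qr/\b($alternation)\b/;

# Return [line number, keyword] pairs for every whole-word match.
sub find_matches {
    my ($re, @text) = @_;
    my @found;
    for my $i (0 .. $#text) {
        while ($text[$i] =~ /$re/g) {
            push @found, [ $i + 1, $1 ];
        }
    }
    return @found;
}

my @matches = find_matches($regex, @lines);
print "$_->[0]: $_->[1]\n" for @matches;
```

With this ordering, a line containing "hello world" is reported for "hello world" only, not for "hello" as well. If you want both reported, you would need to match each keyword separately instead.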

Answer 2

Is this part of something bigger? Because this is a one-liner with grep:

grep -n hello filewithlotsalines.txt

grep -n "hello world" filewithlotsalines.txt

-n makes grep show the line number before each matching line. See man grep for more options.

I am assuming here that you are on a Linux or *nix system.

Answer 3

I have a different interpretation of your request. It seems that you may want to maintain a list of line numbers where certain entries from a lookup table are found on lines of a 'keyword' file. Here's a sample lookup table:

hello world
hello
perl
hash
Test
script

And a tab-delimited 'keyword' file, where multiple keywords may be found on a single line:

programming tests
hello   everyone
hello   hello world perl
scripting   scalar
test    perl    script
hello world perl    script  hash

Given the above, consider the following solution:

use strict;
use warnings;

my %lookupTable;

print "Enter the file path of lookup table: \n";
chomp( my $lookupTableFile = <> );

print "Enter the file path that contains keywords: \n";
chomp( my $keywordsFile = <> );

open my $ltFH, '<', $lookupTableFile or die $!;

while (<$ltFH>) {
    chomp;
    undef @{ $lookupTable{$_} };
}

close $ltFH;

open my $kfFH, '<', $keywordsFile or die $!;

while (<$kfFH>) {
    chomp;
    for my $keyword ( split /\t+/ ) {
        push @{ $lookupTable{$keyword} }, $. if defined $lookupTable{$keyword};
    }
}

close $kfFH;

open my $slFH, '>', 'SampleLineNum.txt' or die $!;

print $slFH "$_: @{ $lookupTable{$_} }\n"
  for sort { lc $a cmp lc $b } keys %lookupTable;

close $slFH;

print "Done!\n";

Output to SampleLineNum.txt:

hash: 6
hello: 2 3
hello world: 3 6
perl: 3 5 6
script: 5 6
Test:

The script uses a hash of arrays (HoA), where the key is an entry from the lookup table and the associated value is a reference to a list of line numbers where that entry was found on lines of the 'keyword' file. Each entry in %lookupTable is initialized with a reference to an empty array.

Each line of the 'keyword' file is split on the delimiting tab, and if a corresponding entry is defined in %lookupTable, the line number is pushed onto the corresponding list. When done, the %lookupTable keys are sorted case-insensitively and written out to SampleLineNum.txt, along with the list of line numbers where each entry was found, if any.

There are no sanity checks on the file names entered, so consider adding those.

Hope this helps!
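
For reference, here is a stripped-down sketch of the same hash-of-arrays idea on in-memory data. It is a variation, not the script above: it seeds each entry with [] and tests with exists rather than using undef @{...} and defined, which I assume behaves the same for this purpose. The names %line_numbers, @lookup and @rows are made up for the example:

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for the two input files.
my @lookup = ('hello world', 'hello', 'perl', 'Test');
my @rows   = ("hello\teveryone", "hello\thello world\tperl", "test\tperl");

# Seed every lookup entry with its own empty array reference.
my %line_numbers = map { ($_ => []) } @lookup;

for my $i (0 .. $#rows) {
    for my $field (split /\t+/, $rows[$i]) {
        # exists() tests for the key itself, so only lookup entries are recorded.
        push @{ $line_numbers{$field} }, $i + 1 if exists $line_numbers{$field};
    }
}

# Case-insensitive sort of the keys, as in the script above.
for my $key (sort { lc($a) cmp lc($b) } keys %line_numbers) {
    print "$key: @{ $line_numbers{$key} }\n";
}
```

Because each tab-separated field is compared as a whole hash key, "hello" and "hello world" are kept apart, and the comparison is case-sensitive ("test" does not count towards "Test").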

Answer 4

To find all of the occurrences, you need to read in the keywords and then loop through the keywords to find matches for each line. Here is my modified version, which uses an array to find keywords in the line. In addition, I added a counter for the line number, and the line number is printed only if there is a match. Your code prints an item for each line even if there is no match.

#!/usr/bin/perl
use strict;
use warnings;

my $linenum = 0;

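print "Enter the file path of lookup table:";
chomp( my $filepath1 = <> );

print "Enter the file path that contains keywords :";
chomp( my $filepath2 = <> );

# Three-argument open with an error check on each file
open( FILE1, '<', $filepath1 ) or die "Can't open $filepath1: $!";
open( FILE2, '<', $filepath2 ) or die "Can't open $filepath2: $!";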

# Read in all of the keywords
my @keywords = <FILE2>; 

# Close the file2
close(FILE2);

# Remove the line returns from the keywords
chomp @keywords;

# Sort and reverse the items to compare the maximum length items
# first (hello there before hello)
@keywords = reverse sort @keywords;

foreach my $k ( @keywords)
{
  print "$k\n";
}
open OUT, ">", "SampleLineNum.txt";
my $line;
# Counter for the lines in the file
my $count = 0;
while( $line = <FILE1> )
{
    # Increment the counter for the number of lines
    $count++;
    # loop through the keywords to find matches
    foreach my $k ( @keywords ) 
    {
        # If there is a match, print out the line number 
        # and use last to exit the loop and go to the 
        # next line
        if ( $line =~ m/$k/ ) 
        {
            print "$count\n";
            last;
        }
    }
}

close FILE1;
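
Note that the loop above stops at the first keyword that matches a line (because of last), and m/$k/ also matches substrings. If every keyword should be reported and only exact matches count, a minimal sketch could compare whole tab-separated fields instead. This assumes each keyword occupies exactly one field; the sub name matching_keywords and the sample data are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the two files.
my @keywords = ('hello', 'hello world');
my @lines    = ("hello world\tfoo", "hello\tbar", "say hello");

# Return the keywords that appear as a complete tab-separated field of $line.
sub matching_keywords {
    my ($line, @kw) = @_;
    my %fields = map { ($_ => 1) } split /\t/, $line;
    return grep { $fields{$_} } @kw;
}

my @report;
for my $i (0 .. $#lines) {
    # No "last" here, so every matching keyword on the line is recorded.
    for my $k (matching_keywords($lines[$i], @keywords)) {
        push @report, ($i + 1) . ": $k";
    }
}
print "$_\n" for @report;
```

Here the first line is reported for "hello world" only and the second for "hello" only; the third line is not reported, because "say hello" as a whole field equals neither keyword.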

Answer 5

I think there are some questions similar to this one. You can check out:

The File::Grep module is interesting.

Answer 6

As others have already given Perl solutions, I suggest that you could use awk here.

> cat temp
abc
bac
xyz

> cat temp2
abc     jbfwerf kfnm
jfjkwebfkjwe    bac     xyz
ndwjkfn abc kenmfkwe    bac     xyz

> awk 'FNR==NR{a[$1];next}{for(i=1;i<=NF;i++)if($i in a)print $i,FNR}' temp temp2
abc 1
bac 2
xyz 2
abc 3
bac 3
xyz 3
>
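
If you would rather stay in Perl, the same logic can be sketched roughly as follows. This is an approximation of the awk one-liner, not a drop-in replacement: the arrays @temp and @temp2 are hypothetical in-memory copies of the two files shown above:

```perl
use strict;
use warnings;

# Hypothetical in-memory copies of the files temp and temp2.
my @temp  = ('abc', 'bac', 'xyz');
my @temp2 = (
    "abc     jbfwerf kfnm",
    "jfjkwebfkjwe    bac     xyz",
    "ndwjkfn abc kenmfkwe    bac     xyz",
);

# Like a[$1] in the awk script: remember every keyword as a hash key.
my %wanted = map { ($_ => 1) } @temp;

my @out;
for my $i (0 .. $#temp2) {
    # split ' ' mimics awk's default field splitting on runs of whitespace.
    for my $field (split ' ', $temp2[$i]) {
        push @out, "$field " . ($i + 1) if $wanted{$field};
    }
}
print "$_\n" for @out;
```

Like the awk version, this splits on any whitespace, so it only suits keywords that contain no spaces themselves.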

6 answers

Answer 1
7 Accepted 2012-12-19 09:16:27

Answer 2
6 2012-12-19 05:54:18

Answer 3
1 2012-12-19 19:55:57

Answer 4
0 2012-12-19 06:43:28

Answer 5
0 2012-12-19 09:18:52

Answer 6
0 2012-12-19 09:19:41