
Find all repeated patterns in a file

I have a file that contains a set of a few thousand unique words/terms. It looks like:

high school teacher
high school student
library
pencil stand
college professor
college graduate

I need the list of all repeated patterns, so in this case I would need the following as the result:

high
school
high school
college

Is there any way to achieve this in unix/vim?

Additional elaboration on the requirement:

Q. Do the repeats have to be on a single line, or can they be split over several lines?

  • Ideally, each pattern should be on a new line.

Q. Are the patterns all word sequences (one or more words)?

  • Yes, they are all word sequences.

Q. Does spacing matter within a line? Capitalization? Punctuation?

  • Spaces and punctuation are all counted as part of the pattern. We can ignore capitalisation.

i.e.

  • School == School != school
  • this pat.tern == this pat.tern != this pattern

This works for me (script placed in a file script.awk):

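# Main block: runs once for every input line.
{
    for (i = 1; i <= NF; i++)
    {
        # Count the single word at field i.
        count[$i]++
        sequence = $i
        # Extend the sequence one word at a time to the end of the line.
        for (j = i + 1; j <= NF; j++)
        {
            sequence = sequence " " $j
            count[sequence]++
        }
    }
}
# After all input: print each sequence that was seen more than once.
END {
    for (i in count)
    {
        if (count[i] > 1)
            print i
    }
}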

The main block, which runs for every input line, builds up each contiguous word sequence on that line and counts it. The END block loops through the sequences, printing those with a count of more than one (that is, the word sequence was repeated).
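To make the enumeration concrete, here is a minimal sketch that prints the sequences for a line instead of counting them. It is not part of the original answer: the shell function name list_sequences is made up for illustration, and the awk loop simply mirrors the one in script.awk.

```shell
# Hypothetical helper: print every contiguous word sequence of each
# input line, in the order the nested loops generate them.
list_sequences() {
    awk '{
        for (i = 1; i <= NF; i++) {
            sequence = $i
            print sequence                  # the single word at field i
            for (j = i + 1; j <= NF; j++) {
                sequence = sequence " " $j  # extend by one word
                print sequence
            }
        }
    }'
}

printf '%s\n' 'high school teacher' | list_sequences
```

For this three-word line the loops produce six sequences: high, high school, high school teacher, school, school teacher, teacher. In general a line of n words yields n(n+1)/2 sequences, and each of them becomes a key in the count array.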

Given the (extended) data file (called data):

high school teacher
high school student
library
pencil stand
college professor
college graduate
coelacanths are ancient fish
coelacanths are ancient but still alive
coelacanths are ancient and long lived
coelacanths are ancient and can live to be 100 years old
coelacanths are ancient living fossils
coelacanths can live to be ancient
coelacanths are long-lived
coelacanths are slow to mature
coelacanths are denizens of the deep sea
coelacanths can be found off Africa and Indonesia

The output of awk -f script.awk data | sort is:

ancient
ancient and
and
are
are ancient
are ancient and
be
can
can live
can live to
can live to be
coelacanths
coelacanths are
coelacanths are ancient
coelacanths are ancient and
coelacanths can
college
high
high school
live
live to
live to be
school
to
to be

The data deliberately includes some longer repeated sequences of up to four words; longer word sequences would be tracked just as effectively.
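As a quick check of that last point, the following sketch feeds two lines that share a six-word prefix to the same counting logic. The wrapper function find_repeats and the two input lines are invented for illustration; the awk program is the one from the answer, inlined so the example is self-contained.

```shell
# Hypothetical wrapper around the awk program from the answer.
# Reads the named files (or standard input) and prints each word
# sequence that occurs more than once, sorted.
find_repeats() {
    awk '{
        for (i = 1; i <= NF; i++) {
            count[$i]++
            sequence = $i
            for (j = i + 1; j <= NF; j++) {
                sequence = sequence " " $j
                count[sequence]++
            }
        }
    }
    END {
        for (s in count)
            if (count[s] > 1)
                print s
    }' "$@" | LC_ALL=C sort
}

# Two made-up lines sharing a six-word prefix.
printf '%s\n' \
    'one two three four five six alpha' \
    'one two three four five six beta' | find_repeats
```

The output should include the full six-word sequence "one two three four five six" alongside all of its shorter sub-sequences, since nothing in the loops limits the sequence length.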


 