Find all repeated patterns in a file
I have a file that contains a set of a few thousand unique words/terms. It looks like:
high school teacher
high school student
library
pencil stand
college professor
college graduate
I need the list of all repeated patterns, so in this case I would need the following as the result:
high
school
high school
college
Is there a way to achieve this in Unix/vim?
Additional elaboration on the requirement:
Q. Do the repeats have to be on a single line, or can they be split over several lines?
Q. Are the patterns all word sequences (one or more words)?
Q. Does spacing matter within a line? Capitalization? Punctuation?
i.e.
School
== School
!= school
this pat.tern
== this pat.tern
!= this pattern
This works for me (script placed in a file script.awk):
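# For every line, count each contiguous word sequence that starts at field i.
{
    for (i = 1; i <= NF; i++)
    {
        count[$i]++                      # the single word at position i
        sequence = $i
        for (j = i + 1; j <= NF; j++)
        {
            sequence = sequence " " $j   # extend the sequence by one word
            count[sequence]++
        }
    }
}
# Print every sequence that was seen more than once.
END {
    for (i in count)
    {
        if (count[i] > 1)
            print i
    }
}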
The 'every line' block builds every contiguous word sequence on the line and increments a counter for each one. The END block loops through the counters, printing the sequences with a count of more than one (so the word sequence was repeated). The order in which awk visits the array in the END block is unspecified, which is why the output is piped through sort.
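For readers who want to check the logic outside awk, here is a minimal Python sketch of the same counting scheme, run against the six-line sample from the question. The function name repeated_sequences is made up for this illustration; it is not part of the original answer, and the sketch only mirrors the awk script rather than replacing it.

```python
from collections import Counter


def repeated_sequences(lines):
    """Return, sorted, every contiguous word sequence seen more than once."""
    count = Counter()
    for line in lines:
        words = line.split()  # whitespace-separated fields, like awk's $1..$NF
        for i in range(len(words)):
            # Count the sequences words[i..j-1] for every possible end point.
            for j in range(i + 1, len(words) + 1):
                count[" ".join(words[i:j])] += 1
    return sorted(seq for seq, n in count.items() if n > 1)


sample = [
    "high school teacher",
    "high school student",
    "library",
    "pencil stand",
    "college professor",
    "college graduate",
]

for seq in repeated_sequences(sample):
    print(seq)
```

On the sample this prints college, high, high school and school, which is the result asked for in the question. Like the awk script, it counts occurrences rather than lines, so a sequence that appears twice within a single line is also reported.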
Given the (extended) data file (called data):
high school teacher
high school student
library
pencil stand
college professor
college graduate
coelacanths are ancient fish
coelacanths are ancient but still alive
coelacanths are ancient and long lived
coelacanths are ancient and can live to be 100 years old
coelacanths are ancient living fossils
coelacanths can live to be ancient
coelacanths are long-lived
coelacanths are slow to mature
coelacanths are denizens of the deep sea
coelacanths can be found off Africa and Indonesia
The output of awk -f script.awk data | sort is:
ancient
ancient and
and
are
are ancient
are ancient and
be
can
can live
can live to
can live to be
coelacanths
coelacanths are
coelacanths are ancient
coelacanths are ancient and
coelacanths can
college
high
high school
live
live to
live to be
school
to
to be
The data deliberately includes some longer repeated sequences of up to four words; longer word sequences would be tracked just as effectively.
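The questions about capitalization and punctuation are left open in the question, and the awk script compares words exactly as they appear, so School and school (or pat.tern and pattern) count as different words. If folding were wanted, one possible approach is to normalize each word before counting. The Python sketch below is an assumption about what such a variant could look like; the names normalize, fold_case and strip_punct are hypothetical and do not come from the original answer.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(word, fold_case, strip_punct):
    """Optionally lower-case a word and remove ASCII punctuation from it."""
    if fold_case:
        word = word.lower()
    if strip_punct:
        word = word.translate(_PUNCT_TABLE)
    return word


def repeated_sequences(lines, fold_case=False, strip_punct=False):
    """Return, sorted, every contiguous word sequence seen more than once."""
    count = Counter()
    for line in lines:
        words = [normalize(w, fold_case, strip_punct) for w in line.split()]
        words = [w for w in words if w]  # drop tokens that were only punctuation
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                count[" ".join(words[i:j])] += 1
    return sorted(seq for seq, n in count.items() if n > 1)


print(repeated_sequences(["School bus", "school bus"]))
print(repeated_sequences(["School bus", "school bus"], fold_case=True))
print(repeated_sequences(["this pat.tern", "this pattern"], strip_punct=True))
```

With both options off, the function behaves like the exact-match awk script, so the first call reports only bus. Turning an option on merges the variants before counting, so the second call also reports school and school bus, and the third reports pattern, this and this pattern.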