
Linux shell script to count occurrences of a character sequence in a text file?

I have a large text file (over 70 MB) and need to count the number of times a character sequence occurs in the file. I can find plenty of scripts to do this, but none of them take into account that a sequence can start and finish on different lines. For the sake of efficiency (I actually have far more than one file to process), I cannot preprocess the files to remove newlines.

Example: If I am searching for "thisIsTheSequence", the following file would have 3 matches:

asdasdthisIsTheSequence
asdasdasthisIsT
heSequenceasdasdthisIsTheSequ
encesadasdasda

Thanks for the help.

One option:

# grep -o (a GNU/BSD extension) prints each match on its own line, so wc -l is the count
tr -d '\n' < file | grep -o 'thisIsTheSequence' | wc -l

There are probably more efficient methods using utilities outside the core of the shell, particularly if you can fit the file in memory.
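Since the question mentions more than one file, the same pipeline idea can be wrapped in a small function. This is only a sketch: `count_seq` is a hypothetical name, and it assumes a grep that supports `-o` (GNU and BSD grep do; POSIX does not require it). The `-F` flag makes the pattern a literal string rather than a regex. Note that after `tr` the data is one very long line, so grep may still hold most of the file in memory.

```shell
#!/bin/sh
# count_seq PATTERN FILE...   (hypothetical helper, not from the original answer)
# Prints "FILE: COUNT" for each file, counting non-overlapping occurrences
# of the literal PATTERN with line breaks ignored.
count_seq() {
    pattern=$1
    shift
    for f in "$@"; do
        # tr streams the file with newlines removed; grep -o prints one
        # match per line; wc -l counts those lines. The arithmetic
        # expansion strips any padding that wc adds on some systems.
        n=$(( $(tr -d '\n' < "$f" | grep -oF -- "$pattern" | wc -l) ))
        printf '%s: %d\n' "$f" "$n"
    done
}
```

It could be called as `count_seq thisIsTheSequence file1 file2`, which reads each file once and never writes a newline-free copy to disk.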

A single awk script will do. Since you are processing a huge file, chaining multiple pipes can slow things down.

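#!/bin/bash
awk 'BEGIN{
 search="thisIsTheSequence"   # gsub treats this as a regex; escape any metacharacters
 keep=length(search)-1        # longest partial match that can end the buffer
 total=0
}
{ s=s $0 }                    # append each line without its newline
NR%10==0{
  # Replace matches with "\n", which never occurs in s, so that the text on
  # either side of a match cannot join up into a false match.
  total+=gsub(search,"\n",s)
  # Keep only the tail that could start a match continuing on the next line.
  if (length(s)>keep) s=substr(s,length(s)-keep+1)
}
END{
 total+=gsub(search,"\n",s)
 print "total count: "total
}' file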

Output:

$ more file
asdasdthisIsTheSequence
asdasdasthisIsT
heSequenceasdasdthisIsTheSequ
encesadasdasdaasdasdthisIsTheSequence
asdasdasthisIsT
heSequenceasdasdthisIsTheSequ
encesadasdasda
asdasdthisIsTheSequence
asdasdasthisIsT
heSequenceasdasdthisIsTheSequ
encesadasdasda

$ ./shell.sh
total count: 9

Is there ever going to be more than one newline in your sequence?

If not, one solution would be to split your sequence in half and search for the halves (e.g. search for "thisIsTh" and also for "eSequence"), then go back to the occurrences you find and take a closer look, i.e. strip out the newlines in that area and check for a match.

Basically, this is a fast way of filtering the data to find the interesting parts.
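One way to read the halves idea is sketched below, under the stated assumption that an occurrence is broken by at most one newline. The function name `count_split` and its variable names are hypothetical. A match that straddles a line break must leave either the first half at the end of the previous line or the second half at the start of the current line, so the cheap `index` test on the halves acts as the filter, and the closer look only joins the last and first `length - 1` characters around the break. The sequence is treated as a literal string without backslashes, and a self-overlapping sequence (such as "abab") may be counted more than once.

```shell
#!/bin/sh
# count_split SEQ FILE   (hypothetical helper illustrating the halves filter)
# Counts occurrences of SEQ inside single lines plus occurrences that are
# broken by exactly one newline.
count_split() {
    awk -v seq="$1" '
    BEGIN {
        n = length(seq)
        h = int(n / 2)
        left  = substr(seq, 1, h)      # first half, e.g. "thisIsTh"
        right = substr(seq, h + 1)     # second half, e.g. "eSequence"
    }
    {
        # Filter: a straddling match needs one complete half on one side.
        if (NR > 1 && (index(prev, left) || index($0, right))) {
            # Closer look: join at most n-1 characters from each side,
            # so any match found here must cross the line break.
            start = length(prev) - n + 2
            if (start < 1) start = 1
            if (index(substr(prev, start) substr($0, 1, n - 1), seq)) total++
        }
        # Occurrences lying wholly inside the current line.
        rest = $0
        while ((i = index(rest, seq)) > 0) {
            total++
            rest = substr(rest, i + n)
        }
        prev = $0
    }
    END { print total + 0 }
    ' "$2"
}
```

Because only the previous line is kept, memory use stays proportional to the longest line rather than to the file size.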

Use something like:

head -n LL filename | tail -n YY | grep text | wc -l

where LL is the last line of the sequence and YY is the number of lines the sequence spans (i.e. LL minus the first line, plus one).
