How to count each letter in a file?
I have a cord.txt file as shown below:
188H,190D,245H
187D,481E,482T
187H,194E,196D
386D,388E,389N,579H
44E,60D
I need to count each letter and produce a summary as shown below (expected output):
H,4
D,5
E,4
T,1
I know how to count a single letter by using grep "<letter>" cord.txt | wc. But I have a huge file that contains many more letters, so please help me do the same for all of them.
Thanks in advance.
You're missing the N :-)
grep -o '[[:alpha:]]' cord.txt | sort | uniq -c
grep -o only outputs the matching part. With the POSIX class [[:alpha:]], it outputs all the letters contained in the input, one per line.
sort groups the same letters together.
uniq -c reports unique lines with their counts. It needs sorted input, as it only compares the current line to the previous one.
The following command:
sed 's/[^a-zA-Z]//g' < input.txt | fold -w 1 -s | sort | uniq -c > output.txt
# ^                                ^              ^      ^
# 1.                               2.             3.     4.
Input:
188H,190D,245H
187D,481E,482T
187H,194E,196D
386D,388E,389N,579H
44E,60D
Output:
5 D
4 E
4 H
1 N
1 T
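To illustrate why both pipelines above sort before counting, here is a rough Python sketch of the same four steps. It is not part of either answer: the sample string is copied from the question, and count_runs is a hypothetical helper that counts runs of adjacent equal items, which is my approximation of what uniq -c does with lines.

```python
from itertools import groupby

# Sample data copied from the question.
sample = (
    "188H,190D,245H\n"
    "187D,481E,482T\n"
    "187H,194E,196D\n"
    "386D,388E,389N,579H\n"
    "44E,60D\n"
)


def count_runs(items):
    """Hypothetical helper: count runs of adjacent equal items."""
    return [(len(list(run)), item) for item, run in groupby(items)]


# Steps 1 and 2: keep only ASCII letters, one character per item
# (roughly what sed 's/[^a-zA-Z]//g' followed by fold -w 1 produces).
letters = [ch for ch in sample if ch.isascii() and ch.isalpha()]

# Without sorting, the same letter is split across several separate runs.
unsorted_counts = count_runs(letters)

# Steps 3 and 4: sort first, then count adjacent runs.
sorted_counts = count_runs(sorted(letters))

for n, letter in sorted_counts:
    print(f"{n:7d} {letter}")
```

Comparing unsorted_counts with sorted_counts shows the effect: without the sort, a letter such as H is reported several times with partial counts instead of once with its total.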
You might use Python's collections.Counter as follows. Let the content of cord.txt be:
188H,190D,245H
187D,481E,482T
187H,194E,196D
386D,388E,389N,579H
44E,60D
and let counting.py be:
import collections

counter = collections.Counter()
with open("cord.txt", "r") as f:
    for line in f:
        # Count only alphabetic characters; digits, commas and newlines are skipped.
        counter.update(i for i in line if i.isalpha())
for char, cnt in counter.items():
    print("{},{}".format(char, cnt))
Then python counting.py outputs:
H,4
D,5
E,4
T,1
N,1
Note that I used for line in f, where f is the file handle, to avoid loading the whole file into memory.
Disclaimer: I used Python 3.7. Older versions should work but might print the letters in a different order, as collections.Counter is a subclass of dict, and dicts do not keep insertion order in older Python versions.
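If the output order matters regardless of the Python version, one option is to sort explicitly instead of relying on dict ordering. The following is only a sketch under my own assumptions: count_letters and sample_lines are hypothetical names, and the in-memory list stands in for the open file handle (any iterable of lines would do).

```python
import collections


def count_letters(lines):
    """Hypothetical helper: count alphabetic characters in an iterable of lines."""
    counter = collections.Counter()
    for line in lines:
        counter.update(ch for ch in line if ch.isalpha())
    return counter


# Sample data copied from the question; a file handle could be passed instead.
sample_lines = [
    "188H,190D,245H\n",
    "187D,481E,482T\n",
    "187H,194E,196D\n",
    "386D,388E,389N,579H\n",
    "44E,60D\n",
]

counter = count_letters(sample_lines)

# Alphabetical order, independent of insertion order.
alphabetical = sorted(counter.items())
for char, cnt in alphabetical:
    print("{},{}".format(char, cnt))

# Alternatively, most frequent letters first.
by_count = counter.most_common()
```

sorted(counter.items()) gives the same alphabetical order that the sort | uniq -c pipelines produce, while most_common() is handy when only the most frequent letters are of interest.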
In short:
tr '[0-9],' \\n <input | sort | uniq -c
43
5 D
4 E
4 H
1 N
1 T
OK, there are also 43 empty lines, left over from the replaced digits and commas. You could drop them and match the requested format by adding sed:
tr '[0-9],' \\n </tmp/so/input | sort | uniq -c |
sed -ne 's/^ *\([0-9]\+\) \(.\)/\2,\1/p'
D,5
E,4
H,4
N,1
T,1
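For anyone wondering where the count of 43 comes from, here is a small Python sketch of what the tr step does to the sample input. It is an approximation I wrote for illustration, not part of the answer: re.sub stands in for tr, and sample, counts and summary are hypothetical names.

```python
import re
from collections import Counter

# Sample data copied from the question.
sample = (
    "188H,190D,245H\n"
    "187D,481E,482T\n"
    "187H,194E,196D\n"
    "386D,388E,389N,579H\n"
    "44E,60D\n"
)

# Approximate tr '[0-9],' '\n': every digit and comma becomes a newline.
translated = re.sub(r"[0-9,]", "\n", sample)

# Count identical lines, including the empty ones.
counts = Counter(translated.splitlines())
print(counts[""])  # number of empty lines

# Keep only non-empty lines and print them as letter,count,
# which is the effect the sed step has on the uniq -c output.
summary = [f"{key},{n}" for key, n in sorted(counts.items()) if key]
print("\n".join(summary))
```

Each letter ends up alone on its own line, and every other newline produces an empty line, which is why uniq -c reports one large count with nothing after it.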