Counting specific characters in a file (Python)

I'd like to count specific things in a file, i.e. how many times "--undefined--" appears. Here is a piece of the file's content:

"jo:ns  76.434
pRE     75.417
zi:     75.178
dEnt    --undefined--
ba      --undefined--

I tried to use something like this, but it doesn't work:

with open("v3.txt", 'r') as infile:
    data = infile.readlines().decode("UTF-8")

    count = 0
    for i in data:
        if i.endswith("--undefined--"):
            count += 1
    print count

Do I have to implement, say, a dictionary of tuples to tackle this, or is there an easier solution?

EDIT:

The word in question appears only once in a line.

You can read all the data into one string, split the string into a list, and count occurrences of the substring in that list.

with open('afile.txt', 'r') as myfile:
    data=myfile.read().replace('\n', ' ')

data.split(' ').count("--undefined--")

or count directly on the string:

data.count("--undefined--")
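
The two counts above are not quite the same thing: the list version counts whole whitespace-separated fields, while the string version counts every occurrence of the substring. A minimal sketch (Python 3 assumed), using an in-memory string in place of the file; `sample`, `token_count` and `substring_count` are illustrative names, and the text is adapted from the excerpt in the question:

```python
# Sample text standing in for the file contents (adapted from the question's excerpt).
sample = (
    "jo:ns  76.434\n"
    "pRE     75.417\n"
    "zi:     75.178\n"
    "dEnt    --undefined--\n"
    "ba      --undefined--\n"
)

target = "--undefined--"

# Token-based count: only whole whitespace-separated fields equal to target.
token_count = sample.split().count(target)

# Substring count: every non-overlapping occurrence, even inside a longer token.
substring_count = sample.count(target)

print(token_count, substring_count)
```

For this data both give the same answer; they would differ only if the marker could appear glued to other characters.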

readlines() returns the list of lines, but they are not stripped (i.e. they still contain the newline character). Either strip them first:

data = [line.strip() for line in data]

or check for --undefined--\n :

if line.endswith("--undefined--\n"):
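
One caveat when choosing between these two options: the last line of a file may have no trailing newline, in which case the explicit "\n" check misses it. A small sketch (Python 3 assumed), with `io.StringIO` standing in for the real file and made-up contents:

```python
import io

# Hypothetical file contents; note the last line has no trailing newline.
fake_file = io.StringIO("dEnt    --undefined--\nba      --undefined--")
lines = fake_file.readlines()

# Matching the newline explicitly misses the final line.
with_newline = sum(line.endswith("--undefined--\n") for line in lines)

# Stripping trailing whitespace first handles both cases.
stripped = sum(line.rstrip().endswith("--undefined--") for line in lines)

print(with_newline, stripped)
```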

Alternatively, consider the string's .count() method:

file_contents.count("--undefined--")

Or don't limit yourself to .endswith(); use the in operator.

data = ''
count = 0

with open('v3.txt', 'r') as infile:
    data = infile.readlines()
print(data)

for line in data:
    if '--undefined--' in line:
        count += 1

count

Quoting Raymond Hettinger, "There must be a better way":

from collections import Counter

counter = Counter()
words = ('--undefined--', 'otherword', 'onemore')

with open("v3.txt", 'r') as f:
    lines = f.readlines()
    for line in lines:
        for word in words:
            if word in line:
                counter.update((word,))  # note the single element tuple

print(counter)  # the parenthesized form works on both Python 2 and Python 3
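
The same idea can be tried without a file on disk. This is only a sketch (Python 3 assumed): `io.StringIO` stands in for v3.txt, the contents are invented, and the Counter is fed from a generator expression instead of calling update() in a loop:

```python
import io
from collections import Counter

# Hypothetical stand-in for the file; the words searched for are illustrative.
fake_file = io.StringIO(
    "jo:ns  76.434\n"
    "dEnt    --undefined--\n"
    "ba      --undefined--\n"
    "x       otherword\n"
)
words = ("--undefined--", "otherword", "onemore")

# Count, for each word, the number of lines that contain it.
counter = Counter(
    word
    for line in fake_file
    for word in words
    if word in line
)

print(counter["--undefined--"], counter["otherword"], counter["onemore"])
```

A Counter returns 0 for keys it has never seen, so words that do not occur need no special handling.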

When reading a file line by line, each line ends with the newline character:

>>> with open("blookcore/models.py") as f:
...    lines = f.readlines()
... 
>>> lines[0]
'# -*- coding: utf-8 -*-\n'
>>> 

So your endswith() test just can't work; you have to strip the line first:

if i.strip().endswith("--undefined--"):
    count += 1

Now, reading a whole file into memory is more often than not a bad idea: even if the file fits in memory, it still eats resources for no good reason. Python's file objects are iterable, so you can just loop over your file. And finally, you can specify which encoding should be used when opening the file (instead of decoding manually), using the codecs module (Python 2) or directly (Python 3):

# py3
with open("your/file.text", encoding="utf-8") as f:

# py2:
import codecs
with codecs.open("your/file.text", encoding="utf-8") as f:

Then just use the builtin sum and a generator expression:

result = sum(line.strip().endswith("whatever") for line in f)

This relies on the fact that booleans are integers with values 0 (False) and 1 (True).
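
Putting the pieces of this answer together, a possible end-to-end version for Python 3 might look like the sketch below. The function name `count_lines_ending_with` and the temporary file are assumptions made for illustration; they are not part of the original answer:

```python
import os
import tempfile


def count_lines_ending_with(path, suffix, encoding="utf-8"):
    """Return how many lines of the file at `path` end with `suffix`."""
    with open(path, encoding=encoding) as f:
        # Each comparison yields True (1) or False (0), so sum() counts matches.
        return sum(line.strip().endswith(suffix) for line in f)


# Usage with a throwaway file containing part of the question's excerpt.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("jo:ns  76.434\ndEnt    --undefined--\nba      --undefined--\n")

result = count_lines_ending_with(tmp.name, "--undefined--")
os.remove(tmp.name)  # clean up the temporary file
print(result)
```

Because the file is iterated lazily, only one line is held in memory at a time.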
