Counting all occurrences of list elements, with and without special characters, in a text file in Python

Question

I really apologize if this has been answered before, but I have been scouring SO and Google for a couple of hours for how to do this properly. It should be easy, and I know I am missing something simple.

I am trying to read from a file and count all occurrences of elements from a list. The list is not just whole words, though: it also contains special characters and punctuation that I need to match.

This is what I have so far. I have been trying various approaches, and this post got me the closest: Python - Finding word frequencies of list of words in text file

So I have a file that contains a couple of paragraphs, and my list of strings is:

listToCheck = ['the','The ','the,','the;','the!','the\'','the.','\'the']

My full code is:

#!/usr/bin/python

import re
from collections import Counter

wanted = ['the','The ','the,','the;','the!','the\'','the.','\'the']

# Read the whole file, lowercase it, and split it into runs of word characters
with open('text.txt', 'r') as f:
    words = re.findall(r'\w+', f.read().lower())

cnt = Counter()
for word in words:
    if word in wanted:
        print(word)
        cnt[word] += 1

print(cnt)

My output so far looks like:

the
the
the
the
the
the
the
the
the
the
the
the
the
the
the
the
the
Counter({'the': 17})

It is counting my "the" strings with punctuation, but not as separate counters. I know it is because of the \w+. I am just not sure what the proper regex pattern is here, or whether I'm going about this the wrong way.

Answer 1

I suspect there may be some extra details to your specific problem that you are not describing here for simplicity. However, I'll assume that you want to find a given word, e.g. "the", which may start with either an upper- or lower-case letter, and which may be preceded and followed either by whitespace or by a punctuation character such as ;,.!'. You want to count each distinct instance of this general pattern.

I would define a single (non-disjunctive) regular expression that describes this. Something like this:

import re
pattern = re.compile(r"[\s',;.!][Tt]he[\s.,;'!]")

(That might not be exactly what you are looking for in general. I am just assuming it is, based on what you stated above.)

Now, let's say our text is

text = '''
Foo and the foo and ;the, foo. The foo 'the and the;
and the' and the; and foo the, and the. foo.
'''

We could do

matches = pattern.findall(text)

where matches will be

[' the ',
 ';the,',
 ' The ',
 "'the ",
 ' the;',
 " the'",
 ' the;',
 ' the,',
 ' the.']

And then you just count.

from collections import Counter
count = Counter()
for match in matches:
    count[match] += 1

which in this case would lead to

Counter({' the;': 2, ' the.': 1, ' the,': 1, " the'": 1, ' The ': 1, "'the ": 1, ';the,': 1, ' the ': 1})

As I said at the start, this might not be exactly what you want, but hopefully you can modify it to get what you need.
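One limitation worth keeping in mind, sketched here under the assumption that two occurrences may share a delimiter (the sample string is invented for illustration): the pattern consumes the character on each side of the word, so in "the the" the single space between the words can belong to only one match. Wrapping the same pattern in a lookahead with a capture group is one possible way to let re.findall report both.

```python
import re
from collections import Counter

pattern = re.compile(r"[\s',;.!][Tt]he[\s.,;'!]")
# Same pattern inside a zero-width lookahead, so delimiters are not consumed.
overlapping = re.compile(r"(?=([\s',;.!][Tt]he[\s.,;'!]))")

sample = " the the "  # hypothetical text where two matches share one space

plain = pattern.findall(sample)       # the shared space is used up by the first match
shared = overlapping.findall(sample)  # both occurrences are reported

print(plain)
print(Counter(shared))
```

Whether shared delimiters should count twice depends on what you are measuring, so treat this as an option rather than a correction.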

Just to add, a difficulty with using a disjunctive regular expression like

'the|the;|the,|the!'

is that strings like "the," and "the;" will also match the first option, i.e. "the", and that is what will be returned as the match. Even though this problem could be avoided by ordering the options more carefully, I think that might not be easier in general.
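The ordering issue can be seen in a small sketch (the sample string is made up for illustration). Python's regex alternation tries the options left to right and stops at the first one that matches, so a bare "the" placed first hides the punctuated forms.

```python
import re

sample = "the; the, the"  # hypothetical sample text

# Shorter alternative first: "the" wins at every position, punctuation is lost.
naive = re.findall(r"the|the;|the,|the!", sample)

# Longer alternatives first: punctuated forms are tried before bare "the".
ordered = re.findall(r"the;|the,|the!|the", sample)

print(naive)
print(ordered)
```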

Answer 2

The simplest option is to combine all "wanted" strings into one regular expression:

rr = '|'.join(map(re.escape, wanted))

and then find all matches in the text using re.findall.

To make sure longer strings match first, just sort the wanted list by length:

wanted.sort(key=len, reverse=True)
rr = '|'.join(map(re.escape, wanted))
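Put together, the approach might look like the following minimal sketch. The text variable is an invented stand-in for the contents of text.txt, and the text is deliberately not lowercased so that 'The ' can still match.

```python
import re
from collections import Counter

wanted = ['the', 'The ', 'the,', 'the;', 'the!', "the'", 'the.', "'the"]

# Longest strings first so that "the," is tried before bare "the".
ordered = sorted(wanted, key=len, reverse=True)
rr = '|'.join(map(re.escape, ordered))

# Made-up sample standing in for the contents of text.txt.
text = "The cat saw the dog; the, the. the! 'the end"

# Each match is one of the wanted strings, so Counter tallies them separately.
counts = Counter(re.findall(rr, text))
print(counts)
```

Note that this pattern has no word boundaries, so "the" inside a longer word such as "other" would be counted too. Also, matches do not overlap: in a string like 'the' (quoted on both sides) only one of "'the" and "the'" can be reported.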
