
Count all occurrences of elements with and without special characters in a list from a text file in python

I really apologize if this has been answered before but I have been scouring SO and Google for a couple of hours now on how to properly do this. It should be easy and I know I am missing something simple.

I am trying to read from a file and count all occurrences of elements from a list. This list is not just whole words though. It has special characters and punctuation that I need to get as well.

This is what I have so far. I have been trying various ways, and this post got me the closest: Python - Finding word frequencies of list of words in text file

So I have a file that contains a couple of paragraphs and my list of strings is:

listToCheck = ['the','The ','the,','the;','the!','the\'','the.','\'the']

My full code is:

#!/usr/bin/python

import re
from collections import Counter

f = open('text.txt','r')
wanted = ['the','The ','the,','the;','the!','the\'','the.','\'the']
words = re.findall('\w+', f.read().lower())
cnt = Counter()


for word in words:
  if word in wanted:
    print word
    cnt[word] += 1

print cnt

my output thus far looks like:

the
the
the
the
the
the
the
the
the
the
the
the
the
the
the
the
the
Counter({'the': 17})

It is counting my "the" strings with punctuation, but it is not counting them under separate keys. I know it is because of the \w+. I am just not sure what the proper regex pattern is here, or whether I'm going about this the wrong way.

I suspect there may be some extra details to your specific problem that you are not describing here for simplicity. However, I'll assume that what you are looking for is to find a given word, e.g. "the", which could have either an upper or lower case first letter, and can be preceded and followed either by whitespace or by some punctuation character such as ;,.!'. You want to count the number of occurrences of each distinct instance of this general pattern.

I would define a single (non-disjunctive) regular expression that defines this. Something like this:

import re
pattern = re.compile(r"[\s',;.!][Tt]he[\s.,;'!]")

(That might not be exactly what you are looking for in general. I am just assuming it is, based on what you stated above.)

Now, let's say our text is

text = '''
Foo and the foo and ;the, foo. The foo 'the and the;
and the' and the; and foo the, and the. foo.
'''

We could do

matches = pattern.findall(text)

where matches will be

[' the ',
 ';the,',
 ' The ',
 "'the ",
 ' the;',
 " the'",
 ' the;',
 ' the,',
 ' the.']

And then you just count.

from collections import Counter
count = Counter()
for match in matches:
    count[match] += 1

which in this case would lead to

Counter({' the;': 2, ' the.': 1, ' the,': 1, " the'": 1, ' The ': 1, "'the ": 1, ';the,': 1, ' the ': 1})
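
As a small aside, Counter accepts any iterable, so the explicit loop can be replaced by passing the list of matches to it directly. A minimal sketch, reusing the pattern and sample text from above (Python 3 print syntax is an assumption here; the question's code is Python 2):

```python
import re
from collections import Counter

# Same pattern as above: one delimiter, then "the"/"The", then one delimiter.
pattern = re.compile(r"[\s',;.!][Tt]he[\s.,;'!]")

text = '''
Foo and the foo and ;the, foo. The foo 'the and the;
and the' and the; and foo the, and the. foo.
'''

# Counter tallies the matched strings without a manual loop.
count = Counter(pattern.findall(text))

# most_common() lists (match, count) pairs, highest count first.
print(count.most_common())
```

Note that each match consumes its surrounding delimiters, so two occurrences that share a single delimiter (for example "the the") would only be counted once with this pattern.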

As I said at the start, this might not be exactly what you want, but hopefully you could modify this to get what you want.

Just to add, a difficulty with using a disjunctive regular expression like

'the|the;|the,|the!'

is that strings like "the," and "the;" will also match the first option, i.e. "the", and that is what will be returned as the match. Although this problem can be avoided by ordering the options more carefully (longer alternatives first), I don't think that is easier in general.
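
To make the ordering issue concrete, here is a minimal sketch comparing the alternation above with a reordered one. The sample string and the variable names naive and ordered are made up for illustration:

```python
import re

sample = "the, the; the! the"

# Python tries alternatives left to right and stops at the first that
# matches, so the bare 'the' wins before the longer options are tried.
naive = re.findall(r"the|the;|the,|the!", sample)
print(naive)    # ['the', 'the', 'the', 'the']

# Putting the longer alternatives first lets the punctuation be captured.
ordered = re.findall(r"the;|the,|the!|the", sample)
print(ordered)  # ['the,', 'the;', 'the!', 'the']
```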

The simplest option is to combine all "wanted" strings into one regular expression:

rr = '|'.join(map(re.escape, wanted)) 

and then find all matches in the text using re.findall.

To make sure longer strings match first, just sort the wanted list by length, longest first:

wanted.sort(key=len, reverse=True)
rr = '|'.join(map(re.escape, wanted)) 
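
Putting the pieces together, a rough end-to-end sketch might look like the following. The sample text is made up and stands in for the contents of text.txt, and the text is deliberately not lowercased so that 'The ' can still match; sorted() is used instead of sort() only to leave the original list untouched:

```python
import re
from collections import Counter

wanted = ['the', 'The ', 'the,', 'the;', 'the!', "the'", 'the.', "'the"]

# Longest strings first, so 'the,' is tried before the bare 'the'.
by_length = sorted(wanted, key=len, reverse=True)

# re.escape protects characters such as '.' and '!' inside the pattern.
rr = '|'.join(map(re.escape, by_length))

# Hypothetical sample text; in the question this would be f.read().
text = "The cat, the dog; the, the! and 'the end of the. Also the, again."

cnt = Counter(re.findall(rr, text))
print(cnt)
```

One caveat: this matches substrings, so the bare 'the' alternative would also fire inside longer words such as "then" or "other". If that matters, the pattern would need word boundaries or explicit delimiters like the ones in the first answer.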
