
Python: Using .isalpha() to count specific words/characters in a word count

I've created a function which can count specific words or characters in a text file.

But I want to add a condition so that the function only counts a character if it is surrounded by letters. For example, take this text from the file:

'This test is an example, this text doesn't have any meaning. It is only an example.'

If I run this text through my function and count apostrophes ('), it returns 3. However, I want it to return 1: only apostrophes between two letters (e.g. isn't or won't) should be counted, and every other apostrophe, such as a single quote that isn't surrounded by letters, should be ignored.

I've tried to use the .isalpha() method but am having trouble with the syntax.

I think regular expressions would be better for this, but if you must use isalpha, something like:

s = "'This test is an example, this text doesn't have any meaning. It is only an example.'"
sum(s[i-1].isalpha() and s[i]=="'" and s[i+1].isalpha() for i in range(1,len(s)-1))

returns 1.
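
The same check can be wrapped in a small reusable function. This is only a minimal sketch: the name count_enclosed and its char parameter are hypothetical and do not come from the question.

```python
def count_enclosed(text, char="'"):
    """Count occurrences of `char` that have a letter directly on both sides."""
    # Skip the first and last positions: they cannot have a neighbour on both sides.
    return sum(
        text[i - 1].isalpha() and text[i] == char and text[i + 1].isalpha()
        for i in range(1, len(text) - 1)
    )


s = "'This test is an example, this text doesn't have any meaning. It is only an example.'"
print(count_enclosed(s))  # 1
```

Because the range starts at 1 and stops before the last index, the enclosing quotes at either end of the string are never examined, so no IndexError can occur.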

If you just want to discount the quotes that are enclosing the string itself, the easiest way might be just to strip those off the string before counting.

>>> text = "'This test is an example, this text doesn't have any meaning. It is only an example.'"
>>> text.strip("'").count("'")
1
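
Note that strip only removes characters from the two ends of the string, so any quotation marks inside the text are still counted. A short illustration, using a made-up sample sentence:

```python
# Hypothetical sample: an outer pair of quotes plus a quoted word inside.
text = "'Text that has a 'quote' in it.'"

# strip("'") removes only the leading and trailing apostrophes.
inner_count = text.strip("'").count("'")
print(inner_count)  # 2
```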

Another way would be with a regular expression like \w'\w, i.e. a word character, followed by ', followed by another word character (note that \w matches digits and the underscore as well as letters):

>>> import re
>>> sum(1 for _ in re.finditer(r"\w'\w", text))
1

This also works for quotes inside the string:

>>> text = "Text that has a 'quote' in it."
>>> sum(1 for _ in re.finditer(r"\w'\w", text))
0

But it will also miss apostrophes that are not followed by another letter:

>>> text = "All the houses' windows were broken."
>>> sum(1 for _ in re.finditer(r"\w'\w", text))
0
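
A related limitation: each match consumes the characters on both sides of the apostrophe, so two apostrophes that share a neighbouring letter are counted only once. One possible workaround, sketched here with a made-up sample string, is to use lookbehind and lookahead assertions, which check the neighbours without consuming them:

```python
import re

# Hypothetical sample: the "n" sits between two apostrophes.
text = "rock'n'roll"

# The plain pattern consumes "k'n", so the second apostrophe has no letter left before it.
consuming = len(re.findall(r"\w'\w", text))

# Lookarounds only assert that a letter is present on each side.
lookaround = len(re.findall(r"(?<=[A-Za-z])'(?=[A-Za-z])", text))

print(consuming, lookaround)  # 1 2
```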

As xnx already noted, the proper way to do this is with regular expressions:

import re

text = "'This test is an example, this text doesn't have any meaning. It is only an example.'"

print(len(re.findall(r"[a-zA-Z]'[a-zA-Z]", text)))  # prints 1

Here the apostrophe in the pattern is surrounded by the character class [a-zA-Z], which covers the ASCII English letters. There are also a number of predefined character classes, such as \w; see the re module documentation for details.
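
One consequence of [a-zA-Z] is that accented letters are not matched. If that matters, a commonly used idiom is [^\W\d_], which takes the word characters and excludes decimal digits and the underscore. The sketch below uses a made-up French sample to show the difference; treat the idiom as an approximation of "any letter" rather than an exact definition.

```python
import re

# Hypothetical sample with an accented letter right after an apostrophe.
text = "C'est l'été."

# ASCII letters only: "l'é" is not matched because "é" is outside a-z.
count_ascii = len(re.findall(r"[a-zA-Z]'[a-zA-Z]", text))

# Word characters minus decimal digits and underscore: "é" is accepted.
count_unicode = len(re.findall(r"[^\W\d_]'[^\W\d_]", text))

print(count_ascii, count_unicode)  # 1 2
```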

You should just use regex:

import re

text = "'This test is an example, this text doesn't have any meaning. It is only an example.'"

wordWrappedApos = re.compile(r"\w'\w")
found = wordWrappedApos.findall(text)
print(found)
print(len(found))

Replace "\w" with "[A-Za-z]" if you want to make sure no digits (or underscores) are matched.
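
To see the difference between the two patterns, here is a small comparison on a made-up sentence that contains an apostrophe between digits:

```python
import re

# Hypothetical sample: one contraction and one height written as 5'11.
text = "He isn't 5'11, is he?"

# \w accepts digits, so both "n't" and "5'1" match.
word_chars = re.findall(r"\w'\w", text)

# [A-Za-z] accepts letters only, so "5'1" is skipped.
letters_only = re.findall(r"[A-Za-z]'[A-Za-z]", text)

print(word_chars)    # ["n't", "5'1"]
print(letters_only)  # ["n't"]
```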
