String contains elongated words

My string is: "sooo dear how areeeee youuuuuu"

I want to check whether the words in the string are elongated.

Elongated means that a character in the word is repeated more than twice in a row. For example, too is not elongated, but tooo is.
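
To pin that definition down, here is a minimal sketch of it as a predicate. It is not code from the original post; the name is_elongated and the min_run parameter are made up for illustration.

```python
def is_elongated(word, min_run=3):
    """Return True if any character occurs min_run or more times in a row."""
    run = 0
    previous = None
    for char in word:
        # Extend the current run, or start a new one on a different character.
        run = run + 1 if char == previous else 1
        if run >= min_run:
            return True
        previous = char
    return False


print(is_elongated("too"))   # False
print(is_elongated("tooo"))  # True
```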

>>> import itertools
>>> my_str = 'soooo hiiiii whyyyy done'
>>> print([[g[0], sum(1 for _ in g[1])] for g in itertools.groupby(my_str)])
[['s', 1], ['o', 4], [' ', 1], ['h', 1], ['i', 5], [' ', 1], ['w', 1], ['h', 1], 
['y', 4], [' ', 1], ['d', 1], ['o', 1], ['n', 1], ['e', 1]]

I want to display that sooo, areeeee and youuuuuu are elongated. I did an individual character count, but I want to check every word to see whether it is elongated or not.

A regex comes to mind:

>>> my_str = 'soooo hiiiii whyyyy done'
>>> import re
>>> regex = re.compile(r"(.)\1{2}")
>>> [word for word in my_str.split() if regex.search(word)]
['soooo', 'hiiiii', 'whyyyy']

Explanation:

(.)    # Match any character, capture it in group number 1
\1{2}  # Try to match group number 1 here, twice.
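
If you also want to see which character is repeated and how long the run is, a small variation may help. This is only a sketch: it swaps {2} for {2,} so the match extends over the whole run, and the name run_regex is chosen here for illustration.

```python
import re

# Same idea as above, but {2,} lets the match extend over the whole run.
run_regex = re.compile(r"(.)\1{2,}")

my_str = "soooo hiiiii whyyyy done"
for word in my_str.split():
    match = run_regex.search(word)
    if match:
        # group(0) is the full run, group(1) the repeated character.
        print(word, match.group(1), len(match.group(0)))
```

For the sample string this reports a run of 4 for soooo, 5 for hiiiii and 4 for whyyyy, and prints nothing for done.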

Note that this algorithm will also find some unelongated words like countessship or laparohysterosalpingooophorectomy, but I guess those false positives are rare :)
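
If those false positives matter, one possible workaround is to skip words that are known to be spelled that way. The sketch below assumes a hand-written allow-list; KNOWN_WORDS and elongated_words are hypothetical names, and a real list would probably come from a dictionary file.

```python
import re

TRIPLE = re.compile(r"(.)\1{2}")

# Hypothetical allow-list; a real one would come from a dictionary file.
KNOWN_WORDS = {"countessship"}


def elongated_words(text, known_words=KNOWN_WORDS):
    # Keep words with a tripled character unless they are known real words.
    return [
        word for word in text.split()
        if TRIPLE.search(word) and word.lower() not in known_words
    ]


print(elongated_words("the countessship was sooo long"))  # ['sooo']
```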

You can use:

import itertools

def get_groups(word):
    # Split a word into runs of identical consecutive characters.
    return [list(g) for k, g in itertools.groupby(word)]

print([word for word in my_str.split(' ') if any(len(x) > 2 for x in get_groups(word))])

Here's how it works: get_groups turns a word into groups of identical consecutive characters. So 'sooo' becomes [['s'], ['o', 'o', 'o']].

We then keep only the words in which at least one group is longer than two. This means you'll end up with all words that contain the same character three or more times in a row.
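
To see the group lengths behind that filter, here is a rough sketch that reports the longest run per word. The helper name longest_run is made up for this example, and it uses the string from the question.

```python
import itertools


def longest_run(word):
    # Length of the longest group of identical consecutive characters.
    return max((len(list(g)) for _, g in itertools.groupby(word)), default=0)


my_str = "sooo dear how areeeee youuuuuu"
print({word: longest_run(word) for word in my_str.split()})
# {'sooo': 3, 'dear': 1, 'how': 1, 'areeeee': 5, 'youuuuuu': 6}
```

Any word whose longest run is greater than two is elongated under the definition in the question.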

You can collapse consecutive duplicate letters and compare the length of the result with the length of the original word, without importing anything:

>>> list(filter(lambda word: len([letter for index, letter in enumerate(word) if index == 0 or word[index - 1] != letter]) == len(word), my_str.split(" ")))
['done']

>>> list(filter(lambda word: len([letter for index, letter in enumerate(word) if index == 0 or word[index - 1] != letter]) != len(word), my_str.split(" ")))
['soooo', 'hiiiii', 'whyyyy']
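
One thing to be aware of: the length comparison fires on any doubled letter, so a word like too is reported as well. The sketch below contrasts it with a stricter no-import check that requires three in a row, as the question's definition does. Both function names are invented for this example.

```python
def has_any_repeat(word):
    # Same test as the lambdas above: drop consecutive duplicates, compare lengths.
    collapsed = [c for i, c in enumerate(word) if i == 0 or word[i - 1] != c]
    return len(collapsed) != len(word)


def has_triple(word):
    # True only if some character appears three times in a row.
    return any(word[i] == word[i - 1] == word[i - 2] for i in range(2, len(word)))


words = "too tooo done".split()
print([w for w in words if has_any_repeat(w)])  # ['too', 'tooo']
print([w for w in words if has_triple(w)])      # ['tooo']
```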

Or import itertools and do it with groupby:

>>> list(filter(lambda word: len([letter for letter, gp in itertools.groupby(word)]) == len(word), my_str.split(" ")))
['done']

>>> list(filter(lambda word: len([letter for letter, gp in itertools.groupby(word)]) != len(word), my_str.split(" ")))
['soooo', 'hiiiii', 'whyyyy']

This last solution also lets you use itertools.ifilter instead of filter (in Python 3 the built-in filter is already lazy) and iterate over the matching or non-matching words one at a time, which is useful for streams or very large strings.
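
For the streaming case, a generator is one way to get the same lazy behaviour. This is a sketch under assumptions rather than the answer's exact test: it uses the three-in-a-row regex, iter_elongated is a hypothetical name, and io.StringIO stands in for a real file or stream.

```python
import io
import re

TRIPLE = re.compile(r"(.)\1{2}")


def iter_elongated(lines):
    """Lazily yield elongated words from an iterable of text lines."""
    for line in lines:
        for word in line.split():
            if TRIPLE.search(word):
                yield word


# io.StringIO stands in for a file or other stream here.
stream = io.StringIO("soooo hiiiii\nwhyyyy done\n")
for word in iter_elongated(stream):
    print(word)
```

Only one line is held in memory at a time, so the input never has to be loaded as a single string.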
