
Stemmer function that takes a string and returns the stems of each word in a list

I am trying to write a function that takes a string as input and returns a list containing the stem of each word in the string. The problem is that, with a nested for loop, each word in the string is appended to the list multiple times. Is there a way to avoid this?

def stemmer(text):
    
    stemmed_string = []
    res = text.split()
    suffixes = ('ed', 'ly', 'ing')
    
    for word in res:
            for i in range(len(suffixes)):
                if word.endswith(suffixes[i]):
                    stemmed_string.append(word[:-len(suffixes[i])])
                elif len(word) > 8:
                    stemmed_string.append(word[:8])
                else:
                    stemmed_string.append(word)
    
    return stemmed_string

If I call the function on the text 'I have a dog that is barking', this is the output:

['I',
 'I',
 'I',
 'have',
 'have',
 'have',
 'a',
 'a',
 'a',
 'dog',
 'dog',
 'dog',
 'that',
 'that',
 'that',
 'is',
 'is',
 'is',
 'barking',
 'barking',
 'bark']

You are appending something on every pass of the inner loop over suffixes, so each word lands in the list once per suffix (three times here). To avoid the problem, append once per word, after the inner loop has finished.
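As a minimal sketch of that advice, here is one possible reading of the question's three rules: strip the first matching suffix, otherwise truncate words longer than 8 characters, otherwise keep the word. The name stemmer_single_append and the choice to stop at the first matching suffix are assumptions made for illustration, not something the question specifies.

```python
def stemmer_single_append(text):
    """Apply the question's three rules, appending exactly once per word.

    Assumption: only the first matching suffix is stripped.
    """
    suffixes = ('ed', 'ly', 'ing')
    stems = []
    for word in text.split():
        for suffix in suffixes:
            if word.endswith(suffix):
                stem = word[:-len(suffix)]
                break  # stop at the first suffix that matches
        else:
            # No suffix matched: fall back to the length rule.
            stem = word[:8] if len(word) > 8 else word
        stems.append(stem)  # one append per word, outside the inner loop
    return stems


print(stemmer_single_append('I have a dog that is barking'))
# ['I', 'have', 'a', 'dog', 'that', 'is', 'bark']
```

The for/else construct runs the else branch only when the inner loop finishes without hitting break, which is what keeps the fallback from firing once per suffix.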

It's not clear whether you want to add the shortest possible string out of a set of candidates, or how stacked suffixes should be handled. Here's a version which strips every suffix that matches, checking each one once in the order listed.

def stemmer(text):
    stemmed_string = []
    suffixes = ('ed', 'ly', 'ing')
    
    for word in text.split():
        for suffix in suffixes:
            if word.endswith(suffix):
                word = word[:-len(suffix)]
        stemmed_string.append(word)
    
    return stemmed_string

Notice, too, that the inner loop now iterates over the suffixes directly (for suffix in suffixes) instead of indexing with range(len(suffixes)).

This reduces "sparingly" to "spar", and so on. Like every naïve stemmer, it also mangles words such as "sly" (which becomes "s") and "thing" (which becomes "th").
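One common way to soften this, sketched here under assumptions, is to refuse to strip a suffix when too little of the word would remain. The function name stemmer_guarded and the min_stem parameter (with its default of 3) are hypothetical and chosen only for illustration. This is a heuristic: a word like "string" would still be cut down to "str".

```python
def stemmer_guarded(text, min_stem=3):
    """Strip suffixes only when at least min_stem characters would remain.

    min_stem is a hypothetical tuning knob, not part of the original answer.
    """
    suffixes = ('ed', 'ly', 'ing')
    stems = []
    for word in text.split():
        for suffix in suffixes:
            # Skip the suffix if stripping it would leave too short a stem.
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[:-len(suffix)]
        stems.append(word)
    return stems


print(stemmer_guarded('sly thing sparingly barking'))
# ['sly', 'thing', 'spar', 'bark']
```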

Demo: https://ideone.com/a7FqBp
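Because the answer's stemmer checks each suffix only once, in tuple order, a word like "excitedly" comes out as "excited": by the time "ly" has been removed, "ed" has already been tested. If the goal is to keep stripping until nothing matches, a sketch along these lines would do it. The name stemmer_exhaustive is an assumption made for illustration.

```python
def stemmer_exhaustive(text):
    """Repeat the suffix pass until a full pass strips nothing."""
    suffixes = ('ed', 'ly', 'ing')
    stems = []
    for word in text.split():
        stripped = True
        while stripped:
            stripped = False
            for suffix in suffixes:
                if word.endswith(suffix):
                    word = word[:-len(suffix)]
                    stripped = True  # something changed, so try another pass
        stems.append(word)
    return stems


print(stemmer_exhaustive('excitedly sparingly'))
# ['excit', 'spar']
```

The loop always terminates because every successful strip makes the word shorter. It is also more aggressive, so it over-stems short words even more readily than the single-pass version.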
