
How can I replace the sentences in a pandas.Series with their stemmed versions?

I have a pandas.Series named 'traindata':

    0       Published: 4:53AM Friday August 29, 2014 Sourc...
    1       8  Have your say\n\n\nPlaying low-level club c...
    2       Rohit Shetty has now turned producer. But the ...
    3       A TV reporter in Serbia almost lost her job be...
    4       THE HAGUE -- Tony de Brum was 9 years old in 1...
    5       Australian TV cameraman Harry Burton was kille...
    6       President Barack Obama sharply rebuked protest...
    7       The car displaying the DIE FOR SYRIA! sticker....
    8       \nIf you've ever been, you know that seeing th...
    9       \nThe former executive director of JBWere has ...
    10      Waterloo Road actor Joe Slater has revealed hi...
                        ... 
    Name: traindata, Length: 2284, dtype: object

What I want to do is replace the values of the Series with the stemmed sentences.

My idea is to build a new Series and put the stemmed sentences into it. My code is below:

    from nltk.stem.porter import PorterStemmer

    stem_word_data = np.zeros([2284,1])
    ps = PorterStemmer()
    for i in range(0,len(traindata)):
        tst = word_tokenize(traindata[i]) 
        for word in tst:
            word = ps.stem(word)    
            stem_word_data[i] = word

Then an error occurs:

    ValueError: could not convert string to float: 'publish'

Does anyone know how to fix this error, or have a better idea for replacing the values of the Series with the stemmed sentences? Thanks.
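On the error itself: np.zeros allocates an array of float64 by default, so assigning the string 'publish' into it makes NumPy try to parse it as a number, which fails. The small sketch below reproduces that with a made-up three-row buffer (the names buffer and words are illustrative, not from the question) and shows that an object-dtype array accepts strings.

```python
import numpy as np

# np.zeros uses float64 unless another dtype is requested.
buffer = np.zeros([3, 1])

error_message = None
try:
    buffer[0] = "publish"  # NumPy tries to convert the string to a float
except ValueError as exc:
    error_message = str(exc)

print(buffer.dtype)
print(error_message)

# An object-dtype array can hold arbitrary Python objects, including strings.
words = np.empty(3, dtype=object)
words[0] = "publish"
print(words)
```

Even with a string-capable container, the loop in the question would overwrite slot i with each word in turn, so only the last stem of every sentence would remain.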

You can use apply on the Series and avoid writing explicit loops.

    import pandas as pd
    from nltk import word_tokenize
    from nltk.stem import PorterStemmer

    # initialise the stemmer
    pst = PorterStemmer()

    # sample data frame
    df = pd.DataFrame({'senten': ['I am not dancing', 'You are playing']})

    # tokenize each sentence, then stem every token and join the stems back into a string
    df['senten'] = df['senten'].apply(word_tokenize)
    df['senten'] = df['senten'].apply(lambda x: ' '.join([pst.stem(y) for y in x]))

    print(df)

Output:

              senten
    0  I am not danc
    1   you are play
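Since the question has a Series rather than a DataFrame, the same idea can be applied to it directly. The following is a minimal sketch, not a definitive implementation: stem_sentence is a hypothetical helper, the two-row traindata is a stand-in for the real data, and the default tokenizer is plain whitespace splitting (str.split) so that the example runs without NLTK's tokenizer data.

```python
import pandas as pd
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_sentence(text, tokenize=str.split):
    """Stem every token in `text` and join the stems back into one string."""
    return " ".join(stemmer.stem(token) for token in tokenize(text))

# Hypothetical stand-in for the 'traindata' Series from the question.
traindata = pd.Series(["I am not dancing", "You are playing"], name="traindata")

# apply returns a new Series with the same index; reassigning replaces the values.
traindata = traindata.apply(stem_sentence)
print(traindata)
```

To tokenize with NLTK instead, the tokenizer can be passed through apply as a keyword argument, for example traindata.apply(stem_sentence, tokenize=word_tokenize); that variant needs the NLTK tokenizer data to be downloaded first.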
