Is there a way to improve the performance of the nltk.sentiment.vader sentiment analyzer?

Question

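My text comes from a social network, so you can imagine its nature. I believe the text is as clean as I can make it, after the following sanitization: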

no URLs, no usernames
no punctuation, no accents
no numbers
no stopwords (I think vader does this anyway)

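I believe the run time is linear, and I don't intend to parallelize anything, because changing the existing code would take a lot of work. For example, for about 1000 texts ranging from ~50 KB to ~150 KB each, the run time on my machine is around 10 minutes.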

Is there a better way of feeding the algorithm to speed up the cooking time? The code is as simple as SentimentIntensityAnalyzer itself; here is the main part:

import gc
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

# c and conn are an already-open database cursor and connection; s is the search term
c.execute("select body, creation_date, group_id from posts where (substring(lower(body) from (%s))=(%s)) and language='en' order by creation_date DESC", (s, s))
conn.commit()
if c.rowcount > 0:
    dump_fetched = c.fetchall()

textsSql = pd.DataFrame(dump_fetched, columns=['body', 'created_at', 'group_id'])
del dump_fetched
gc.collect()
texts = textsSql['body'].values
# here, some data manipulation: steps listed above
polarity_ = [sid.polarity_scores(s)['compound'] for s in texts]

Answer 1

1. You don't need to remove the stopwords; nltk + vader already does that.

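2. You don't need to remove the punctuation either. Apart from the processing overhead, removing it changes vader's polarity calculation, so keep the punctuation.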

    >>> txt = "this is superb!"
    >>> s.polarity_scores(txt)
    {'neg': 0.0, 'neu': 0.313, 'pos': 0.687, 'compound': 0.6588}
    >>> txt = "this is superb"
    >>> s.polarity_scores(txt)
    {'neg': 0.0, 'neu': 0.328, 'pos': 0.672, 'compound': 0.6249}

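3. You should also introduce sentence tokenization, since it improves accuracy, and then compute the average polarity of a paragraph from its sentences. Example here: https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vaderSentiment.py#L517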
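The sentence-level averaging described in this point could look roughly like the sketch below. The names `split_sentences` and `paragraph_compound` are hypothetical, and the regex splitter is a naive stand-in (an assumption); `nltk.tokenize.sent_tokenize` is the more robust choice when the punkt data is installed. The scorer is passed in as a callable, so `sid.polarity_scores` from the question fits directly.

```python
import re
from statistics import mean
from typing import Callable, Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def paragraph_compound(
    text: str,
    polarity_scores: Callable[[str], Dict[str, float]],
    splitter: Callable[[str], List[str]] = split_sentences,
) -> float:
    """Average the per-sentence 'compound' score over one paragraph."""
    sentences = splitter(text)
    if not sentences:
        return 0.0  # nothing to score
    return mean(polarity_scores(s)["compound"] for s in sentences)
```

With the analyzer from the question, a call would be `paragraph_compound(text, sid.polarity_scores)`. To use NLTK's tokenizer instead, pass `splitter=nltk.tokenize.sent_tokenize`.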

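4. The polarity calculations are completely independent of each other, so a multiprocessing pool of a small size (say 10) can give a good speed boost for this line: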

polarity_ = [sid.polarity_scores(s)['compound'] for s in texts]
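As a rough sketch of the pool idea, assuming the texts are plain strings: `vader_compound` and `score_in_parallel` are hypothetical names, and the pool size and chunk size are starting points to tune rather than recommendations. The analyzer is created lazily inside the scoring function so that each worker process builds its own instance once.

```python
from multiprocessing import Pool
from typing import Callable, Iterable, List

_sid = None  # created on first use, so every worker process gets its own analyzer


def vader_compound(text: str) -> float:
    """Return VADER's compound score for one text (needs nltk and vader_lexicon)."""
    global _sid
    if _sid is None:
        from nltk.sentiment.vader import SentimentIntensityAnalyzer
        _sid = SentimentIntensityAnalyzer()
    return _sid.polarity_scores(text)["compound"]


def score_in_parallel(
    texts: Iterable[str],
    score: Callable[[str], float] = vader_compound,
    processes: int = 10,
    chunksize: int = 8,
) -> List[float]:
    """Score texts with a process pool; results keep the input order."""
    texts = list(texts)
    if processes <= 1 or len(texts) < 2:
        # Serial fallback: handy for debugging and for tiny inputs.
        return [score(t) for t in texts]
    with Pool(processes=processes) as pool:
        return pool.map(score, texts, chunksize=chunksize)
```

In the question's code this would replace the list comprehension with `polarity_ = score_in_parallel(texts)`. On platforms that start workers by spawning a fresh interpreter, the call needs to sit under an `if __name__ == "__main__":` guard.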

Solution 1
1 accepted 2017-08-09 13:21:20
