
How to sentence Tokenize a list of Paragraphs in Python?

I am currently learning the word2vec technique and I am stuck on sentence-tokenizing my text data. Hopefully someone can help me work out the right approach.

My data is a set of customer complaint records. When I load the data into a Python list, it looks like this:

text = ["this is the first sentence of the first paragraph. and this is the second sentence.",
        "some random text in the second paragraph. and another test sentence.",
        "here is the third paragraph. and this is another sentence",
        "I have run out of text here. I am learning python and deep learning.",
        "another paragraph with some random text. the this is a learning sample.",
        "I need help implementing word2vec. this all sounds exciting.",
        "it's sunday and I shoudnt be learning in the first place. it's nice and sunny here."]

I tried some of the most commonly used sentence tokenizer methods from the community, and they all return this error:

TypeError: expected string or bytes-like object

Eventually, I found this:

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = tokenizer.tokenize(text[:5][4]) 
sentences

This sort of works, but I can't figure out what indices to put into the [ ][ ]s (e.g. 5 and 4) to tokenize the whole dataset (all the paragraphs) into sentences.
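As a side note on the indexing: `text[:5][4]` slices the first five paragraphs and then takes the fifth, which is exactly `text[4]`, so it only ever tokenizes one paragraph. To cover the whole dataset, iterate over the list instead of hand-picking indices. A minimal sketch (using `str.split` as a stand-in for `tokenizer.tokenize`, which needs the NLTK punkt model loaded):

```python
text = ['para one. still para one.',
        'para two. more of para two.',
        'para three.', 'para four.', 'para five.']

# text[:5][4] slices the first five items, then takes the fifth --
# exactly the same element as text[4]
assert text[:5][4] == text[4]

# to tokenize every paragraph, loop over the list;
# split('. ') here is only a placeholder for tokenizer.tokenize()
all_sentences = []
for paragraph in text:
    all_sentences.extend(paragraph.split('. '))
```

This produces one flat list of sentences drawn from every paragraph, which is the usual input shape for word2vec training.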

Sorry if my question is vague; please ask if anything needs clarifying.

Many thanks.

You can use nltk.tokenize.word_tokenize() in a list comprehension over the list, as follows:

In [112]: from nltk.tokenize import word_tokenize
In [113]: tokenized = [word_tokenize(sent) for sent in text]

Output:

[['this',
  'is',
  'the',
  'first',
  'sentence',
  'of',
  'the',
  'first',
  'paragraph',
  '.',
  'and',
  'this',
  'is',
  'the',
  'second',
  'sentence',
  '.'],
 ['some',
  'random',
  'text',
  'in',
  'the',
  'second',
  'paragraph',
  ...
 ]]
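Note that word_tokenize splits each paragraph into words, whereas the question asked for sentences. The sentence-level equivalent is the same list-comprehension pattern with a sentence splitter; a minimal stdlib sketch, using a naive regex as a stand-in for nltk.tokenize.sent_tokenize (the more robust choice, since it handles abbreviations and other edge cases):

```python
import re

text = ['this is the first sentence of the first paragraph. and this is the second sentence.',
        'some random text in the second paragraph. and another test sentence.']

# split after '.', '!' or '?' followed by whitespace -- a rough
# approximation of sent_tokenize applied to each paragraph
sentences = [s for para in text
             for s in re.split(r'(?<=[.!?])\s+', para)]
```

With NLTK installed, replacing the inner `re.split(...)` with `sent_tokenize(para)` gives the same flat list of sentences without the regex's limitations.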

