Python Bag of Words
[PYTHON 3.x] Hello everyone, I am working on a project in natural language processing and need some help. I have created a vocabulary (list) of the distinct words from all the documents. I want to create a vector for each document against this vocabulary list. (Doc_POS_words contains 100 documents, in this form: Doc_POS_words[0] = 1st doc, Doc_POS_words[1] = 2nd doc, and so on.)
Output:
# Doc_POS_words = [contains all the words of each document as below]
Doc_POS_words = [
['war','life','travel','live','night'],
['books','student','travel','study','yellow'],
]
# myVoc = [distinct words from all the documents as below]
myVoc = [
'war',
'life',
'travel',
'live',
'night',
'books',
'student',
'study',
'yellow'
]
# myVoc_vector = [ need this as well ]
# Doc_POS_words_BoW = [need this for each document]
PS: I am not using NLTK because the language I am working with is not supported by NLTK.
Thanks.
Check TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["Doc 1 words",
          "Doc 2 words"]
vectorizer = TfidfVectorizer(min_df=1)
vectors = vectorizer.fit_transform(corpus)
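If scikit-learn is not an option either, the Doc_POS_words_BoW vectors the question asks for can be built in plain Python. This is a minimal sketch using the sample data from the question: each document becomes a list of word counts aligned with the positions in myVoc.

```python
# Sample data from the question.
Doc_POS_words = [
    ['war', 'life', 'travel', 'live', 'night'],
    ['books', 'student', 'travel', 'study', 'yellow'],
]

myVoc = ['war', 'life', 'travel', 'live', 'night',
         'books', 'student', 'study', 'yellow']

# One bag-of-words vector per document: the i-th entry is how many
# times myVoc[i] occurs in that document.
Doc_POS_words_BoW = [
    [doc.count(word) for word in myVoc]
    for doc in Doc_POS_words
]

print(Doc_POS_words_BoW[0])  # [1, 1, 1, 1, 1, 0, 0, 0, 0]
print(Doc_POS_words_BoW[1])  # [0, 0, 1, 0, 0, 1, 1, 1, 1]
```

`list.count` rescans the document once per vocabulary word, which is fine for 100 short documents; for larger corpora a `collections.Counter` per document avoids the repeated scans.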
I'm still not sure what you are asking, so I will give you some general help. I think what you need is to use Python sets.

https://docs.python.org/3/tutorial/datastructures.html#sets

Here are some examples for you, using the data in your question:
# create a set of the whole word list
myVocSet = set(myVoc)

for doc_words in Doc_POS_words:
    # convert from list to set
    doc_words = set(doc_words)
    # want to find words in the doc that are also in the vocabulary list?
    print(myVocSet.intersection(doc_words))
    # want to find words in your doc not in the vocabulary list?
    print(doc_words.difference(myVocSet))
    # want to find words in the vocab list not used in your doc?
    print(myVocSet.difference(doc_words))
Here is some more to help:
>>> x = set(('a', 'b', 'c', 'd'))
>>> y = set(('c', 'd', 'e', 'f'))
>>>
>>> x.difference(y)
{'a', 'b'}
>>> y.difference(x)
{'f', 'e'}
>>> x.intersection(y)
{'c', 'd'}
>>> y.intersection(x)
{'c', 'd'}
>>> x.union(y)
{'a', 'b', 'd', 'f', 'e', 'c'}
>>> x.symmetric_difference(y)
{'a', 'b', 'f', 'e'}
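These set operations also cover the vocabulary-building step itself: the union of the per-document sets is exactly the list of distinct words. A minimal sketch, assuming the same Doc_POS_words data as above and using collections.Counter for the per-document counts:

```python
from collections import Counter

Doc_POS_words = [
    ['war', 'life', 'travel', 'live', 'night'],
    ['books', 'student', 'travel', 'study', 'yellow'],
]

# Union of per-document sets = distinct vocabulary.
# sorted() just fixes a stable ordering for the vectors.
myVoc = sorted(set().union(*Doc_POS_words))

# Counter maps each word to its count and returns 0 for absent words,
# so indexing with every vocabulary word yields the BoW vector.
counts = [Counter(doc) for doc in Doc_POS_words]
vectors = [[c[word] for word in myVoc] for c in counts]

print(len(myVoc))  # 9 distinct words
```

`set().union(*Doc_POS_words)` accepts the word lists directly; each list is converted to a set internally, so there is no need to pre-convert the documents.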