How to return unique words from a text file using Python

Question

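How do I return all the unique words from a text file using Python? For example:

I am not a robot

I am a human

Should return:

I

am

not

a

robot

human

Here is what I have done so far: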

def unique_file(input_filename, output_filename):
    input_file = open(input_filename, 'r')
    file_contents = input_file.read()
    input_file.close()
    word_list = file_contents.split()

    file = open(output_filename, 'w')

    for word in word_list:
        if word not in word_list:
            file.write(str(word) + "\n")
    file.close()

The text file that Python creates has nothing in it. I'm not sure what I'm doing wrong.

Answer 1

for word in word_list:
    if word not in word_list:

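By the definition in the first line, every word is in word_list, so the condition is never true and nothing is written.

Use a set instead of that logic: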

unique_words = set(word_list)
for word in unique_words:
    file.write(str(word) + "\n")

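A set holds only unique members, which is exactly what you are trying to achieve.

Note that the order won't be preserved, but you didn't specify whether that is a requirement.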
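
For reference, here is a minimal sketch of the whole function with the set in place. Two things are assumptions added for this example and are not part of the question: the set is sorted before writing so the output is repeatable (alphabetical order, not first-seen order), and the function returns the list it wrote.

```python
def unique_file(input_filename, output_filename):
    """Write each distinct whitespace-separated word of the input file on its own line."""
    # The with statement closes the file even if reading fails.
    with open(input_filename, 'r') as input_file:
        word_list = input_file.read().split()

    # set() drops duplicates; sorted() is an added assumption that makes
    # the output order deterministic (alphabetical, not first-seen).
    unique_words = sorted(set(word_list))

    with open(output_filename, 'w') as output_file:
        for word in unique_words:
            output_file.write(word + "\n")

    # Returning the list is also an addition, handy for checking the result.
    return unique_words
```

Sorting compares code points, so capitalized words such as I come before lowercase ones.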

Answer 2

Simply iterate over the lines in the file and use a set to keep only the unique words.

from itertools import chain

def unique_words(lines):
    return set(chain(*(line.split() for line in lines if line)))

Then simply do the following to read all the unique words from a file and print them:

with open(filename, 'r') as f:
    print(unique_words(f))
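
A small variation, offered as a sketch: chain.from_iterable consumes the generator lazily instead of unpacking every line into arguments with *, which may matter for large files. io.StringIO stands in for an open file here so the snippet runs without one; the sample text is made up.

```python
import io
from itertools import chain

def unique_words(lines):
    # line.split() returns [] for blank lines, so no extra filter is needed.
    return set(chain.from_iterable(line.split() for line in lines))

# io.StringIO is a stand-in for an open text file (assumption for the demo).
sample = io.StringIO("I am not a robot\n\nI am a human\n")
words = unique_words(sample)
print(sorted(words))
```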

Answer 3

This seems to be a typical application for a collection:

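...
import collections

# OrderedDict keeps keys in first-insertion order, so duplicates collapse
# while the original word order is preserved.
d = collections.OrderedDict()
for word in wordlist:
    d[word] = None
# use this if you also want to count the words:
# for word in wordlist: d[word] = d.get(word, 0) + 1
for k in d.keys():
    print(k)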

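You could also use collections.Counter(), which also counts the elements you feed in, but on Python versions before 3.7 the order of the words is lost. I added a line for counting while keeping the order.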
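
On Python 3.7 and later a plain dict also keeps insertion order, so the same idea can be sketched without OrderedDict. The sample wordlist below is made up for illustration; in the answer above it would come from the file.

```python
from collections import Counter

# Hypothetical input; in practice wordlist would be read from the file.
wordlist = "I am not a robot I am a human".split()

# dict.fromkeys keeps the first occurrence of each key, in order (Python 3.7+).
ordered_unique = list(dict.fromkeys(wordlist))

# Counter maps each word to the number of times it occurs.
counts = Counter(wordlist)

print(ordered_unique)
print(counts["am"])
```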

Answer 4

string = "I am not a robot\n I am a human"
list_str = string.split()
print(list(set(list_str)))

Answer 5

def unique_file(input_filename, output_filename):
    input_file = open(input_filename, 'r')
    file_contents = input_file.read()
    input_file.close()
    duplicates = []
    word_list = file_contents.split()
    file = open(output_filename, 'w')
    for word in word_list:
        if word not in duplicates:
            duplicates.append(word)
            file.write(str(word) + "\n")
    file.close()

This code loops over every word, and if the word is not yet in the list duplicates, it appends the word to that list and writes it to the file.
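
One thing to keep in mind with this approach: the test word not in duplicates scans the whole list, so the cost grows with the number of distinct words. Below is a sketch of the same first-seen logic with a set doing the membership test. The function name unique_in_order is made up for this example.

```python
def unique_in_order(words):
    """Return the distinct items of words in first-seen order."""
    seen = set()   # membership checks are constant time on average
    result = []    # keeps the first-seen order
    for word in words:
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result

print(unique_in_order("I am not a robot I am a human".split()))
```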

Answer 6

Using a regular expression and a set:

import re
words = re.findall(r'\w+', text.lower())
uniq_words = set(words)

Another way is to create a dict and insert the words as keys:

dict_word = {}
for line in doc:  # doc is a sequence of text lines
    frase = line.split(" ")
    for palavra in frase:
        if palavra not in dict_word:
            dict_word[palavra] = 1
print(list(dict_word.keys()))
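
To make the difference between the regex variant and a plain split() concrete, here is a short sketch. The sample text is made up for illustration.

```python
import re

# Made-up sample text for illustration.
text = "I am not a robot.\nI am a human, not a Robot!"

# Plain split keeps punctuation and case, so 'robot.' and 'Robot!' differ.
split_words = set(text.split())

# \w+ matches runs of letters, digits and underscores; lower() folds case.
regex_words = set(re.findall(r'\w+', text.lower()))

print(sorted(split_words))
print(sorted(regex_words))
```

Note that \w+ also breaks words at apostrophes and hyphens, so a word like don't would come out as two tokens. Whether that is acceptable depends on the text.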

Answer 7

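The problem with your code is that word_list already contains every word in the input file. As you iterate over it, you are essentially checking whether a word taken from word_list is not in word_list itself, so the condition is always false. This should work (note that it also preserves the order):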

def unique_file(input_filename, output_filename):
    z = []
    with open(input_filename, 'r') as fileIn, open(output_filename, 'w') as fileOut:
        for line in fileIn:
            for word in line.split():
                if word not in z:
                    z.append(word)
                    fileOut.write(word + '\n')

Answer 8

Use a set. You don't need to import anything to do this.

# Open the file and read it; the with block closes it afterwards
with open(file_Name, 'r') as my_File:
    read_File = my_File.read()
# Split the words
words = read_File.split()
# Using a set will only save the unique words
unique_words = set(words)
# You can then print the set as a whole or loop through the set etc
for word in unique_words:
    print(word)

Answer 9

try:
    with open("gridlex.txt", mode="r", encoding="utf-8") as india:
        seen = set()
        for data in india:
            # data is one line of text; split it into words
            for word in data.split():
                if word not in seen:
                    seen.add(word)
                    print(word)
except IOError:
    print("sorry")

9 answers

Answer 1  16  2014-04-10 04:28:14
Answer 2  5  2014-04-10 04:54:10
Answer 3  2  2014-04-10 05:35:31
Answer 4  2  2017-10-13 12:07:30
Answer 5  1  2014-04-10 04:29:15
Answer 6  1  2016-10-13 23:03:30
Answer 7  0  2014-04-10 04:41:24
Answer 8  0  2017-09-05 23:13:29
Answer 9  -2  2019-05-17 16:56:01
