
Finding the number of unique words in a line of a text file

I would like to create a Python program that finds the unique words in each line of a text file.

The text file "details" has the following lines:

My name is crazyguy
i am studying in a college and i travel by car
my brother brings me food for eating and we will go for shopping after food.

It must return output as:

4
10  # (since "i" is repeated)
13  # (since "food" and "for" are repeated)

If the code works, will it work the same way for bigger text files when mining data?

with open('details.txt', 'r') as f:
    for line in f:
        print(len(set(line.split())))
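As a quick self-contained check, the sketch below (the file name and sample lines come straight from the question) writes the sample text to details.txt and runs the same loop:

lines = [
    "My name is crazyguy",
    "i am studying in a college and i travel by car",
    "my brother brings me food for eating and we will go for shopping after food.",
]
with open('details.txt', 'w') as f:
    f.write('\n'.join(lines) + '\n')

with open('details.txt') as f:
    for line in f:
        print(len(set(line.split())))

Note that this prints 4, 10, 14 rather than 13 for the last line: split() leaves the trailing period attached, so "food." and "food" count as different words. Stripping punctuation first, e.g. set(line.replace('.', '').split()), gives the expected 13.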

You could split each line into a list, turn the list into a set to keep only the unique values, and take the set's length:

with open("filename","r") as inp:
     for line in inp:
         print len(set(line.split()))
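To make each step of that pipeline concrete, here is what it does to the second sample line:

line = "i am studying in a college and i travel by car"

words = line.split()  # 11 tokens; 'i' appears twice
unique = set(words)   # duplicates collapse, leaving 10 distinct words
print(len(words), len(unique))  # 11 10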

There's a whole world of solutions that are worse than TigerhawkT3's/Vignesh Kalai's solution. For comparison:

>>> timeit.timeit("len(set(string.split()))", "string=\""+string+"\"")
9.243406057357788

is their implementation. I actually had high hopes for this one:

>>> timeit.timeit("len(set(map(hash,string.split())))", "import numpy\nstring=\""+string+"\"")
14.462514877319336

because here, the set is only built over the hashes. (And because the hashes are numbers, they don't need to be hashed themselves, or so I hoped. Type handling in set probably still kills me; otherwise, in theory, the number of hashes calculated would be the same as in the best solution, but there might have been less awkward PyObject juggling underneath. I was wrong.)
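As a sanity check on that idea: barring hash collisions, a set of hashes has the same size as a set of the strings themselves, so the approach is at least correct in principle:

words = "i am studying in a college and i travel by car".split()

# Each string must be hashed either way, so this saves no hashing work;
# it only changes what the set stores.
assert len(set(map(hash, words))) == len(set(words))  # both 10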

So I tried dealing with the hashes in numpy; first with the raw strings, for comparison:

>>> timeit.timeit("len(numpy.unique(string.split()))", "import numpy\nstring=\""+string+"\"")
33.38827204704285
>>> timeit.timeit("len(numpy.unique(map(hash,string.split())))", "import numpy\nstring=\""+string+"\"")
37.22595286369324
>>> timeit.timeit("len(numpy.unique(numpy.array(map(hash,string.split()))))", "import numpy\nstring=\""+string+"\"")
36.20353698730469
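For reference, numpy.unique sorts its input and drops duplicates, so its length gives the same count; a minimal illustration on the sample words (the sort is part of why it loses to the hash-based set here):

import numpy

words = "i am studying in a college and i travel by car".split()
print(len(numpy.unique(words)))  # 10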

Last resort: a Counter might simply circumvent the reduction step. But then again, Python strings are just PyObjects, and you don't really gain by having a dict instead of a set:

>>> timeit.timeit("max(Counter(string.split()).values())==1", "from collections import Counter\nstring=\""+string+"\"")
46.88196802139282
>>> timeit.timeit("len(Counter(string.split()))", "from collections import Counter\nstring=\""+string+"\"")
44.15947103500366
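Both Counter expressions agree with the set-based count on the sample line; the difference is only that a Counter also tracks how often each word occurs:

from collections import Counter

line = "i am studying in a college and i travel by car"
counts = Counter(line.split())

print(len(counts))                 # 10, same as len(set(line.split()))
print(max(counts.values()) == 1)   # False -- 'i' occurs twice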

By the way: half of the time of the best solution goes into splitting:

>>> timeit.timeit("string.split()", "import numpy\nstring=\""+string+"\"")
4.552565097808838

and, counter-intuitively, that time even increases if you specify that you only want to split on spaces (rather than on all typical whitespace delimiters):

>>> timeit.timeit("string.split(' ')", "import numpy\nstring=\""+string+"\"")
4.713452100753784
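For what it's worth, the two calls are not even equivalent: split() with no argument collapses runs of any whitespace, while split(' ') treats every single space as a separator and can yield empty strings:

text = "my  brother\tbrings"

print(text.split())     # ['my', 'brother', 'brings']
print(text.split(' '))  # ['my', '', 'brother\tbrings']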
