
Converting a text list (string) to a Python list

I see that this question has been asked many times on this site, but I can't find an answer that does what I need.

What I need to do is convert a very long text file (680k lines) to a list in Python. The whole text file is formatted as shown below:

libertarians
liberticidal
liberticide
liberticide's
liberticides

My end goal is to create a system where I replace words with their corresponding dictionary values, for instance dic['apple', 'pears', 'peaches', 'cats']. The code below doesn't work because the list it produces can't be used in an if word in list: statement. I tried it.

with open('thefile.txt') as f:
    thelist = f.readlines()

This is the entirety of the code, with that as the method to retrieve the list.

with open('H:/Dropbox/programming/text compression/list.txt') as f:
    thelist = f.readlines()
word = input()
if word in thelist:
    print("hu")
else:
    print("l")

Output with input 'apple': l

In short, the list could be printed, but little else.

Simplest approach:

with open('thefile.txt') as f:
    thelist = f.readlines()

680k lines means a few megabytes -- far from a MemoryError (a terror expressed in some comments!-) on any modern platform, where your available virtual memory is gigabytes (if you're running Python on a Commodore 64, that's different, but then, I'm sure you have plenty of other problems:-).

The readlines method internally does the splitting into lines that other approaches need to perform explicitly, and is thereby much preferable (and faster). And if you need the result as a list of words, there's just no way you can save any memory by a piecemeal approach anyway.
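
For concreteness, here is a tiny sketch of what readlines gives you (thefile.txt is hypothetical here, holding the three words shown in the comment). Note that each element keeps its trailing newline, which is exactly why the question's membership test fails until you strip it:

# suppose thefile.txt contains the three lines: apple / pears / peaches (hypothetical)
with open('thefile.txt') as f:
    thelist = f.readlines()
print(thelist)                                    # ['apple\n', 'pears\n', 'peaches\n']
print('apple' in thelist)                         # False: 'apple' != 'apple\n'
print('apple' in [s.rstrip() for s in thelist])   # True once stripped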

Added: for example, on my Macbook Air,

$ wc /usr/share/dict/words
235886  235886 2493109 /usr/share/dict/words

so over a third of the line count the OP mentions. Here,

>>> import sys
>>> with open('/usr/share/dict/words') as f: wds=f.readlines()
... 
>>> sys.getsizeof(wds)
2115960

So, a bit over 2MB for well over 200k words -- checks! Thus, for well over 600k words, I'd extrapolate "a bit over 6MB" -- vastly below the amount that might possibly cause a MemoryError in this "brave new world" (from the POV of old-timers like me:-) of machines with many gigabytes (even phones, nowadays...:-).
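
One caveat worth knowing: sys.getsizeof on a list counts only the list object itself (essentially its array of pointers), not the string objects it refers to. A rough sketch for counting those too:

>>> sys.getsizeof(wds) + sum(sys.getsizeof(w) for w in wds)

Even with the strings included, the total is still only on the order of tens of megabytes -- nowhere near MemoryError territory on a machine with gigabytes of RAM.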

Plus, anyway, if that list of words is to be kept as a list of words, there's no way you're going to be spending any less than these few-megabytes piddling amounts of memory! Reading the file line by line and cleverly maneuvering to keep only the subset of data you need is, ahem, "totally misplaced effort" when your goal is essentially to keep just about all the text from every single line -- in that particular case (which happens to meet this Q's ask!-), just use readlines and be done with it!-)

Added: an edit to the Q makes it clear (though it's nowhere stated in the question!) that the lines must contain some whitespace to the right of the words, so an rstrip is needed. Even so, the accepted answer is not optimal. Consider the following file i.py:

def slow():
    # the accepted answer's approach: explicit loop with append
    list_of_words = []
    for line in open('/usr/share/dict/words'):
        line = line.rstrip()
        list_of_words.append(line)
    return list_of_words

def fast():
    # list comprehension, stripping each line as it's read
    with open('/usr/share/dict/words') as f:
        wds = [s.rstrip() for s in f]
    return wds

assert slow() == fast()

where the assert at the end just verifies the fact that the two approaches produce identical results. Now, on a Macbook Air...:

$ python -mtimeit -s'import i' 'i.slow()'
10 loops, best of 3: 69.6 msec per loop
$ python -mtimeit -s'import i' 'i.fast()'
10 loops, best of 3: 50.2 msec per loop

We can see that the loop approach in the accepted answer takes almost 40% more time than the list comprehension does.
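
To reproduce the comparison from inside Python rather than the shell, a minimal sketch using the timeit module directly (exact numbers will of course vary by machine):

import timeit

# time 10 calls of each function, mirroring the shell invocations above
print(timeit.timeit('i.slow()', setup='import i', number=10))
print(timeit.timeit('i.fast()', setup='import i', number=10))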

Try it like this:

with open('file') as f:
    my_list = [x.strip() for x in f]
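
With each line stripped this way, the membership test from the question behaves as expected; for example:

word = input()
if word in my_list:
    print("hu")     # the word was found in the file
else:
    print("l")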

You can also do your work on the fly instead of storing all the lines:

with open('file') as f:
    for x in f:
        # do your stuff here on x
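
For instance, a sketch of the same membership check done on the fly; it never materializes the full list and stops scanning at the first match:

word = input()
found = False
with open('file') as f:
    for x in f:
        if x.strip() == word:
            found = True    # stop as soon as the word appears
            break
print("hu" if found else "l")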
