
Converting a text list (string) to a Python list

I see that this question has been asked many times on this site, but I can't find an answer that does what I need.

What I need to do is convert a very long text file (680k lines) into a list in Python. The whole text file is formatted as shown below:

libertarians
liberticidal
liberticide
liberticide's
liberticides

My end goal is to create a system where I replace words with their corresponding dictionary values, for instance dic['apple', 'pears', 'peaches', 'cats']. The code below doesn't work because the list it produces can't be used in an "if word in list:" statement; I tried it.

with open('thefile.txt') as f:
    thelist = f.readlines()

This is the entirety of the code, with that as the method used to retrieve the list.

with open('H:/Dropbox/programming/text compression/list.txt') as f:
    thelist = f.readlines()
word = input()
if word in thelist:
    print("hu")
else:
    print("l")

Output with input 'apple': l

In short, the list could be printed, but little else.
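The failure comes down to trailing newlines: readlines() keeps each line's '\n', so a bare word never equals any list item. A minimal sketch of the problem, using a few made-up words in place of the 680k-line file:

```python
# What readlines() actually returns: each item keeps its trailing newline.
lines = ["apple\n", "pears\n", "peaches\n"]

print("apple" in lines)     # False: the stored item is 'apple\n', not 'apple'
print("apple\n" in lines)   # True: only the newline-suffixed form matches

# Stripping the newlines makes plain words match again.
stripped = [line.rstrip("\n") for line in lines]
print("apple" in stripped)  # True
```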

Simplest approach:

with open('thefile.txt') as f:
    thelist = f.readlines()

680k lines means a few megabytes -- far from a MemoryError (a terror expressed in some comments!-) on any modern platform, where available virtual memory is gigabytes (if you're running Python on a Commodore 64, that's different, but then I'm sure you have plenty of other problems:-).

The readlines method reads the whole file in one call and is faster than looping line by line; note, though, that each item it returns keeps its trailing newline. And if you need the result as a list of words, there's just no way you can save any memory by a piecemeal approach anyway.
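An alternative that yields newline-free items in one call is read().splitlines(). A small sketch, using a temporary stand-in file rather than the OP's actual path:

```python
import os
import tempfile

# Write a tiny sample dictionary file (a stand-in for the OP's list.txt).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("libertarians\nliberticidal\nliberticide\n")
    path = tmp.name

with open(path) as f:
    words = f.read().splitlines()  # splits on newlines and drops them

print(words)                  # ['libertarians', 'liberticidal', 'liberticide']
print("liberticide" in words) # True: items carry no trailing '\n'

os.remove(path)
```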

Added: for example, on my Macbook Air,

$ wc /usr/share/dict/words
235886  235886 2493109 /usr/share/dict/words

so a bit over a third the size of the one the OP mentions. Here,

>>> with open('/usr/share/dict/words') as f: wds=f.readlines()
... 
>>> sys.getsizeof(wds)
2115960

So, a bit over 2 MB for well over 200k words -- checks! Thus, for well over 600k words, I'd extrapolate "a bit over 6 MB" -- vastly below the amount that might possibly cause a MemoryError in this "brave new world" (from the POV of old-timers like me:-) of machines with many gigabytes (even phones, nowadays...:-).

Plus, if that list of words is to be kept as a list of words, there's no way you're going to spend less than these few megabytes of memory anyway! Reading the file line by line and cleverly maneuvering to keep only the subset of data you need, from the subset of lines you need it from, is, ahem, "totally misplaced effort" when your goal is essentially to keep just about all the text from every single line -- in that particular case (which happens to meet this Q's ask!-), just use readlines and be done with it!-)

Added: an edit to the Q makes it clear (though it's nowhere stated in the question itself!) that the lines contain some whitespace to the right of the words, so an rstrip is needed. Even so, the accepted answer is not optimal. Consider the following file i.py:

def slow():
    list_of_words = []
    for line in open('/usr/share/dict/words'):
        line = line.rstrip()
        list_of_words.append(line)
    return list_of_words

def fast():
    with open('/usr/share/dict/words') as f:
        wds = [s.rstrip() for s in f] 
    return wds

assert slow() == fast()

where the assert at the end just verifies that the two approaches produce identical results. Now, on a Macbook Air...:

$ python -mtimeit -s'import i' 'i.slow()'
10 loops, best of 3: 69.6 msec per loop
$ python -mtimeit -s'import i' 'i.fast()'
10 loops, best of 3: 50.2 msec per loop

we can see that the loop approach in the accepted answer takes almost 40% more time than the list comprehension.
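The same comparison can also be scripted with the timeit module instead of the command line. A self-contained sketch that benchmarks both functions against a small synthetic word file (the file name and word count here are illustrative, not the real dictionary file):

```python
import os
import tempfile
import timeit

# Build a small synthetic word file so the benchmark is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.writelines(f"word{i}  \n" for i in range(10_000))
    path = tmp.name

def slow():
    # Explicit loop, as in the accepted answer.
    list_of_words = []
    for line in open(path):
        list_of_words.append(line.rstrip())
    return list_of_words

def fast():
    # List comprehension over the file object.
    with open(path) as f:
        return [s.rstrip() for s in f]

assert slow() == fast()  # identical results, as with the assert above

t_slow = timeit.timeit(slow, number=20)
t_fast = timeit.timeit(fast, number=20)
print(f"loop: {t_slow:.3f}s  comprehension: {t_fast:.3f}s")

os.remove(path)
```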

Try like this:

with open('file') as f:
    my_list = [x.strip() for x in f]

You can also do your work on the fly instead of storing all the lines:

with open('file') as f:
    for x in f:
        # do your stuff here on x
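Since the OP's goal is repeated "if word in ...:" lookups, a set is worth considering as the container: membership tests are O(1) instead of a linear scan over 680k items. A sketch, with sample lines standing in for the real file's contents:

```python
# Stand-in for the file's lines; real code would iterate over
# open('thefile.txt') as in the snippets above.
lines = ["libertarians\n", "liberticide\n", "liberticides\n"]

# Set comprehension: strips whitespace and gives O(1) membership tests.
words = {line.strip() for line in lines}

print("liberticide" in words)  # True
print("apple" in words)        # False
```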
