
Python: Efficient way of parsing a string to a list of floats from a file

This document has a word and tens of thousands of floats per line, and I want to transform it into a dictionary with the word as key and a vector of all the floats as value. This is how I am doing it, but due to the size of the file (about 20k lines, each with about 10k values) the process is taking too long. I could not find a more efficient way of doing the parsing, only alternative approaches that were not guaranteed to decrease the run time.

with open("googlenews.word2vec.300d.txt") as g_file:
  i = 0;
  #dict of words: [lots of floats]
  google_words = {}

  for line in g_file:
    google_words[line.split()[0]] = [float(line.split()[i]) for i in range(1, len(line.split()))]

In your solution you perform the slow line.split() repeatedly: once for the key, once for len(), and once more for every single float in the line. Consider the following modification:

with open("googlenews.word2vec.300d.txt") as g_file:
    # dict of words: [lots of floats]
    google_words = {}

    for line in g_file:
        word, *numbers = line.split()
        google_words[word] = [float(number) for number in numbers]

One advanced concept I used here is "unpacking": word, *numbers = line.split()

Python allows unpacking an iterable's values into multiple variables:

a, b, c = [1, 2, 3]
# This is practically equivalent to
a = 1
b = 2
c = 3

The * is a shortcut for "take the leftovers, put them in a list, and assign that list to the name":

a, *rest = [1, 2, 3, 4]
# results in
a == 1
rest == [2, 3, 4]
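The starred name does not have to come last; Python allows it anywhere in the target list (a small extra example, not from the original answer):

first, *middle, last = [1, 2, 3, 4, 5]
# results in
first == 1
middle == [2, 3, 4]
last == 5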

Just don't call line.split() more than once.

with open("googlenews.word2vec.300d.txt") as g_file:
    # dict of words: [lots of floats]
    google_words = {}

    for line in g_file:
        temp = line.split()
        google_words[temp[0]] = [float(temp[i]) for i in range(1, len(temp))]
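A slightly more idiomatic variant of the last line uses a slice instead of indexing over a range (a minor stylistic alternative, not from the original answer):

        google_words[temp[0]] = [float(value) for value in temp[1:]]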

Here's a simple generator of such a file:

s = "x"
for i in range (10000):
    s += " 1.2345"
print (s)

On a line like this, the version with repeated line.split() calls takes noticeable time; the version with only one split() call is near-instant.
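To reproduce the comparison yourself, here is a minimal timing sketch; the file name, line count, and vector size are made up for illustration:

import time

# generate a small test file: 1,000 lines, each a word followed by 1,000 floats
with open("test_vectors.txt", "w") as f:
    for i in range(1000):
        f.write("word%d" % i + " 1.2345" * 1000 + "\n")

def parse_repeated_split(path):
    # original approach: line.split() is re-evaluated for every value
    d = {}
    with open(path) as f:
        for line in f:
            d[line.split()[0]] = [float(line.split()[i])
                                  for i in range(1, len(line.split()))]
    return d

def parse_single_split(path):
    # improved approach: one split() per line, then unpacking
    d = {}
    with open(path) as f:
        for line in f:
            word, *numbers = line.split()
            d[word] = [float(n) for n in numbers]
    return d

for fn in (parse_repeated_split, parse_single_split):
    start = time.perf_counter()
    fn("test_vectors.txt")
    print(fn.__name__, time.perf_counter() - start)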

You could also use the csv module, which should be more efficient than what you are doing.

It would be something like:

import csv

d = {}
with open("huge_file_so_huge.txt", "r") as g_file:
    for row in csv.reader(g_file, delimiter=" "):
        d[row[0]] = list(map(float, row[1:]))
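One caveat worth noting: csv.reader with delimiter=" " splits on every single space character, so this assumes the values are separated by exactly one space (the usual layout of word2vec text files); consecutive spaces would produce empty fields.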
