
How to append a Python list to a NumPy matrix in the fastest way?

I am writing code to read research data that can have up to a billion lines. I have to read the data line by line because it consists of multiple blocks. Each block has a header that differs from the headers of the other blocks and from the datasets. I want to read these datasets into a NumPy matrix so that I can perform matrix operations. Here is the essential code.

    with open(datafile, "r") as dump:
        i = 0           # block line number
        line_no = 0     # total line number
        block_size = 0
        block_count = 0
        for line in dump:
            values = line.rstrip().rsplit()
            i += 1
            line_no += 1
            if i <= self.head_line_no:
                print(line)  # for test
                if self.tag_block in line or i == 1:      # 1st line of a block
                    # save block size after reading 1st block
                    if block_size == 0 and block_count == 0:
                        block_size = line_no - 1
                        i = 1               # reset block line number
                        self.box = []       # reset box constant
                        print(self.matrix)
                        self.matrix = np.zeros((0, 0), dtype="float")   # reset matrix

                    block_count += 1
                elif i == 2:
                    self.timestamp.append(values[0])
                elif i == 3 or i == 5:
                    continue
                elif i == 4:
                    if self.atom_no != 0 and self.atom_no != values[0]:
                        self.warning_message = "atom number in timestep " + self.timestamp[-1] + " is inconsistent with " + self.timestamp[-2]
                        config.ConfigureUserEnv.log(self.warning_message)
                    else:
                        pass
                    self.atom_no = values[0]
                elif i == 6 or i == 7 or i == 8:
                    self.box.append(values[0])
                    self.box.append(values[1])
                elif i == self.head_line_no:
                    values = line.rstrip().rsplit(":")
                    for j in range(1,len(values)):
                        self.column_name.append(values[j])
            else:
                if self.matrix.size != 0:
                    np_array = np.array(values)
                    self.matrix = np.append(self.matrix, np.array(np.asarray(values)), 0)     
                else:
                    np_array = np.array(values)
                    self.matrix = np.zeros((1,len(values)), dtype="float")
                    self.matrix = np.asarray(values)
        dump.close()
        print(self.matrix)       # for test
        print(self.matrix.size)  # for test

The original data looks like this:

ITEM: TIMESTEP
100
ITEM: NUMBER OF ATOMS
17587
ITEM: BOX BOUNDS pp pp pp
0.0000000000000000e+00 4.3491000000000000e+01
0.0000000000000000e+00 4.3491000000000000e+01
0.0000000000000000e+00 1.2994000000000000e+02
ITEM: ATOMS id type q xs ys zs 
59 1 1.80278 0.110598 0.129682 0.0359397 
297 1 1.14132 0.139569 0.0496654 0.00692627 
315 1 1.17041 0.0832356 0.00620818 0.00507927 
509 1 1.67165 0.0420777 0.113817 0.0313991 
590 1 1.65209 0.114966 0.0630015 0.0447129 
731 1 1.65143 0.0501253 0.13658 0.0108512 
1333 2 1.049 0.00850751 0.0526546 0.0406341 
...... 

I want to end up with a matrix like this:

matrix = [[59 1 1.80278 0.110598 0.129682 0.0359397],
[297 1 1.14132 0.139569 0.0496654 0.00692627],
[315 1 1.17041 0.0832356 0.00620818 0.00507927],
...]

As mentioned above, the datasets are very large. I would like to use the fastest way to append an array to the matrix. Any further help and advice would be highly appreciated.

Here are some important points to speed up the computation:

  • Do not use self.matrix = np.append(self.matrix, ...) in a loop. It is inefficient because each iteration creates a new, larger array and copies the old one into it, which results in a quadratic run time. Instead, append the rows to a pure-Python list and convert the list to a NumPy array once at the end. This is the most critical point performance-wise.
  • Using self.box.extend((values[0], values[1])) should be significantly faster than performing two append calls.
  • Using dtype="float" is neither very clear nor very efficient. Consider using dtype=np.float64 instead (which does not need to be parsed by NumPy).
  • Using enumerate may be a bit faster than a manual increment in the loop.
  • Cython may help you speed up this program if it is still not fast enough for your input file. Keep in mind that the standard Python interpreter (CPython) is not very fast at parsing complex, huge files compared to compiled native programs or modules written in languages like C or C++.
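The first point can be sketched as follows. This is only a minimal illustration, not a drop-in replacement for the code in the question: the function name read_blocks, the 9-line header length, and the "ITEM: TIMESTEP" block tag are assumptions taken from the sample data shown above, and the header fields are simply skipped here. Rows are collected in a plain Python list and converted to a NumPy array once per block.

```python
import io
import numpy as np

HEAD_LINE_NO = 9              # assumed header length, as in the sample data
TAG_BLOCK = "ITEM: TIMESTEP"  # assumed marker for the first line of a block

def read_blocks(lines, head_line_no=HEAD_LINE_NO, tag_block=TAG_BLOCK):
    """Return one float64 matrix per block (hypothetical helper)."""
    blocks = []
    rows = []   # plain Python list: appending is amortized O(1)
    i = 0       # line number inside the current block
    for line in lines:
        if tag_block in line:
            # A new block starts: convert the rows collected so far, once.
            if rows:
                blocks.append(np.array(rows, dtype=np.float64))
            rows = []
            i = 0
        i += 1
        if i <= head_line_no:
            continue  # header handling is omitted in this sketch
        values = line.split()
        if values:
            rows.append([float(v) for v in values])
    if rows:
        blocks.append(np.array(rows, dtype=np.float64))
    return blocks

SAMPLE = """ITEM: TIMESTEP
100
ITEM: NUMBER OF ATOMS
2
ITEM: BOX BOUNDS pp pp pp
0.0 43.491
0.0 43.491
0.0 129.94
ITEM: ATOMS id type q xs ys zs
59 1 1.80278 0.110598 0.129682 0.0359397
297 1 1.14132 0.139569 0.0496654 0.00692627
ITEM: TIMESTEP
200
ITEM: NUMBER OF ATOMS
2
ITEM: BOX BOUNDS pp pp pp
0.0 43.491
0.0 43.491
0.0 129.94
ITEM: ATOMS id type q xs ys zs
59 1 1.80278 0.110598 0.129682 0.0359397
315 1 1.17041 0.0832356 0.00620818 0.00507927
"""

blocks = read_blocks(io.StringIO(SAMPLE))
print(len(blocks), blocks[0].shape, blocks[0].dtype)
```

With a real file, the same function could be called as read_blocks(dump) inside the with open(...) block, since a file object yields its lines one at a time.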

Note that values[i] are strings, and so are the elements of self.timestamp and self.box. Aren't they supposed to be integers/floats?
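If numeric values are indeed what you want, the conversion could be done while the header is parsed. Here is a rough sketch, assuming the 9-line header layout of the sample data; parse_header and its return values are hypothetical names, and splitting the column names on whitespace is my assumption about what the last header line should yield.

```python
def parse_header(header_lines):
    """Convert the header fields of one block to numbers (hypothetical helper)."""
    timestamp = int(header_lines[1].split()[0])   # line 2: timestep
    atom_no = int(header_lines[3].split()[0])     # line 4: number of atoms
    box = []
    for line in header_lines[5:8]:                # lines 6-8: box bounds
        lo, hi = line.split()[:2]
        box.extend((float(lo), float(hi)))        # one extend instead of two appends
    # line 9: drop the leading "ITEM:" and "ATOMS" tokens, keep the column names
    column_names = header_lines[8].split()[2:]
    return timestamp, atom_no, box, column_names

HEADER = [
    "ITEM: TIMESTEP",
    "100",
    "ITEM: NUMBER OF ATOMS",
    "17587",
    "ITEM: BOX BOUNDS pp pp pp",
    "0.0000000000000000e+00 4.3491000000000000e+01",
    "0.0000000000000000e+00 4.3491000000000000e+01",
    "0.0000000000000000e+00 1.2994000000000000e+02",
    "ITEM: ATOMS id type q xs ys zs ",
]

timestamp, atom_no, box, column_names = parse_header(HEADER)
print(timestamp, atom_no, box, column_names)
```

Comparing integers instead of strings also makes the atom-number consistency check in the question less fragile.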
