How to append a Python list to a NumPy matrix in the fastest way?
I am writing code to read research data files that can contain up to a billion lines. I have to read the data line by line because it consists of multiple blocks. Each block has header lines that differ from those of other blocks and from the data rows. I want to read those datasets into a NumPy matrix so I can perform matrix operations. Here is the essential code.
with open(datafile, "r") as dump:
    i = 0  # block line number
    line_no = 0  # total line number
    block_size = 0
    block_count = 0
    for line in dump:
        values = line.rstrip().rsplit()
        i += 1
        line_no += 1
        if i <= self.head_line_no:
            print(line)  # for test
        if self.tag_block in line or i == 1:  # 1st line of a block
            # save block size after reading 1st block
            if block_size == 0 and block_count == 0:
                block_size = line_no - 1
            i = 1  # reset block line number
            self.box = []  # reset box constant
            print(self.matrix)
            self.matrix = np.zeros((0, 0), dtype="float")  # reset matrix
            block_count += 1
        elif i == 2:
            self.timestamp.append(values[0])
        elif i == 3 or i == 5:
            continue
        elif i == 4:
            if self.atom_no != 0 and self.atom_no != values[0]:
                self.warning_message = "atom number in timestep " + self.timestamp[-1] + "is inconsistent with" + self.timestamp[-2]
                config.ConfigureUserEnv.log(self.warning_message)
            else:
                pass
            self.atom_no = values[0]
        elif i == 6 or i == 7 or i == 8:
            self.box.append(values[0])
            self.box.append(values[1])
        elif i == self.head_line_no:
            values = line.rstrip().rsplit(":")
            for j in range(1, len(values)):
                self.column_name.append(values[j])
        else:
            if self.matrix.size != 0:
                np_array = np.array(values)
                self.matrix = np.append(self.matrix, np.array(np.asarray(values)), 0)
            else:
                np_array = np.array(values)
                self.matrix = np.zeros((1, len(values)), dtype="float")
                self.matrix = np.asarray(values)
dump.close()
print(self.matrix)  # for test
print(self.matrix.size)  # for test
The original data looks like this:
ITEM: TIMESTEP
100
ITEM: NUMBER OF ATOMS
17587
ITEM: BOX BOUNDS pp pp pp
0.0000000000000000e+00 4.3491000000000000e+01
0.0000000000000000e+00 4.3491000000000000e+01
0.0000000000000000e+00 1.2994000000000000e+02
ITEM: ATOMS id type q xs ys zs
59 1 1.80278 0.110598 0.129682 0.0359397
297 1 1.14132 0.139569 0.0496654 0.00692627
315 1 1.17041 0.0832356 0.00620818 0.00507927
509 1 1.67165 0.0420777 0.113817 0.0313991
590 1 1.65209 0.114966 0.0630015 0.0447129
731 1 1.65143 0.0501253 0.13658 0.0108512
1333 2 1.049 0.00850751 0.0526546 0.0406341
......
I would like the matrix to hold the data like this:
matrix = [[59 1 1.80278 0.110598 0.129682 0.0359397],
[297 1 1.14132 0.139569 0.0496654 0.00692627],
[315 1 1.17041 0.0832356 0.00620818 0.00507927],
...]
As mentioned above, the datasets are very large, so I want the fastest way to append each row to the matrix. Any further help and advice would be highly appreciated.
Here are some important points for speeding up the computation:
Do not use self.matrix = np.append(self.matrix, ...) in a loop. This is inefficient because each iteration creates a new, larger array and copies the old one into it, which results in quadratic run time. Use a pure-Python list with append instead and convert the list to a NumPy array at the end. This is the most critical point performance-wise.
Using self.box.extend((values[0], values[1])) should be significantly faster than performing two append calls.
dtype="float" is neither very clear nor very efficient; consider using dtype=np.float64 instead (which does not need to be parsed by NumPy).
Using enumerate may be a bit faster than incrementing a counter manually in the loop.
Note that values[i] are strings, and so are the items stored in self.timestamp and self.box. Aren't they supposed to be integers/floats?
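As a rough illustration of the main advice above (collect rows in a plain Python list, convert to a NumPy array once per block, and parse values as numbers), here is a minimal sketch. It is not the asker's class: the names read_blocks, finish_block, HEAD_LINE_NO and SAMPLE are hypothetical, the sketch assumes every block has exactly 9 header lines laid out as in the sample data, and the second block in SAMPLE is made up for the demonstration.

```python
import numpy as np

# Assumption: every block starts with 9 header lines, as in the sample data.
HEAD_LINE_NO = 9


def finish_block(header, rows):
    """Turn the collected header lines and data rows into typed values."""
    timestep = int(header[1])
    # Header lines 6-8 hold the box bounds, two floats per line.
    box = [float(v) for bounds in header[5:8] for v in bounds.split()]
    # "ITEM: ATOMS id type q ..." -> ["id", "type", "q", ...]
    columns = header[8].split(":", 1)[1].split()[1:]
    # Single conversion per block instead of np.append on every row.
    matrix = np.array(rows, dtype=np.float64)
    return timestep, box, columns, matrix


def read_blocks(lines, tag_block="ITEM: TIMESTEP", head_line_no=HEAD_LINE_NO):
    """Yield (timestep, box, columns, matrix) for each block in `lines`."""
    header, rows = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(tag_block) and header:
            # A new block begins: emit the previous one and start over.
            yield finish_block(header, rows)
            header, rows = [], []
        if len(header) < head_line_no:
            header.append(line)
        else:
            # list.append is amortized O(1); no array is copied here.
            rows.append([float(v) for v in line.split()])
    if header:
        yield finish_block(header, rows)


# Shortened sample; the second block (timestep 200) is invented for the demo.
SAMPLE = """\
ITEM: TIMESTEP
100
ITEM: NUMBER OF ATOMS
3
ITEM: BOX BOUNDS pp pp pp
0.0000000000000000e+00 4.3491000000000000e+01
0.0000000000000000e+00 4.3491000000000000e+01
0.0000000000000000e+00 1.2994000000000000e+02
ITEM: ATOMS id type q xs ys zs
59 1 1.80278 0.110598 0.129682 0.0359397
297 1 1.14132 0.139569 0.0496654 0.00692627
315 1 1.17041 0.0832356 0.00620818 0.00507927
ITEM: TIMESTEP
200
ITEM: NUMBER OF ATOMS
2
ITEM: BOX BOUNDS pp pp pp
0.0000000000000000e+00 4.3491000000000000e+01
0.0000000000000000e+00 4.3491000000000000e+01
0.0000000000000000e+00 1.2994000000000000e+02
ITEM: ATOMS id type q xs ys zs
509 1 1.67165 0.0420777 0.113817 0.0313991
590 1 1.65209 0.114966 0.0630015 0.0447129
"""

blocks = list(read_blocks(SAMPLE.splitlines()))
for timestep, box, columns, matrix in blocks:
    print(timestep, columns, matrix.shape)
```

For a real file, the same generator could be fed the open file object directly (for example, read_blocks(dump) inside the with block), so that only one block's rows are held in a Python list at a time.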