How to properly store a huge amount of data in an ordered way?

I have a huge amount of data that I need to organize. I think one good solution might be to use a pandas DataFrame and then pickle it to save it. The only problem is that I still don't understand how pandas works, and I don't know which approach suits my data.

Data: a list of the form ((int, int), float, float), but the first tuple of integers can vary in size.

Example 1:

[((78, 104), 1.55, 0.25),
((78, 104), 1.56, 0.25),
((78, 104), 1.57, 0.25),
((78, 104), 1.58, 0.25),
((75, 100), 5.02, 0.25),
((75, 100), 5.03, 0.25),
((75, 100), 5.04, 0.25),
((75, 100), 5.05, 0.25),
((78, 104), 1.25, 0.333),
((78, 104), 1.26, 0.333)]

Example 2:

[((20, 78, 104), 1.55, 0.25),
((20, 78, 104), 1.56, 0.25),
((21, 78, 104), 1.57, 0.25),
((21, 78, 104), 1.58, 0.25),
((18, 75, 100), 5.02, 0.25),
((18, 75, 100), 5.03, 0.25),
((18, 75, 100), 5.04, 0.25),
((18, 75, 100), 5.05, 0.25),
((20, 78, 104), 1.25, 0.333),
((20, 78, 104), 1.26, 0.333)]

These are just extracts. At the moment I write the data to .txt files and parse it back with string methods when I read a file. As you can see, a tuple can be shared by many values (hundreds or thousands) and has a length of 2 or more.
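As an alternative to hand-written string parsing, the text format can be read back with `ast.literal_eval` from the standard library. This is only a sketch: it assumes each line of the .txt file holds one record in Python literal syntax, as in the extracts above, and the helper name `parse_records` is made up for illustration.

```python
import ast

def parse_records(lines):
    """Yield one (tuple, float, float) record per non-empty text line."""
    for line in lines:
        # Drop surrounding whitespace, the trailing comma, and the list
        # brackets that appear on the first and last line of the extract.
        text = line.strip().rstrip(",").strip("[]")
        if text:
            yield ast.literal_eval(text)

# Sample text in the assumed format (values taken from the extracts).
sample = """[((78, 104), 1.55, 0.25),
((75, 100), 5.02, 0.25),
((20, 78, 104), 1.25, 0.333)]"""

records = list(parse_records(sample.splitlines()))
```

Because `parse_records` is a generator, it can be fed an open file object directly, so the whole file never has to be held in memory as text.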

Another way to represent this data is a dictionary of the form Dict[tuple] = ([list of column 1], [list of column 2]):

# Group the two float columns by their tuple key.
data_dict = {}
for key, col1, col2 in data_list:
    if key not in data_dict:
        data_dict[key] = ([col1], [col2])
    else:
        data_dict[key][0].append(col1)
        data_dict[key][1].append(col2)

Example 1:

data_list = [((78, 104), 1.55, 0.25), ((78, 104), 1.56, 0.25), ((78, 104), 1.57, 0.25), ((78, 104), 1.58, 0.25),
((75, 100), 5.02, 0.25), ((75, 100), 5.03, 0.25), ((75, 100), 5.04, 0.25), ((75, 100), 5.05, 0.25), ((78, 104), 1.25, 0.333),
((78, 104), 1.26, 0.333)]

Output:
{(78, 104): ([1.55, 1.56, 1.57, 1.58, 1.25, 1.26], [0.25, 0.25, 0.25, 0.25, 0.333, 0.333]), 
(75, 100): ([5.02, 5.03, 5.04, 5.05], [0.25, 0.25, 0.25, 0.25])}

Example 2:

data_list = [((20, 78, 104), 1.55, 0.25), ((20, 78, 104), 1.56, 0.25), ((21, 78, 104), 1.57, 0.25), ((21, 78, 104), 1.58, 0.25),
((18, 75, 100), 5.02, 0.25), ((18, 75, 100), 5.03, 0.25), ((18, 75, 100), 5.04, 0.25), ((18, 75, 100), 5.05, 0.25), ((20, 78, 104), 1.25, 0.333),
((20, 78, 104), 1.26, 0.333)]

Output:
{(20, 78, 104): ([1.55, 1.56, 1.25, 1.26], [0.25, 0.25, 0.333, 0.333]), 
(21, 78, 104): ([1.57, 1.58], [0.25, 0.25]), 
(18, 75, 100): ([5.02, 5.03, 5.04, 5.05], [0.25, 0.25, 0.25, 0.25])}

What are the possible ways (and which is the best one) to store this data in a file and to access it efficiently?
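One possible option, sketched here under assumptions, is to pickle the dictionary representation directly with the standard library. Tuple keys survive the round trip unchanged, so no parsing is needed on load. The file name `data_dict.pkl` and the temporary directory are placeholders, and the data is a shortened version of Example 1.

```python
import os
import pickle
import tempfile

# Shortened version of the dictionary from Example 1.
data_dict = {
    (78, 104): ([1.55, 1.56], [0.25, 0.25]),
    (75, 100): ([5.02], [0.25]),
}

# Hypothetical location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "data_dict.pkl")

# Write the whole dictionary in binary form.
with open(path, "wb") as fh:
    pickle.dump(data_dict, fh, protocol=pickle.HIGHEST_PROTOCOL)

# Read it back; the keys are real tuples again.
with open(path, "rb") as fh:
    restored = pickle.load(fh)

values_for_key = restored[(78, 104)]
```

Note that pickle loads the entire object at once and should only be used on files you trust, so this fits best when each file's dictionary fits in memory.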

A combination is: ((78, 104), 1.57, 0.25). For the access part, I will, for instance, need to look in file 1 for combinations using the tuple (78, 104) and then in file 2 for combinations using the tuple (20, 78). The final goal is to find every matching combination, i.e. the 2 values after (78, 104) in file 1 must be the same as the 2 values after (20, 78) in file 2. Thus I need quick access to the relevant combinations in each file.
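With the dictionary representation, the matching step described here could look roughly like the sketch below. The function name `matching_pairs` and the contents of `file_2` are invented for illustration, and the comparison uses exact float equality, which assumes both files store the values with identical decimal text.

```python
def matching_pairs(dict_1, key_1, dict_2, key_2):
    """Return the (value1, value2) pairs found under both keys."""
    col1_a, col2_a = dict_1[key_1]
    col1_b, col2_b = dict_2[key_2]
    # A set gives constant-time membership tests for the second file.
    pairs_b = set(zip(col1_b, col2_b))
    return [pair for pair in zip(col1_a, col2_a) if pair in pairs_b]

# Illustrative data only: file_2 is not taken from the question.
file_1 = {(78, 104): ([1.55, 1.56, 1.25], [0.25, 0.25, 0.333])}
file_2 = {(20, 78): ([1.56, 1.25, 9.99], [0.25, 0.333, 0.25])}

common = matching_pairs(file_1, (78, 104), file_2, (20, 78))
```

The dictionary lookup by tuple is a hash lookup, so finding the relevant combinations in each file does not require scanning every record.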

Thanks for any advice, code, and help on how to represent and store this data.

If you ask for it, I can post the code that stores and reads the .txt files, but I think we will all agree that this is not the best approach for this problem.

import pandas as pd
ex_first = [((78, 104), 1.55, 0.25),
    ((78, 104), 1.56, 0.25),
    ((78, 104), 1.57, 0.25),
    ((78, 104), 1.58, 0.25),
    ((75, 100), 5.02, 0.25),
    ((75, 100), 5.03, 0.25),
    ((75, 100), 5.04, 0.25),
    ((75, 100), 5.05, 0.25),
    ((78, 104), 1.25, 0.333),
    ((78, 104), 1.26, 0.333)]

data = pd.DataFrame(ex_first)
data.columns = ["tuple1", "name1", "name2"]
# saving
data.to_csv("DataFrame.csv")

To retrieve the rows whose tuple is (78, 104), filter the DataFrame:

In [67]: results = data[data.tuple1==(78, 104)]

In [68]: results
Out[68]: 
      tuple1  name1  name2
0  (78, 104)   1.55  0.250
1  (78, 104)   1.56  0.250
2  (78, 104)   1.57  0.250
3  (78, 104)   1.58  0.250
8  (78, 104)   1.25  0.333
9  (78, 104)   1.26  0.333

If the CSV file is big, read up on how to load it in chunks.
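A rough sketch of chunked reading follows, using the `chunksize` parameter of `pd.read_csv`. Two assumptions are worth flagging: CSV stores the tuple column as text, so a converter turns it back into tuples on load, and the in-memory `csv_text` stands in for a large file written without the index column. The filter uses an element-wise `map` rather than `==`, since comparing a column against a tuple may be treated as a list-like comparison in some pandas versions.

```python
import ast
import io

import pandas as pd

# Stand-in for a large CSV file; tuples are stored as quoted text.
csv_text = io.StringIO(
    'tuple1,name1,name2\n'
    '"(78, 104)",1.55,0.25\n'
    '"(75, 100)",5.02,0.25\n'
    '"(78, 104)",1.25,0.333\n'
)

wanted = (78, 104)
matches = []

# With chunksize set, read_csv yields DataFrames of at most that many rows.
reader = pd.read_csv(
    csv_text,
    chunksize=2,
    converters={"tuple1": ast.literal_eval},  # text back to tuple
)
for chunk in reader:
    mask = chunk["tuple1"].map(lambda t: t == wanted)
    matches.append(chunk[mask])

result = pd.concat(matches, ignore_index=True)
```

Only the matching rows of each chunk are kept, so peak memory is bounded by the chunk size plus the accumulated matches.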


 