How do I properly store a large amount of data in an ordered way?

Question

I have a huge amount of data that I need to organize. I think one good solution might be to use a pandas DataFrame and then pickle it to save it. The only problem is that I still don't understand how pandas works, and I don't know which approach suits my data.

Data: a list of items of the form ((int, int), float, float), but the leading tuple of integers can vary in length.

Example 1:

[((78, 104), 1.55, 0.25),
((78, 104), 1.56, 0.25),
((78, 104), 1.57, 0.25),
((78, 104), 1.58, 0.25),
((75, 100), 5.02, 0.25),
((75, 100), 5.03, 0.25),
((75, 100), 5.04, 0.25),
((75, 100), 5.05, 0.25),
((78, 104), 1.25, 0.333),
((78, 104), 1.26, 0.333)]

Example 2:

[((20, 78, 104), 1.55, 0.25),
((20, 78, 104), 1.56, 0.25),
((21, 78, 104), 1.57, 0.25),
((21, 78, 104), 1.58, 0.25),
((18, 75, 100), 5.02, 0.25),
((18, 75, 100), 5.03, 0.25),
((18, 75, 100), 5.04, 0.25),
((18, 75, 100), 5.05, 0.25),
((20, 78, 104), 1.25, 0.333),
((20, 78, 104), 1.26, 0.333)]

These are just extracts. At the moment I write the data to .txt files and parse it back with string methods when I read a file. As you can see, one tuple can be shared by many rows (hundreds or thousands) and has a length of 2 or more.

Another way to represent this data is a dictionary of the form Dict[tuple] = ([list of column 1], [list of column 2]):

# Group the rows by their tuple key: key -> (column 1 values, column 2 values)
data_dict = {}
for key, col1, col2 in data_list:
    if key not in data_dict:
        data_dict[key] = ([col1], [col2])
    else:
        data_dict[key][0].append(col1)
        data_dict[key][1].append(col2)

Example 1:

data_list = [((78, 104), 1.55, 0.25), ((78, 104), 1.56, 0.25), ((78, 104), 1.57, 0.25), ((78, 104), 1.58, 0.25),
((75, 100), 5.02, 0.25), ((75, 100), 5.03, 0.25), ((75, 100), 5.04, 0.25), ((75, 100), 5.05, 0.25), ((78, 104), 1.25, 0.333),
((78, 104), 1.26, 0.333)]

Output:
{(78, 104): ([1.55, 1.56, 1.57, 1.58, 1.25, 1.26], [0.25, 0.25, 0.25, 0.25, 0.333, 0.333]), 
(75, 100): ([5.02, 5.03, 5.04, 5.05], [0.25, 0.25, 0.25, 0.25])}

Example 2:

data_list = [((20, 78, 104), 1.55, 0.25), ((20, 78, 104), 1.56, 0.25), ((21, 78, 104), 1.57, 0.25), ((21, 78, 104), 1.58, 0.25),
((18, 75, 100), 5.02, 0.25), ((18, 75, 100), 5.03, 0.25), ((18, 75, 100), 5.04, 0.25), ((18, 75, 100), 5.05, 0.25), ((20, 78, 104), 1.25, 0.333),
((20, 78, 104), 1.26, 0.333)]

Output:
{(20, 78, 104): ([1.55, 1.56, 1.25, 1.26], [0.25, 0.25, 0.333, 0.333]), 
(21, 78, 104): ([1.57, 1.58], [0.25, 0.25]), 
(18, 75, 100): ([5.02, 5.03, 5.04, 5.05], [0.25, 0.25, 0.25, 0.25])}
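Since pickling is mentioned above as a storage option, here is a minimal sketch of grouping the rows and round-tripping the resulting dictionary through pickle with only the standard library. The function names (group_by_key, save_grouped, load_grouped) are made up for illustration, and this is one possible approach, not necessarily the best one for very large files, because the whole dictionary has to fit in memory.

```python
import pickle
from collections import defaultdict


def group_by_key(rows):
    """Group (key, a, b) rows into {key: ([a, ...], [b, ...])}."""
    grouped = defaultdict(lambda: ([], []))
    for key, col1, col2 in rows:
        grouped[key][0].append(col1)
        grouped[key][1].append(col2)
    # Return a plain dict: a defaultdict built with a lambda cannot be pickled.
    return dict(grouped)


def save_grouped(grouped, path):
    """Write the grouped dictionary to a binary pickle file."""
    with open(path, "wb") as fh:
        pickle.dump(grouped, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_grouped(path):
    """Read a grouped dictionary back; tuple keys and floats keep their types."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

Unlike a text file, the pickle keeps the tuple keys as real tuples, so no string parsing is needed after loading. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.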

What are the possible ways (and which is the best one) to store this data in a file and access it efficiently?

A combination is: ((78, 104), 1.57, 0.25). For the access part, I will, for instance, need to look in file 1 for combinations using the tuple (78, 104) and then in file 2 for combinations using the tuple (20, 78). The final goal is to find every matching combination, i.e. the 2 values after (78, 104) in file 1 must be the same as the 2 values after (20, 78) in file 2. Thus I need quick access to the relevant combinations in each file.
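The matching step described here could be sketched as follows, assuming each file has already been loaded into the dictionary form shown earlier. The function name matching_combinations and the contents of file2 are hypothetical (the post gives no sample data for the tuple (20, 78)), and values are compared with exact float equality, which is only reasonable if both files were parsed from the same decimal text.

```python
def matching_combinations(grouped1, key1, grouped2, key2):
    """Return the (value1, value2) pairs stored under key1 in grouped1
    that also appear under key2 in grouped2, in grouped1's order."""
    col1_a, col2_a = grouped1.get(key1, ([], []))
    col1_b, col2_b = grouped2.get(key2, ([], []))
    pairs_b = set(zip(col1_b, col2_b))  # set gives constant-time membership tests
    return [pair for pair in zip(col1_a, col2_a) if pair in pairs_b]


# Hypothetical sample data standing in for "file 1" and "file 2".
file1 = {(78, 104): ([1.55, 1.56, 1.25], [0.25, 0.25, 0.333])}
file2 = {(20, 78): ([1.56, 1.25, 9.99], [0.25, 0.333, 0.25])}

matches = matching_combinations(file1, (78, 104), file2, (20, 78))
print(matches)  # [(1.56, 0.25), (1.25, 0.333)]
```

With the data keyed by tuple, each lookup is a single dictionary access, so the cost is dominated by the number of rows stored under the two keys rather than by the size of the files.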

Thanks for any advice, code, and help on how to represent and store this data.

If you ask, I can post the code that stores and reads the .txt files, but I think we will all agree that this is not the best approach for this problem.

Answer 1

import pandas as pd
ex_first = [((78, 104), 1.55, 0.25),
    ((78, 104), 1.56, 0.25),
    ((78, 104), 1.57, 0.25),
    ((78, 104), 1.58, 0.25),
    ((75, 100), 5.02, 0.25),
    ((75, 100), 5.03, 0.25),
    ((75, 100), 5.04, 0.25),
    ((75, 100), 5.05, 0.25),
    ((78, 104), 1.25, 0.333),
    ((78, 104), 1.26, 0.333)]

data = pd.DataFrame(ex_first)
data.columns = ["tuple1", "name1", "name2"]
# saving
data.to_csv("DataFrame.csv")

To retrieve the rows whose tuple is (78, 104), filter on that column:

In [67]: results = data[data.tuple1==(78, 104)]

In [68]: results
Out[68]: 
      tuple1  name1  name2
0  (78, 104)   1.55  0.250
1  (78, 104)   1.56  0.250
2  (78, 104)   1.57  0.250
3  (78, 104)   1.58  0.250
8  (78, 104)   1.25  0.333
9  (78, 104)   1.26  0.333

If the CSV file is large, look into reading it in chunks.
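One thing to keep in mind is that a CSV file stores the tuple column as text such as "(78, 104)", so it has to be parsed back into a tuple after reading. pandas.read_csv has chunksize and converters parameters for this kind of job; the sketch below deliberately uses the standard library csv module instead, to show the same idea (read a little at a time, parse the tuple, keep only the rows you need) without depending on pandas. The function name rows_for_key is made up, and the column names tuple1, name1 and name2 are the ones chosen in the code above.

```python
import ast
import csv


def rows_for_key(path, key):
    """Yield (key, name1, name2) for every CSV row whose tuple1 column
    equals `key`, reading one row at a time instead of the whole file."""
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            # The tuple was written as text, e.g. "(78, 104)"; parse it safely.
            if ast.literal_eval(row["tuple1"]) == key:
                yield key, float(row["name1"]), float(row["name2"])
```

For example, list(rows_for_key("DataFrame.csv", (78, 104))) would collect only the matching rows. Because every lookup scans the file once, this suits occasional queries better than repeated ones.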


Answer 1: accepted, 2018-03-06 11:08:39
