
Large nested lists versus dictionaries

Please could I solicit some general advice regarding Python lists. I know I shouldn't ask 'open' questions on here, but I am worried about setting off on completely the wrong path.

My problem is that I have .csv files that are approximately 600,000 lines long each. Each row of the .csv has 6 fields, of which the first field is a date-time stamp in the format DD/MM/YYYY HH:MM:SS. The next two fields are blank and the last three fields contain float and integer values, so for example:

23/05/2017 16:42:17,  ,   , 1.25545, 1.74733, 12 
23/05/2017 16:42:20,  ,   , 1.93741, 1.52387, 14 
23/05/2017 16:42:23,  ,   , 1.54875, 1.46258, 11

etc.

No two values in column 1 (date-time stamp) will ever be the same.

I need to write a program that will do a few basic operations with the data, such as:

  1. read all of the data into a dictionary, list, set (?) etc. as appropriate.
  2. search through the date-time stamp column for a particular value.
  3. read through the list and do basic calculations on the floats in columns 4 and 5.
  4. write a new list based on the searches/calculations.

My question is: how should I 'handle' the data, and am I likely to run into problems due to the length of the dataset?

For example, should I import all of the data into a list, where each element of the list is a sublist holding one row's data? E.g.:

[['23/05/2017 16:42:17', '', '', 1.25545, 1.74733, 12], ['23/05/2017 16:42:20', '', '', 1.93741, 1.52387, 14], ...]

Or would it be better to make each date-time stamp the 'key' in a dictionary and make the dictionary 'value' a list with all the other values, e.g.:

{'23/05/2017 16:42:17': ['', '', 1.25545, 1.74733, 12], ...}

If I use the list approach, is there a way to get Python to 'search' in only the first column for a particular time stamp, rather than making it search through 600,000 rows times 6 columns when we know that only the first column contains timestamps?

I apologize if my query is a little vague, but I would appreciate any guidance that anyone can offer.

600,000 lines aren't that many; your script should run fine with either a list or a dict.
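For reference, loading one of these files could look roughly like the sketch below. It assumes the layout described in the question (timestamp, two blank fields, two floats, one integer); the `load_rows` helper and the in-memory `sample` are hypothetical stand-ins for a real file.

```python
import csv
import io
from datetime import datetime

# Stand-in for a real file; with an actual CSV you would pass
# open(path, newline="") instead (the path is up to you).
sample = io.StringIO(
    "23/05/2017 16:42:17,  ,   , 1.25545, 1.74733, 12\n"
    "23/05/2017 16:42:20,  ,   , 1.93741, 1.52387, 14\n"
    "23/05/2017 16:42:23,  ,   , 1.54875, 1.46258, 11\n"
)


def load_rows(lines):
    """Return [datetime, float, float, int] rows, dropping the two blank fields."""
    rows = []
    for record in csv.reader(lines):
        if not record:  # skip completely empty lines
            continue
        stamp = datetime.strptime(record[0].strip(), "%d/%m/%Y %H:%M:%S")
        # float() and int() tolerate the surrounding spaces in each field.
        rows.append([stamp, float(record[3]), float(record[4]), int(record[5])])
    return rows


rows = load_rows(sample)
print(len(rows), rows[0])
```

Parsing the timestamps once while loading means every later lookup or comparison works on real `datetime` objects rather than on strings.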

As a test, let's use:

data = [["2017-05-02 17:28:24", 0.85260, 1.16218, 7],
["2017-05-04 05:40:07", 0.72118, 0.47710, 15],
["2017-05-07 19:27:53", 1.79476, 0.47496, 14],
["2017-05-09 01:57:10", 0.44123, 0.13711, 16],
["2017-05-11 07:22:57", 0.17481, 0.69468, 0],
["2017-05-12 10:11:01", 0.27553, 0.47834, 4],
["2017-05-15 05:20:36", 0.01719, 0.51249, 7],
["2017-05-17 14:01:13", 0.35977, 0.50052, 7],
["2017-05-17 22:05:33", 1.68628, 1.90881, 13],
["2017-05-18 14:44:14", 0.32217, 0.96715, 14],
["2017-05-18 20:24:23", 0.90819, 0.36773, 5],
["2017-05-21 12:15:20", 0.49456, 1.12508, 5],
["2017-05-22 07:46:18", 0.59015, 1.04352, 6],
["2017-05-26 01:49:38", 0.44455, 0.26669, 13],
["2017-05-26 18:55:24", 1.33678, 1.24181, 7]]

dict

If you're looking for exact timestamps, a lookup will be much faster with a dict than with a list: a dict lookup takes roughly constant time on average, while searching a list means scanning it row by row. You have to know exactly what you're looking for though: "23/05/2017 16:42:17" has a completely different hash than "23/05/2017 16:42:18".

# Key: the timestamp string; value: the remaining fields of the row.
data_as_dict = {row[0]: row[1:] for row in data}
print(data_as_dict)
# {'2017-05-21 12:15:20': [0.49456, 1.12508, 5], '2017-05-18 14:44:14': [0.32217, 0.96715, 14], '2017-05-04 05:40:07': [0.72118, 0.4771, 15], '2017-05-26 01:49:38': [0.44455, 0.26669, 13], '2017-05-17 14:01:13': [0.35977, 0.50052, 7], '2017-05-15 05:20:36': [0.01719, 0.51249, 7], '2017-05-26 18:55:24': [1.33678, 1.24181, 7], '2017-05-07 19:27:53': [1.79476, 0.47496, 14], '2017-05-17 22:05:33': [1.68628, 1.90881, 13], '2017-05-02 17:28:24': [0.8526, 1.16218, 7], '2017-05-22 07:46:18': [0.59015, 1.04352, 6], '2017-05-11 07:22:57': [0.17481, 0.69468, 0], '2017-05-18 20:24:23': [0.90819, 0.36773, 5], '2017-05-12 10:11:01': [0.27553, 0.47834, 4], '2017-05-09 01:57:10': [0.44123, 0.13711, 16]}

print(data_as_dict.get('2017-05-17 14:01:13'))
# [0.35977, 0.50052, 7]

print(data_as_dict.get('2017-05-17 14:01:10'))
# None

Note that your DD/MM/YYYY HH:MM:SS format isn't very convenient: sorting the cells lexicographically won't sort them by datetime. You'd need to parse them with datetime.strptime() first. The test data above is already in YYYY-MM-DD order, hence the '%Y-%m-%d %H:%M:%S' format below; for your files the format string would be '%d/%m/%Y %H:%M:%S'.

from datetime import datetime
data_as_dict = {datetime.strptime(l[0], '%Y-%m-%d %H:%M:%S'): l[1:] for l in data}    
print(data_as_dict.get(datetime(2017,5,17,14,1,13)))
# [0.35977, 0.50052, 7]

print(data_as_dict.get(datetime(2017,5,17,14,1,10)))
# None

list with binary search

If you're looking for timestamp ranges, a dict won't help you much. A binary search (e.g. with bisect) on a sorted list of timestamps should be very fast.

import bisect
timestamps = [datetime.strptime(l[0], '%Y-%m-%d %H:%M:%S') for l in data]
i = bisect.bisect(timestamps, datetime(2017,5,17,14,1,10))
print(data[i-1])
# ['2017-05-15 05:20:36', 0.01719, 0.51249, 7]
print(data[i])
# ['2017-05-17 14:01:13', 0.35977, 0.50052, 7]
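To pull out every row inside a time window, one option is to combine `bisect_left` and `bisect_right` on the same sorted list. The sketch below is self-contained and uses a shortened copy of the test data; `rows_between` is a hypothetical helper name, and the list is assumed to be sorted by timestamp already.

```python
import bisect
from datetime import datetime

# Shortened copy of the test data, already sorted by timestamp.
data = [["2017-05-15 05:20:36", 0.01719, 0.51249, 7],
        ["2017-05-17 14:01:13", 0.35977, 0.50052, 7],
        ["2017-05-17 22:05:33", 1.68628, 1.90881, 13],
        ["2017-05-18 14:44:14", 0.32217, 0.96715, 14],
        ["2017-05-18 20:24:23", 0.90819, 0.36773, 5]]

timestamps = [datetime.strptime(row[0], "%Y-%m-%d %H:%M:%S") for row in data]


def rows_between(start, end):
    """Return the rows whose timestamp t satisfies start <= t <= end."""
    lo = bisect.bisect_left(timestamps, start)   # first index with t >= start
    hi = bisect.bisect_right(timestamps, end)    # first index with t > end
    return data[lo:hi]


selected = rows_between(datetime(2017, 5, 17), datetime(2017, 5, 18, 15, 0, 0))
for row in selected:
    print(row)

# A basic calculation on the first float column of the selected rows.
mean_first = sum(row[1] for row in selected) / len(selected)
print(mean_first)
```

Each window costs two binary searches plus a slice, so the work grows with the size of the window rather than with the 600,000 rows.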

Database

Before reinventing the wheel, you might want to dump all your CSVs into a small database (SQLite, PostgreSQL, ...) and use the corresponding queries.
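As a rough illustration, here is what that could look like with the `sqlite3` module from the standard library. The table name `readings`, the column names `ts`, `a`, `b`, `n` and the `to_iso` helper are all made up for this sketch, and an in-memory database stands in for a file on disk.

```python
import sqlite3
from datetime import datetime

# Rows as they might come out of the CSV (blank fields already dropped).
raw_rows = [
    ("23/05/2017 16:42:17", 1.25545, 1.74733, 12),
    ("23/05/2017 16:42:20", 1.93741, 1.52387, 14),
    ("23/05/2017 16:42:23", 1.54875, 1.46258, 11),
]


def to_iso(stamp):
    """Convert DD/MM/YYYY HH:MM:SS to an ISO string that sorts chronologically."""
    parsed = datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")
    return parsed.strftime("%Y-%m-%d %H:%M:%S")


conn = sqlite3.connect(":memory:")  # use a filename to persist the data
conn.execute(
    "CREATE TABLE readings (ts TEXT PRIMARY KEY, a REAL, b REAL, n INTEGER)"
)
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?, ?)",
    [(to_iso(ts), a, b, n) for ts, a, b, n in raw_rows],
)

# Exact lookup; the primary key gives the timestamp column an index.
exact = conn.execute(
    "SELECT a, b, n FROM readings WHERE ts = ?", ("2017-05-23 16:42:20",)
).fetchone()
print(exact)

# Range lookup; ISO strings compare in chronological order.
in_range = conn.execute(
    "SELECT ts FROM readings WHERE ts BETWEEN ? AND ? ORDER BY ts",
    ("2017-05-23 16:42:18", "2017-05-23 16:42:30"),
).fetchall()
print(in_range)
```

Storing the timestamps as ISO strings is a deliberate choice here: plain text comparison then matches chronological order, which the original DD/MM/YYYY format does not.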

Pandas

If you don't want the added complexity of a database but are ready to invest some time learning a new syntax, you should use pandas.DataFrame. It does exactly what you want, and then some.
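A minimal sketch of the same workflow in pandas follows, assuming pandas is installed. The column names (`timestamp`, `blank1`, `blank2`, `a`, `b`, `n`) are invented for this example, and the in-memory `sample` stands in for a real file path.

```python
import io

import pandas as pd

# Stand-in for a real file; pd.read_csv also accepts a path.
sample = io.StringIO(
    "23/05/2017 16:42:17,  ,   , 1.25545, 1.74733, 12\n"
    "23/05/2017 16:42:20,  ,   , 1.93741, 1.52387, 14\n"
    "23/05/2017 16:42:23,  ,   , 1.54875, 1.46258, 11\n"
)

df = pd.read_csv(
    sample,
    header=None,
    names=["timestamp", "blank1", "blank2", "a", "b", "n"],
    usecols=["timestamp", "a", "b", "n"],  # skip the two blank fields
    skipinitialspace=True,                 # ignore spaces after each comma
)
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%d/%m/%Y %H:%M:%S")
df = df.set_index("timestamp").sort_index()

# Exact lookup by timestamp.
exact = df.loc[pd.Timestamp("2017-05-23 16:42:20")]
print(exact)

# Range lookup: slicing a sorted DatetimeIndex includes both endpoints.
window = df.loc["2017-05-23 16:42:18":"2017-05-23 16:42:25"]
print(window)

# Basic calculation on one of the float columns.
mean_a = df["a"].mean()
print(mean_a)
```

Indexing the frame by timestamp covers both the exact-match and the range use cases, and the result of any selection is itself a DataFrame that can be written back out with `to_csv`.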
