繁体   English   中英

python中的大词典超出了RAM的容量

[英]large dictionary in python exceed RAM capacity

我对Python的功能有疑问。 我有一个非常大的数据集(200 GB),我将使用python遍历行,将数据存储在字典中,然后执行一些计算。 最后,我将计算出的数据写入CSV文件。 我关心的是计算机的容量。 恐怕(或可以肯定)我的RAM无法存储该大数据集。 有没有更好的办法? 这是输入数据的结构:

#RIC    Date[L] Time[L] Type    ALP-L1-BidPrice ALP-L1-BidSize  ALP-L1-AskPrice ALP-L1-AskSize  ALP-L2-BidPrice ALP-L2-BidSize  ALP-L2-AskPrice ALP-L2-AskSize  ALP-L3-BidPrice ALP-L3-BidSize  ALP-L3-AskPrice ALP-L3-AskSize  ALP-L4-BidPrice ALP-L4-BidSize  ALP-L4-AskPrice ALP-L4-AskSize  ALP-L5-BidPrice ALP-L5-BidSize  ALP-L5-AskPrice ALP-L5-AskSize  TOR-L1-BidPrice TOR-L1-BidSize  TOR-L1-AskPrice TOR-L1-AskSize  TOR-L2-BidPrice TOR-L2-BidSize  TOR-L2-AskPrice TOR-L2-AskSize  TOR-L3-BidPrice TOR-L3-BidSize  TOR-L3-AskPrice TOR-L3-AskSize  TOR-L4-BidPrice TOR-L4-BidSize  TOR-L4-AskPrice TOR-L4-AskSize  TOR-L5-BidPrice TOR-L5-BidSize  TOR-L5-AskPrice TOR-L5-AskSize
HOU.ALP 20150901    30:10.8 Market Depth    5.29    50000   5.3 16000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000   5.29    50000   5.3 46000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000
HOU.ALP 20150901    30:10.8 Market Depth    5.29    50000   5.3 22000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000   5.29    50000   5.3 36000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000
HOU.ALP 20150901    30:10.8 Market Depth    5.29    50000   5.3 32000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000   5.29    50000   5.3 40000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000
HOU.ALP 20150901    30:10.8 Market Depth    5.29    50000   5.3 44000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000   5.29    50000   5.3 36000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000
HOU.ALP 20150901    30:12.1 Market Depth    5.29    50000   5.3 32000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000   5.29    50000   5.3 46000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000
HOU.ALP 20150901    30:12.1 Market Depth    5.29    50000   5.3 38000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000   5.29    50000   5.3 36000   5.28    50000   5.31    50000   5.27    50000   5.32    50000   5.26    50000   5.33    50000           5.34    50000

这是我尝试做的事情:1.读取ta数据并将其存储在带有键[symbol] [time] [bid]和[ask]等的词典中2.在任何时间点,找到最佳出价,然后最佳要价(这需要水平/在我不知道的键中的值中排序),因为要价和要价来自不同的交易所,我们需要找到最佳价格并将其从最佳到最差进行排序以及该特定价格的交易量。 3.导出到一个csv文件。

这是我对密码的尝试。 请帮助我更有效地编写它:

# this file calculate the depth up to $50,000

import csv
from math import ceil
from collections import defaultdict

# open csv file
csv_file = open('2016_01_04-data_3_stocks.csv', 'rU')
reader = csv.DictReader(csv_file)

# Set variables:
date = None
exchange_depth = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(float))))
effective_spread = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(float))))
time_bucket = [i * 100000.0 for i in range(0, 57600000000 / 100000)]

# Set functions
def time_to_milli(times):
    hours = float(times.split(':')[0]) * 60 * 60 * 1000000
    minutes = float(times.split(':')[1]) * 60 * 1000000
    seconds = float(times.split(':')[2]) * 1000000
    milliseconds = float(times.split('.')[1])
    timestamp = hours + minutes + seconds + milliseconds
    return timestamp


# Extract data
for i in reader:
    if not bool(date):
        date = i['Date[L]'][0:4] + "-" + i['Date[L]'][4:6] + "-" + i['Date[L]'][6:8]
    security = i['#RIC'].split('.')[0]
    exchange = i['#RIC'].split('.')[1]
    timestamp = float(time_to_milli(i['Time[L]']))
    bucket = ceil(float(time_to_milli(i['Time[L]'])) / 100000.0) * 100000.0
    # input bid price and bid size
    exchange_depth[security][bucket][Bid][i['ALP-L1-BidPrice']] += i['ALP-L1-BidSize']
    exchange_depth[security][bucket][Bid][i['ALP-L2-BidPrice']] += i['ALP-L2-BidSize']
    exchange_depth[security][bucket][Bid][i['ALP-L3-BidPrice']] += i['ALP-L3-BidSize']
    exchange_depth[security][bucket][Bid][i['ALP-L4-BidPrice']] += i['ALP-L4-BidSize']
    exchange_depth[security][bucket][Bid][i['ALP-L5-BidPrice']] += i['ALP-L5-BidSize']
    exchange_depth[security][bucket][Bid][i['TOR-L1-BidPrice']] += i['TOR-L1-BidSize']
    exchange_depth[security][bucket][Bid][i['TOR-L2-BidPrice']] += i['TOR-L2-BidSize']
    exchange_depth[security][bucket][Bid][i['TOR-L3-BidPrice']] += i['TOR-L3-BidSize']
    exchange_depth[security][bucket][Bid][i['TOR-L4-BidPrice']] += i['TOR-L4-BidSize']
    exchange_depth[security][bucket][Bid][i['TOR-L5-BidPrice']] += i['TOR-L5-BidSize']
    # input ask price and ask size
    exchange_depth[security][bucket][Ask][i['ALP-L1-AskPrice']] += i['ALP-L1-AskSize']
    exchange_depth[security][bucket][Ask][i['ALP-L2-AskPrice']] += i['ALP-L2-AskSize']
    exchange_depth[security][bucket][Ask][i['ALP-L3-AskPrice']] += i['ALP-L3-AskSize']
    exchange_depth[security][bucket][Ask][i['ALP-L4-AskPrice']] += i['ALP-L4-AskSize']
    exchange_depth[security][bucket][Ask][i['ALP-L5-AskPrice']] += i['ALP-L5-AskSize']
    exchange_depth[security][bucket][Ask][i['TOR-L1-AskPrice']] += i['TOR-L1-AskSize']
    exchange_depth[security][bucket][Ask][i['TOR-L2-AskPrice']] += i['TOR-L2-AskSize']
    exchange_depth[security][bucket][Ask][i['TOR-L3-AskPrice']] += i['TOR-L3-AskSize']
    exchange_depth[security][bucket][Ask][i['TOR-L4-AskPrice']] += i['TOR-L4-AskSize']
    exchange_depth[security][bucket][Ask][i['TOR-L5-AskPrice']] += i['TOR-L5-AskSize']
# Now rank bid price and ask price among exchange_depth[security][bucket][Bid] and exchange_depth[security][bucket][Ask] keys
    #I don't know how to do this

根据您告诉我们的内容,您可以执行以下操作:

import csv
with open("path/to/my_dataset", 'r') as input_f, open("output.csv", 'a') as output_f:
    # Keep reading lines from input data until you run out
    for line in f:
        # do processing and add to processed
        processed = []

        # write processed data to output file
        csv.writer(output_f).writerow(processed)

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM