
Python - avoiding memory error with HUGE data set

I have a Python program that connects to a PostgreSQL database. In this database I have quite a lot of data (around 1.2 billion rows). Luckily, I don't have to analyse all of those rows at the same time.

Those 1.2 billion rows are spread across several tables (around 30). Currently I am accessing a table called table_3, in which I want to access all the rows that have a specific "did" value (as the column is called).

I have counted the rows using a SQL command:

SELECT count(*) FROM table_3 WHERE did='356002062376054';

which returns 157 million rows.

I will perform some "analysis" on all of these rows (extracting 2 specific values), do some calculations on those values, write them to a dictionary, and then save them back to PostgreSQL in a different table.

The problem is that I am creating a lot of lists and dictionaries to manage all this, and I end up running out of memory even though I am using 64-bit Python 3 and have 64 GB of RAM.

Some code:

from datetime import datetime  # needed by format_table_dictionary

import psycopg2

CONNECTION = psycopg2.connect('<psycopg2 formatted string>')
CURSOR = CONNECTION.cursor()

DID_LIST = ["357139052424715",
            "353224061929963",
            "356002064810514",
            "356002064810183",
            "358188051768472",
            "358188050598029",
            "356002061925067",
            "358188056470108",
            "356002062376054",
            "357460064130045"]

SENSOR_LIST = [1, 2, 3, 4, 5, 6, 7, 8, 9,
               10, 11, 12, 13, 801, 900, 901,
               902, 903, 904, 905, 906, 907,
               908, 909, 910, 911]

for did in DID_LIST:
    table_name = did
    for sensor_id in SENSOR_LIST:
        rows = get_data(did, sensor_id)
        list_object = create_standard_list(sensor_id, rows)  # Happens here
        formatted_list = format_table_dictionary(list_object) # Or here
        pushed_rows = write_to_table(table_name, formatted_list) # write_to_table method is omitted as that is not my problem.

def get_data(did, table_id):
    """Getting data from postgresql."""
    table_name = "table_{0}".format(table_id)
    query = """SELECT * FROM {0} WHERE did='{1}'
               ORDER BY timestamp""".format(table_name, did)

    CURSOR.execute(query)
    CONNECTION.commit()

    return CURSOR

def create_standard_list(sensor_id, data):
    """Formats DB data to dictionary"""
    list_object = []

    print("Create standard list")
    for row in data: # data is the psycopg2 CURSOR
        row_timestamp = row[2]
        row_data = row[3]

        temp_object = {"sensor_id": sensor_id, "timestamp": row_timestamp,
                       "data": row_data}

        list_object.append(temp_object)

    return list_object


def format_table_dictionary(list_dict):
    """Formats dictionary to simple data
       table_name = (dates, data_count, first row)"""
    print("Formatting dict to DB")
    temp_today = 0
    dict_list = []
    first_row = {}
    count = 1

    for elem in list_dict:
        # convert to seconds
        date = datetime.fromtimestamp(elem['timestamp'] / 1000)
        today = int(date.strftime('%d'))
        if temp_today != today:  # compare values; 'is not' tests identity
            if not first_row:
                first_row = elem['data']
            first_row_str = str(first_row)
            dict_object = {"sensor_id": elem['sensor_id'],
                           "date": date.strftime('%d/%m-%Y'),
                           "reading_count": count,
                           # approximate size in KB of the day's data
                           "approx_data_size": (count*len(first_row_str)/1000),
                           "time": date.strftime('%H:%M:%S'),
                           "first_row": first_row}

            dict_list.append(dict_object)
            first_row = {}
            temp_today = today
            count = 0
        else:
            count += 1

    return dict_list

My error happens somewhere around the creation of either of the two lists, as marked with comments in my code. It manifests as my computer becoming unresponsive and eventually logging me out. I am running Windows 10, if that is of any importance.

I know the first list I create with the "create_standard_list" method could be eliminated and that code could run inside "format_table_dictionary", thereby avoiding a list of 157 million elements in memory. But I think some of the other tables that I will run into will have similar problems and might be even larger, so I thought of optimizing it all right now, and I am unsure of what I could do?

I guess writing to a file wouldn't really help a whole lot, as I would have to read that file and thereby put it all back into memory again?

Minimalist example

I have a table

---------------------------------------------------------------
| Row 1 | did | timestamp | data | unused value | unused value |
| Row 2 | did | timestamp | data | unused value | unused value |
| ...                                                          |
---------------------------------------------------------------

table = [{ values from above row1 }, { values from above row2},...]

connection = psycopg2.connect(<connection string>)
cursor = connection.cursor()

cursor.execute("""SELECT * FROM table_3 WHERE did='356002062376054'
                  ORDER BY timestamp""")
table = cursor  # execute() returns None in psycopg2; iterate the cursor itself

extracted_list = extract(table)
calculated_list = calculate(extracted_list)
... write to db ...

def extract(table):
    """extract all but unused values"""
    new_list = []
    for row in table:
        did = row[0]
        timestamp = row[1]
        data = row[2]

        a_dict = {'did': did, 'timestamp': timestamp, 'data': data}
        new_list.append(a_dict)

    return new_list


def calculate(a_list):
    """perform calculations on values"""
    dict_list = []
    temp_today = 0
    count = 0
    for row in a_list:
        date = datetime.fromtimestamp(row['timestamp'] / 1000) # from ms to sec
        today = int(date.strftime('%d'))
        if temp_today != today:
            new_dict = {'date': date.strftime('%d/%m-%Y'),
                        'reading_count': count,
                        'time': date.strftime('%H:%M:%S')}
            dict_list.append(new_dict)

    return dict_list

create_standard_list() and format_table_dictionary() could be built as generators (yielding each item instead of returning the full lists). This stops the whole lists from being held in memory and so should solve your issue. For example:

def create_standard_list(sensor_id, data):
    for row in data:
        row_timestamp = row[2]
        row_data = row[3]

        temp_object = {"sensor_id": sensor_id, "timestamp": row_timestamp,
                       "data": row_data}
        yield temp_object
       #^ yield each item instead of appending to a list
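
The calling loop stays exactly the same; each call now returns a lazy generator, and rows only flow through the pipeline when write_to_table() iterates over its input (assuming it accepts any iterable). A sketch of format_table_dictionary() rewritten the same way, with the count/first_row bookkeeping elided for brevity:

def format_table_dictionary(list_dict):
    temp_today = 0
    for elem in list_dict:
        date = datetime.fromtimestamp(elem['timestamp'] / 1000)
        today = int(date.strftime('%d'))
        if temp_today != today:                        # same day-boundary test as before
            yield {"sensor_id": elem['sensor_id'],
                   "date": date.strftime('%d/%m-%Y')}  # ... plus the other fields
            temp_today = today

rows = get_data(did, sensor_id)                           # cursor, already iterable
list_object = create_standard_list(sensor_id, rows)       # generator, nothing in memory yet
formatted_list = format_table_dictionary(list_object)     # still lazy
pushed_rows = write_to_table(table_name, formatted_list)  # rows stream through one at a time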

Further information on generators and the yield keyword.
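
As an aside, generators alone may not be enough here: psycopg2's default client-side cursor still fetches the entire result set into client memory when execute() runs. Giving the cursor a name makes it a server-side cursor that streams rows from the server in batches instead; a minimal sketch:

cursor = CONNECTION.cursor(name='big_read')  # named cursors are server-side
cursor.itersize = 10000                      # rows fetched per network round trip
cursor.execute("SELECT * FROM table_3 WHERE did = %s ORDER BY timestamp",
               ('356002062376054',))
for row in cursor:                           # fetches batches lazily, never all rows at once
    ...                                      # hand each row to the generator pipeline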

What you are trying to do here, IIUC, is to emulate an SQL GROUP BY expression in Python code. This can never be as quick and memory-efficient as doing it directly in the database. Your example code seems to have some issues, but I understand it as: you want to compute the count of rows per day, for each day that occurs for a given did. Also, you are interested in the minimum (or maximum, or median, it does not matter) time of day for each group of values, i.e. for each day.

Let's set up a small example table (tested on Oracle):

create table t1 (id number primary key, created timestamp, did number, other_data varchar2(200));  

insert into t1 values (1, to_timestamp('2017-01-31 17:00:00', 'YYYY-MM-DD HH24:MI:SS'), 9001, 'some text');
insert into t1 values (2, to_timestamp('2017-01-31 19:53:00', 'YYYY-MM-DD HH24:MI:SS'), 9001, 'some more text');
insert into t1 values (3, to_timestamp('2017-02-01 08:10:00', 'YYYY-MM-DD HH24:MI:SS'), 9001, 'another day');
insert into t1 values (4, to_timestamp('2017-02-01 15:55:00', 'YYYY-MM-DD HH24:MI:SS'), 9001, 'another day, rainy afternoon');
insert into t1 values (5, to_timestamp('2017-02-01 15:59:00', 'YYYY-MM-DD HH24:MI:SS'), 9002, 'different did');
insert into t1 values (6, to_timestamp('2017-02-03 01:01:00', 'YYYY-MM-DD HH24:MI:SS'), 9001, 'night shift');

We have some rows, spread over several days, for did 9001. There's also a row for did 9002, which we'll ignore. Now let's get the rows that you want to write into your second table, as a simple SELECT .. GROUP BY:

select 
    count(*) cnt, 
    to_char(created, 'YYYY-MM-DD') day, 
    min(to_char(created, 'HH24:MI:SS')) min_time 
from t1 
where did = 9001
group by to_char(created, 'YYYY-MM-DD')
;

We are grouping all rows by the day of their created column (a timestamp). We select the number of rows per group, the day itself, and - just for fun - the minimum time part of each group. Result:

cnt day         min_time
2   2017-02-01  08:10:00
1   2017-02-03  01:01:00
2   2017-01-31  17:00:00

So now you have your second table as a SELECT. Creating a table from it is trivial:

create table t2 as
select
    ... as above
;
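
Since the question uses PostgreSQL rather than Oracle, the same statement should carry over almost verbatim; a sketch in PostgreSQL syntax (to_char() and CREATE TABLE ... AS behave the same way there):

create table t2 as
select
    count(*)                            as cnt,
    to_char(created, 'YYYY-MM-DD')      as day,
    min(to_char(created, 'HH24:MI:SS')) as min_time
from t1
where did = 9001
group by to_char(created, 'YYYY-MM-DD');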

HTH!
