I have a Python program that connects to a PostgreSQL database. The database holds quite a lot of data (around 1.2 billion rows). Luckily, I don't have to analyse all of those rows at the same time.
Those 1.2 billion rows are spread across several tables (around 30). Currently I am accessing a table called table_3, from which I want to fetch all rows that have a specific "did" value (as the column is called).
I have counted the rows using a SQL command:
SELECT count(*) FROM table_3 WHERE did='356002062376054';
which returns about 157 million rows.
I will perform some "analysis" on all of these rows (extracting two specific values), do some calculations on those values, write them to a dictionary, and then save them back to PostgreSQL in a different table.
The problem is that I create so many lists and dictionaries while managing all this that I end up running out of memory, even though I am using 64-bit Python 3 and have 64 GB of RAM.
Some code:
CONNECTION = psycopg2.connect('<psycopg2 formatted string>')
CURSOR = CONNECTION.cursor()

DID_LIST = ["357139052424715",
            "353224061929963",
            "356002064810514",
            "356002064810183",
            "358188051768472",
            "358188050598029",
            "356002061925067",
            "358188056470108",
            "356002062376054",
            "357460064130045"]
SENSOR_LIST = [1, 2, 3, 4, 5, 6, 7, 8, 9,
               10, 11, 12, 13, 801, 900, 901,
               902, 903, 904, 905, 906, 907,
               908, 909, 910, 911]

for did in DID_LIST:
    table_name = did
    for sensor_id in SENSOR_LIST:
        rows = get_data(did, sensor_id)
        list_object = create_standard_list(sensor_id, rows)    # Happens here
        formatted_list = format_table_dictionary(list_object)  # Or here
        pushed_rows = write_to_table(table_name, formatted_list)  # write_to_table omitted as that is not my problem
def get_data(did, table_id):
    """Getting data from PostgreSQL."""
    table_name = "table_{0}".format(table_id)
    query = """SELECT * FROM {0} WHERE did='{1}'
               ORDER BY timestamp""".format(table_name, did)
    CURSOR.execute(query)
    CONNECTION.commit()
    return CURSOR
def create_standard_list(sensor_id, data):
    """Formats DB data to dictionary"""
    list_object = []
    print("Create standard list")
    for row in data:  # data is the psycopg2 CURSOR
        row_timestamp = row[2]
        row_data = row[3]
        temp_object = {"sensor_id": sensor_id, "timestamp": row_timestamp,
                       "data": row_data}
        list_object.append(temp_object)
    return list_object
def format_table_dictionary(list_dict):
    """Formats dictionary to simple data
    table_name = (dates, data_count, first row)"""
    print("Formatting dict to DB")
    temp_today = 0
    dict_list = []
    first_row = {}
    count = 1
    for elem in list_dict:
        # convert to seconds
        date = datetime.fromtimestamp(elem['timestamp'] / 1000)
        today = int(date.strftime('%d'))
        if temp_today is not today:
            if not first_row:
                first_row = elem['data']
            first_row_str = str(first_row)
            dict_object = {"sensor_id": elem['sensor_id'],
                           "date": date.strftime('%d/%m-%Y'),
                           "reading_count": count,
                           # size in MB of data
                           "approx_data_size": (count * len(first_row_str) / 1000),
                           "time": date.strftime('%H:%M:%S'),
                           "first_row": first_row}
            dict_list.append(dict_object)
            first_row = {}
            temp_today = today
            count = 0
        else:
            count += 1
    return dict_list
My error happens somewhere around creating either of the two lists, as marked with comments in my code. It manifests as my computer stopping responding and eventually logging me out. I am running Windows 10, if that is of some importance.
I know the first list I create with the "create_standard_list" method could be dropped and that code run inside "format_table_dictionary", thereby avoiding a list with 157 million elements in memory, but some of the other tables I will run into will have similar problems and might be even larger, so I thought of optimizing it all right now. I am unsure of what I could do.
I guess writing to a file wouldn't really help a whole lot, as I would have to read that file back and thereby put it all into memory again?
I have a table

---------------------------------------------------------------
|Row 1 | did | timestamp | data | unused value | unused value |
|Row 2 | did | timestamp | data | unused value | unused value |
....
---------------------------------------------------------------

table = [{ values from row 1 }, { values from row 2 }, ...]
connection = psycopg2.connect(<connection string>)
cursor = connection.cursor()
table = cursor.execute("""SELECT * FROM table_3 WHERE did='356002062376054'
                          ORDER BY timestamp""")
extracted_list = extract(table)
calculated_list = calculate(extracted_list)
... write to db ...

def extract(table):
    """extract all but unused values"""
    new_list = []
    for row in table:
        did = row[0]
        timestamp = row[1]
        data = row[2]
        a_dict = {'did': did, 'timestamp': timestamp, 'data': data}
        new_list.append(a_dict)
    return new_list
def calculate(a_list):
    """perform calculations on values"""
    dict_list = []
    temp_today = 0
    count = 0
    for row in a_list:
        date = datetime.fromtimestamp(row['timestamp'] / 1000)  # from ms to sec
        today = int(date.strftime('%d'))
        if temp_today is not today:
            new_dict = {'date': date.strftime('%d/%m-%Y'),
                        'reading_count': count,
                        'time': date.strftime('%H:%M:%S')}
            dict_list.append(new_dict)
    return dict_list
create_standard_list() and format_table_dictionary() could be turned into generators (yielding each item instead of returning the full lists); this stops the whole lists being held in memory and so should solve your issue. For example:
def create_standard_list(sensor_id, data):
    for row in data:
        row_timestamp = row[2]
        row_data = row[3]
        temp_object = {"sensor_id": sensor_id, "timestamp": row_timestamp,
                       "data": row_data}
        yield temp_object  # yield each item instead of appending to a list
See the Python documentation for further information on generators and the yield keyword.
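To keep the rest of the pipeline lazy as well, the per-day grouping done in format_table_dictionary can be expressed with itertools.groupby, which consumes the generator one row at a time instead of requiring the whole list. A minimal self-contained sketch of the idea, using made-up in-memory rows in place of a database cursor (the helper names day_of and summarise_per_day are illustrative, not from the original code):

```python
from datetime import datetime
from itertools import groupby

def create_standard_list(sensor_id, data):
    """Lazily turn raw cursor rows into dictionaries, one at a time."""
    for row in data:
        yield {"sensor_id": sensor_id, "timestamp": row[2], "data": row[3]}

def day_of(item):
    """Grouping key: the calendar date of a reading (timestamp is in ms)."""
    return datetime.fromtimestamp(item["timestamp"] / 1000).date()

def summarise_per_day(items):
    """Yield one summary dict per day. Relies on items arriving ordered by
    timestamp (the query uses ORDER BY timestamp), since groupby only
    merges adjacent rows that share a key."""
    for day, group in groupby(items, key=day_of):
        first = next(group)                # first reading of the day
        count = 1 + sum(1 for _ in group)  # drain the rest, counting
        yield {"sensor_id": first["sensor_id"],
               "date": day.strftime("%d/%m-%Y"),
               "reading_count": count,
               "first_row": first["data"]}

# Stand-in for cursor rows: (did, unused, timestamp_ms, data)
rows = [("356002062376054", None, 1485950400000, "a"),  # day 1
        ("356002062376054", None, 1485950401000, "b"),  # day 1, 1 s later
        ("356002062376054", None, 1486209600000, "c")]  # 3 days later
summaries = list(summarise_per_day(create_standard_list(1, rows)))
```

At no point does this hold more than one row (plus one summary per day) in memory, so the 157-million-row result set never has to fit in RAM at once.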
What you are trying to do here, IIUC, is to emulate an SQL GROUP BY expression in Python code. This can never be as quick or as memory-efficient as doing it directly in the database. Your example code seems to have some issues, but I understand it as follows: you want to compute the count of rows per day, for each day that occurs for a given did. You are also interested in the minimum (or maximum, or median, it does not matter) time of day for each group of values, i.e. for each day.
Let's set up a small example table (tested on Oracle):
create table t1 (id number primary key, created timestamp, did number, other_data varchar2(200));
insert into t1 values (1, to_timestamp('2017-01-31 17:00:00', 'YYYY-MM-DD HH24:MI:SS'), 9001, 'some text');
insert into t1 values (2, to_timestamp('2017-01-31 19:53:00', 'YYYY-MM-DD HH24:MI:SS'), 9001, 'some more text');
insert into t1 values (3, to_timestamp('2017-02-01 08:10:00', 'YYYY-MM-DD HH24:MI:SS'), 9001, 'another day');
insert into t1 values (4, to_timestamp('2017-02-01 15:55:00', 'YYYY-MM-DD HH24:MI:SS'), 9001, 'another day, rainy afternoon');
insert into t1 values (5, to_timestamp('2017-02-01 15:59:00', 'YYYY-MM-DD HH24:MI:SS'), 9002, 'different did');
insert into t1 values (6, to_timestamp('2017-02-03 01:01:00', 'YYYY-MM-DD HH24:MI:SS'), 9001, 'night shift');
We have some rows, spread over several days, for did 9001. There's also a row for did 9002, which we'll ignore. Now let's get the rows that you want to write into your second table with a simple SELECT .. GROUP BY:
select
count(*) cnt,
to_char(created, 'YYYY-MM-DD') day,
min(to_char(created, 'HH24:MI:SS')) min_time
from t1
where did = 9001
group by to_char(created, 'YYYY-MM-DD')
;
We are grouping all rows by the day of their created
column (a timestamp). We select the number of rows per group, the day itself, and - just for fun - the minimum time part of each group. Result:
cnt  day         min_time
  2  2017-02-01  08:10:00
  1  2017-02-03  01:01:00
  2  2017-01-31  17:00:00
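The same query should work nearly verbatim on PostgreSQL, which also supports to_char. As a quick way to check the grouping logic without any database server, here is a self-contained sketch of the same aggregation using Python's built-in sqlite3 module (sqlite has no to_char, so its strftime function is used instead; the table and rows mirror the example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""create table t1 (id integer primary key, created text,
                                 did integer, other_data text)""")
conn.executemany(
    "insert into t1 values (?, ?, ?, ?)",
    [(1, "2017-01-31 17:00:00", 9001, "some text"),
     (2, "2017-01-31 19:53:00", 9001, "some more text"),
     (3, "2017-02-01 08:10:00", 9001, "another day"),
     (4, "2017-02-01 15:55:00", 9001, "another day, rainy afternoon"),
     (5, "2017-02-01 15:59:00", 9002, "different did"),
     (6, "2017-02-03 01:01:00", 9001, "night shift")])

# Count rows per day and take the minimum time of day per group,
# exactly as in the Oracle query above.
result = conn.execute("""
    select count(*)                          as cnt,
           strftime('%Y-%m-%d', created)     as day,
           min(strftime('%H:%M:%S', created)) as min_time
    from t1
    where did = 9001
    group by day
    order by day
""").fetchall()
# result: [(2, '2017-01-31', '17:00:00'),
#          (2, '2017-02-01', '08:10:00'),
#          (1, '2017-02-03', '01:01:00')]
```

The row for did 9002 is filtered out by the WHERE clause, and the database does all the grouping, so only three summary rows ever reach Python.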
So now you have your second table as a SELECT. Creating a table from it is trivial:
create table t2 as
select
... as above
;
HTH!