
Optimizing pandas computation

I have 22 million rows of house property sale data in a database table called sale_transactions. I am performing a job where I read information from this table, perform some calculations, and use the results to create entries in a new table. The process looks like this:

for index, row in zipcodes.iterrows():  # ~100k zipcodes
    sql_string = """SELECT * FROM sale_transactions WHERE zipcode = '{ZIPCODE}' """
    sql_query = sql_string.format(ZIPCODE=row['zipcode'])
    df = pd.read_sql(sql_query, _engine)
    area_stat = create_area_stats(df)  # function does calculations
    area_stat.save()  # saves a Django model

At the moment each iteration of this loop takes about 20 seconds on my MacBook Pro (16 GB RAM), which means the code is going to take weeks to finish. The expensive part is the read_sql line.

How can I optimize this? I can't read the whole sale_transactions table into memory (it is about 5 GB), hence the SQL query on each iteration to capture the relevant rows with the WHERE clause.

Most answers about optimizing pandas talk about reading with chunking, but in this case I need the WHERE to run against all the data combined, since the calculations in create_area_stats are things like the number of sales over a ten-year period. I don't have easy access to a machine with loads of RAM, unless I start going to town with EC2, which I worry will be expensive and quite a lot of hassle.

Suggestions would be greatly appreciated.

I also faced a similar problem, and the code below helped me read a database table (~40 million rows) efficiently.

offsetID = 0
totalrow = 0

while True:
    # Keyset pagination: each query fetches the next 100,000 rows after the
    # last row_number already read, which the index turns into a range scan.
    df_Batch = pd.read_sql_query(
        "SET work_mem = '1024MB'; "
        "SELECT * FROM " + tableName +
        " WHERE row_number > " + str(offsetID) +
        " ORDER BY row_number LIMIT 100000",
        con=engine,
    )
    if df_Batch.empty:
        break  # no more rows to read

    offsetID = offsetID + len(df_Batch)

    # your operation on df_Batch

    totalrow = totalrow + len(df_Batch)

You have to create an index on a row_number column in your table, so this code reads the table 100,000 rows at a time by index. For example, when you want to read rows 200,000 - 210,000, it doesn't scan from 0 to 210,000; it jumps straight to the right rows through the index, which improves performance.
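For example, a minimal sketch of creating that index, assuming PostgreSQL (the set work_mem hint in the snippet above suggests it), an existing integer row_number column, and a placeholder connection URL and index name:

from sqlalchemy import create_engine, text

# Placeholder connection URL; adjust for your own database.
engine = create_engine("postgresql://user:password@localhost:5432/mydb")

with engine.begin() as conn:
    # Index the column used in "WHERE row_number > ..." so each batch
    # becomes an index range scan instead of a full table scan.
    conn.execute(text(
        "CREATE INDEX IF NOT EXISTS row_number_idx "
        "ON sale_transactions (row_number)"
    ))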

Since the bottleneck in the operation was the SQL WHERE query, the solution was to index the column the WHERE statement was operating on (i.e. the zipcode column).

In MySQL, the command for this was:

ALTER TABLE `db_name`.`table` 
ADD INDEX `zipcode_index` USING BTREE (`zipcode` ASC);

After making this change, the loop execution speed increased roughly 8-fold.

I found this article useful because it encouraged profiling queries with EXPLAIN and looking for column-indexing opportunities when the key and possible_keys values are NULL.
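As a quick way to check whether the new index is actually being used, here is a hedged sketch assuming MySQL (as in the ALTER TABLE above); the connection URL and the '10001' zipcode value are placeholders:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection URL; adjust for your own database.
engine = create_engine("mysql+pymysql://user:password@localhost/db_name")

plan = pd.read_sql(
    "EXPLAIN SELECT * FROM sale_transactions WHERE zipcode = '10001'",
    engine,
)
# Before indexing, possible_keys and key come back NULL and MySQL scans the
# whole table; afterwards, key should show zipcode_index.
print(plan[["table", "type", "possible_keys", "key", "rows"]])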
