
Improving MySQLdb load data infile performance

I have a table that's roughly defined as follows in InnoDB:

create table `my_table` (
  `time` int(10) unsigned not null,
  `key1` int(10) unsigned not null,
  `key3` char(2) not null,
  `key2` char(3) not null,
  `value1` float default null,
  `value2` float default null,
  primary key (`key1`, `key2`, `key3`, `time`),
  key (`key3`, `key2`, `key1`, `time`)
) engine=InnoDB default character set ascii
partition by range(time) (
  partition start        values less than (0),
  partition from20180101 values less than (unix_timestamp('2018-02-01')),
  partition from20180201 values less than (unix_timestamp('2018-03-01')),
  ...,
  partition start        values less than (0),
)

Yes, the column order doesn't match the key order.

In Python I'm populating a DataFrame with 500,000 rows (this is probably not the most efficient way to do this, but serves as a sample for what the data may look like):

import random
import pandas as pd
key2_values = ["aaa", "bbb", ..., "ttt"]  # 20 distinct values
key3_values = ["aa", "ab", "ac", ..., "az", "bb", "bc", ..., "by"]  # 50 distinct values
df = pd.DataFrame([], columns=["key1", "key2", "key3", "value2", "value1"])
idx = 0
for x in range(0, 500):
    for y in range(0, 20):
        for z in range(0, 50):
            df.loc[idx] = [x, key2_values[y], key3_values[z], random.random(), random.random()]
            idx += 1
df.set_index(["key1", "key2", "key3"], inplace=True)

(In reality this DataFrame is populated from several API calls and a lot of math, but the end result is the same: a huge DataFrame with ~500,000 rows and keys matching the InnoDB table)
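As an aside, the row-by-row .loc writes above are a known pandas bottleneck. If building the frame ever matters, a vectorized sketch of the same construction (reusing the key2_values and key3_values lists above, and assuming numpy is available) looks like this:

import numpy as np
import pandas as pd

# Build the full 500 x 20 x 50 key cross-product in one shot instead of
# 500,000 individual .loc assignments.
index = pd.MultiIndex.from_product(
    [range(500), key2_values, key3_values],
    names=["key1", "key2", "key3"],
)
df = pd.DataFrame(
    {
        "value2": np.random.random(len(index)),
        "value1": np.random.random(len(index)),
    },
    index=index,
)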

To import this DataFrame into the table, I'm currently doing the following:

import time
import MySQLdb
conn = MySQLdb.connect(local_infile=1, **connection_params)
cur = conn.cursor()
# Disable data integrity checks -- I know the data is good
cur.execute("SET foreign_key_checks=0;")
cur.execute("SET unique_checks=0;")
# Append current time to the DataFrame
df["time"] = time.time()
df.set_index(["time"], append=True, inplace=True)
# Sort data in primary key order
df.sort_index(inplace=True)
# Dump the data to a CSV
with open("dump.csv", "w") as csv:
    df.to_csv(csv)
# Load the data
cur.execute(
    """
        load data local infile 'dump.csv'
        into table `my_table`
        fields terminated by ','
        enclosed by '"'
        lines terminated by '\\n'
        ignore 1 lines
        (`key1`, `key2`, `key3`, `time`, `value2`, `value1`)
    """
)
# Clean up
cur.execute("SET foreign_key_checks=1;")
cur.execute("SET unique_checks=1;")
conn.commit()
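One cheap sanity check I'd suggest adding (my suggestion, not part of the original flow): the DB-API rowcount attribute on the cursor holds the affected-row count of the most recent execute(), so reading it immediately after the LOAD DATA call -- before the SET cleanup statements overwrite it -- confirms nothing was silently dropped while the integrity checks were off:

# Read immediately after the LOAD DATA execute(); any later execute(),
# including the SET cleanup statements, resets cur.rowcount.
loaded = cur.rowcount
assert loaded == len(df), "expected %d rows, loaded %d" % (len(df), loaded)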

All in all, the performance isn't too bad: I can import 500,000 rows in about 2 minutes. If possible, though, I'd like to get it faster.

Are there any tricks I'm missing or any changes I could make to get this down to 30-45 seconds?

Some notes:

  • I don't know if reordering the columns in the DataFrame will affect performance. Currently the order of columns in the DataFrame does not match the database
  • I don't know if changing the order of the columns in the database to match the order of the primary key will affect performance (currently "time" comes first, even though it's the fourth key of the index)
  • Altering the database config could be difficult, as I don't have direct access to the database server. I'm stuck with whatever hardware and configuration options are already present. Any performance improvements must come from my Python code
  • I can change the table definition (including the partitioning), but I would like to avoid this if possible, as there is already a large amount of historic data and copying it to another table would take a long time. Losing this data is an option, but one I'd rather avoid
  • I cannot use set sql_log_bin=0; because I do not have the SUPER privilege on the database

I've made three changes. I didn't stop to measure performance between each change, so I can't be 100% certain of the exact impact of each, but I'm reasonably sure which one had the biggest impact.

Change 1 (pretty sure this had the biggest impact) -- Modified primary key

Looking at how my script operates, you can see that all 500k rows I'm bulk inserting have the exact same value for time:

# Append current time to the DataFrame
df["time"] = time.time

Making time the left-most column of the primary key meant that all of the rows I was inserting would be clustered together, rather than scattered across the table.

Of course, the problem with this is that it makes the index useless for my most common query: returning all "times" for a given key1, key2, and key3 combination (e.g. SELECT * FROM my_table WHERE key1 = ... AND key2 = ... AND key3 = ...)

To fix this, I had to add another key:

PRIMARY KEY (`time`, `key1`, `key2`, `key3`),
KEY (`key1`, `key2`, `key3`)
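For anyone repeating this, a sketch of the one-time migration (hypothetical: I invented the index name lookup, and key3 is assumed to be MySQL's auto-generated name for the old unnamed secondary index -- verify both with SHOW CREATE TABLE first):

# One-time migration to the new key layout. Rebuilding the clustered index
# copies the whole table, so expect this to be slow on existing data.
cur.execute(
    """
        alter table `my_table`
            drop primary key,
            drop key `key3`,
            add primary key (`time`, `key1`, `key2`, `key3`),
            add key `lookup` (`key1`, `key2`, `key3`)
    """
)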

Change 2 (may have had an impact) -- Modified column order

I adjusted the table so that the order of the columns matched the order of the primary key (time, key1, key2, key3).

I don't know if this had an effect, but it might have.
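A sketch of how that reordering might be done (MODIFY must restate each column's full definition; FIRST/AFTER controls the new placement, resolved left to right):

# Reorder the columns to match the new primary key in a single rebuild.
cur.execute(
    """
        alter table `my_table`
            modify `time` int(10) unsigned not null first,
            modify `key1` int(10) unsigned not null after `time`,
            modify `key2` char(3) not null after `key1`,
            modify `key3` char(2) not null after `key2`
    """
)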

Change 3 (may have had an impact) -- Adjusted the order of columns in the CSV

I ran the following on my DataFrame:

df = df.reindex(columns=["value1", "value2"])

This reordered the value columns to match the order they appear in the database. Between this and change 2, the rows could be imported exactly as they were, without needing to swap the order of any columns. I don't know if that has any impact on import performance.
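One step that seems implied but isn't shown: the index levels need the same treatment, since to_csv writes the index levels first and they are still in (key1, key2, key3, time) order. A sketch of my inferred version, using pandas' reorder_levels:

# Put the index levels in the new primary-key order so rows arrive
# pre-clustered, then line up the value columns with the table.
df = df.reorder_levels(["time", "key1", "key2", "key3"]).sort_index()
df = df.reindex(columns=["value1", "value2"])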

Result

With these three changes my import is down from 2 minutes to 9 seconds! That's absolutely incredible.

I was worried about adding the extra key to the table since additional indexes means longer write times and more disk space, but the effect was almost negligible -- especially compared to the massive savings I got from clustering my key correctly.
