I have a table that's roughly defined as follows in InnoDB:
create table `my_table` (
`time` int(10) unsigned not null,
`key1` int(10) unsigned not null,
`key3` char(3) not null,
`key2` char(2) not null,
`value1` float default null,
`value2` float default null,
primary key (`key1`, `key2`, `key3`, `time`),
key (`key3`, `key2`, `key1`, `time`)
) engine=InnoDB default character set ascii
partition by range(time) (
partition start values less than (0),
partition from20180101 values less than (unix_timestamp('2018-02-01')),
partition from20180201 values less than (unix_timestamp('2018-03-01')),
...,
partition future values less than MAXVALUE
)
Yes, the column order doesn't match the key order.
In Python I'm populating a DataFrame with 500,000 rows (this is probably not the most efficient way to do this, but serves as a sample for what the data may look like):
import random
import pandas as pd
key2_values = ["aa", "ab", ..., "at"] # 20 distinct two-character values (matching char(2))
key3_values = ["aaa", "aab", "aac", ..., "abx"] # 50 distinct three-character values (matching char(3))
df = pd.DataFrame([], columns=["key1", "key2", "key3", "value2", "value1"])
idx = 0
for x in range(0, 500):
    for y in range(0, 20):
        for z in range(0, 50):
            df.loc[idx] = [x, key2_values[y], key3_values[z], random.random(), random.random()]
            idx += 1
df.set_index(["key1", "key2", "key3"], inplace=True)
(In reality this DataFrame is populated from several API calls and a lot of math, but the end result is the same: a huge DataFrame with ~500,000 rows and keys matching the InnoDB table)
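As an aside, the nested `df.loc[idx] = ...` loop is itself a major bottleneck: each single-row assignment grows the DataFrame one row at a time. A much faster equivalent builds all the rows first and constructs the DataFrame in one call. This is only a sketch; the key lists here are made-up stand-ins for the real values:

```python
import itertools
import random

import pandas as pd

# Stand-in key values (any 20 two-char / 50 three-char strings would do)
key2_values = [chr(ord("a") + i) * 2 for i in range(20)]  # "aa", "bb", ... (20 values)
key3_values = [f"k{i:02d}" for i in range(50)]            # "k00", "k01", ... (50 values)

# Build all 500,000 rows up front, then construct the DataFrame in one shot
rows = [
    (x, k2, k3, random.random(), random.random())
    for x, k2, k3 in itertools.product(range(500), key2_values, key3_values)
]
df = pd.DataFrame(rows, columns=["key1", "key2", "key3", "value2", "value1"])
df.set_index(["key1", "key2", "key3"], inplace=True)
```

This replaces ~500,000 individual `.loc` insertions with a single bulk construction, which is typically orders of magnitude faster.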
To import this DataFrame into the table, I'm currently doing the following:
import time
import MySQLdb
conn = MySQLdb.connect(local_infile=1, **connection_params)
cur = conn.cursor()
# Disable data integrity checks -- I know the data is good
cur.execute("SET foreign_key_checks=0;")
cur.execute("SET unique_checks=0;")
# Append current time to the DataFrame
df["time"] = time.time()
df.set_index(["time"], append=True, inplace=True)
# Sort data in primary key order
df.sort_index(inplace=True)
# Dump the data to a CSV
with open("dump.csv", "w") as csv:
    df.to_csv(csv)
# Load the data
cur.execute(
    """
    load data local infile 'dump.csv'
    into table `my_table`
    fields terminated by ','
    enclosed by '"'
    lines terminated by '\n'
    ignore 1 lines
    (`key1`, `key2`, `key3`, `time`, `value2`, `value1`)
    """
)
# Clean up
cur.execute("SET foreign_key_checks=1;")
cur.execute("SET unique_checks=1;")
conn.commit()
All in all, the performance isn't too bad: I can import 500,000 rows in about 2 minutes. If possible, I'd like to make this faster.
Are there any tricks I'm missing, or any changes I could make to get this down to 30-45 seconds?
Some notes:
I can't use set sql_log_bin=0; because I do not have the SUPER privilege on the database.

I've made three changes. I didn't stop to measure performance between each change, so I can't be 100% certain of the exact impact of each one, but I can be reasonably sure which change had the bigger impact.
Looking at how my script operates, you can see that all 500k rows I'm bulk inserting have the exact same value for time:
# Append current time to the DataFrame
df["time"] = time.time()
Making time the left-most column of the primary key meant that all of the rows I was inserting would be clustered together, rather than being split across the table.
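The clustering effect can be illustrated with plain sorted tuples (a toy sketch, not the real table): with time as the left-most key column, every row of a new batch sorts after all existing rows, so the insert becomes one contiguous append rather than scattered page splits.

```python
# Existing rows were inserted at time=100; the new batch all shares time=200
keys = [(k, t) for k in range(3) for t in (100, 200)]

key_first = sorted(keys)                      # PK order (key1, time)
time_first = sorted((t, k) for k, t in keys)  # PK order (time, key1)

# With the key column first, the new rows interleave with the existing ones...
assert key_first == [(0, 100), (0, 200), (1, 100), (1, 200), (2, 100), (2, 200)]
# ...but with time first, the whole new batch lands contiguously at the end
assert time_first == [(100, 0), (100, 1), (100, 2), (200, 0), (200, 1), (200, 2)]
```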
Of course, the problem with this is that it makes the index useless for my most common query: returning all "times" for a given key1, key2, and key3 combination (e.g. SELECT * FROM my_table WHERE key1 = ... AND key2 = ... AND key3 = ...).
To fix this, I had to add another key:
PRIMARY KEY (`time`, `key1`, `key2`, `key3`),
KEY (`key1`, `key2`, `key3`)
I adjusted the table so that the order of the columns matched the order of the primary key (time, key1, key2, key3). I don't know if this had an effect, but it might have.
I ran the following on my DataFrame:
df = df.reindex(columns=["value1", "value2"])
This sorted the columns to match the order they appear in the database. Between this and change 2, the rows could be imported exactly as they were, without needing to swap the order of any columns. I don't know whether that has any impact on import performance.
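A minimal demonstration of the column reordering (note that DataFrame.reindex returns a new frame rather than modifying in place, so the result must be assigned back):

```python
import pandas as pd

df = pd.DataFrame({"value2": [0.1, 0.2], "value1": [0.3, 0.4]})
df = df.reindex(columns=["value1", "value2"])  # reorder to match the table definition
print(list(df.columns))  # → ['value1', 'value2']
```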
With these three changes my import is down from 2 minutes to 9 seconds. That's absolutely incredible!
I was worried about adding the extra key to the table since additional indexes means longer write times and more disk space, but the effect was almost negligible -- especially compared to the massive savings I got from clustering my key correctly.