
Improving MySQLdb load data infile performance

I have a table that's roughly defined as follows in InnoDB:

create table `my_table` (
  `time` int(10) unsigned not null,
  `key1` int(10) unsigned not null,
  `key3` char(3) not null,
  `key2` char(2) not null,
  `value1` float default null,
  `value2` float default null,
  primary key (`key1`, `key2`, `key3`, `time`),
  key (`key3`, `key2`, `key1`, `time`)
) engine=InnoDB default character set ascii
partition by range(time) (
  partition start        values less than (0),
  partition from20180101 values less than (unix_timestamp('2018-02-01')),
  partition from20180201 values less than (unix_timestamp('2018-03-01')),
  ...,
  partition future       values less than maxvalue
)

Yes, the column order doesn't match the key order.

In Python I'm populating a DataFrame with 500,000 rows (this is probably not the most efficient way to do this, but it serves as a sample of what the data may look like):

import random
import pandas as pd
key2_values = ["aaa", "bbb", ..., "ttt"]  # 20 distinct values
key3_values = ["aa", "ab", "ac", ..., "az", "bb", "bc", ..., "by"]  # 50 distinct values
df = pd.DataFrame([], columns=["key1", "key2", "key3", "value2", "value1"])
idx = 0
for x in range(0, 500):
    for y in range(0, 20):
        for z in range(0, 50):
            df.loc[idx] = [x, key2_values[y], key3_values[z], random.random(), random.random()]
            idx += 1
df.set_index(["key1", "key2", "key3"], inplace=True)
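
Since the row-by-row .loc assignment above is only illustrative (and slow), a vectorized construction of the same sample data is sketched below; it assumes the key2_values and key3_values lists defined above:

import itertools
import random
import pandas as pd

# Build all 500,000 (key1, key2, key3) combinations in one pass instead of
# assigning rows one at a time with .loc
rows = [
    (x, key2_values[y], key3_values[z], random.random(), random.random())
    for x, y, z in itertools.product(range(500), range(20), range(50))
]
df = pd.DataFrame(rows, columns=["key1", "key2", "key3", "value2", "value1"])
df.set_index(["key1", "key2", "key3"], inplace=True)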

(In reality this DataFrame is populated from several API calls and a lot of math, but the end result is the same: a huge DataFrame with ~500,000 rows and keys matching the InnoDB table)

To import this DataFrame into the table, I'm currently doing the following:

import time
import MySQLdb
conn = MySQLdb.connect(local_infile=1, **connection_params)
cur = conn.cursor()
# Disable data integrity checks -- I know the data is good
cur.execute("SET foreign_key_checks=0;")
cur.execute("SET unique_checks=0;")
# Append current time to the DataFrame
df["time"] = time.time()
df.set_index(["time"], append=True, inplace=True)
# Sort data in primary key order
df.sort_index(inplace=True)
# Dump the data to a CSV
with open("dump.csv", "w") as csv:
    df.to_csv(csv)
# Load the data
cur.execute(
    """
        load data local infile 'dump.csv'
        into table `my_table`
        fields terminated by ','
        enclosed by '"'
        lines terminated by '\n'
        ignore 1 lines
        (`key1`, `key2`, `key3`, `time`, `value2`, `value1`)
    """
)
# Clean up
cur.execute("SET foreign_key_checks=1;")
cur.execute("SET unique_checks=1;")
conn.commit()

All in all, the performance of this isn't too bad. I can import 500,000 rows in about 2 minutes. If possible, I want to get this faster.

Are there any tricks I'm missing or any changes I could make to get this down to 30-45 seconds?

Some notes:

  • I don't know if reordering the columns in the DataFrame will affect performance. Currently the order of columns in the DataFrame does not match the database.
  • I don't know if changing the order of the columns in the database to match the order of the primary key will affect performance (currently "time" comes first, even though it's the fourth key of the index).
  • Altering the database config could be difficult, as I don't have direct access to the database server. I'm stuck with whatever hardware and configuration options are already present. Any performance improvements must come from my Python code.
  • I can change the table definition (including changing the partitioning); however, I would like to avoid this if possible, as there is already a large amount of historic data and copying it to another table would take a long time. Losing this data is an option, but one I'd rather avoid.
  • I cannot use set sql_log_bin=0; because I do not have the SUPER privilege on the database.

I've made three changes, and I didn't stop to measure performance between each change, so I can't be 100% certain of the exact impact of each one; however, I can be reasonably sure which had the biggest impact.

Change 1 (pretty sure this had the biggest impact) -- Modified primary key

Looking at how my script operates, you can see that all 500k rows I'm bulk inserting have the exact same value for time:

# Append current time to the DataFrame
df["time"] = time.time

Making time the left-most column of the primary key meant that all of the rows I was inserting would be clustered together, rather than having to be split across the table.

Of course, the problem with this is that it makes the index useless for my most common query: returning all "times" for a given key1, key2, and key3 combination (e.g. SELECT * FROM my_table WHERE key1 = ... AND key2 = ... AND key3 = ...).

To fix this, I had to add another key:

PRIMARY KEY (`time`, `key1`, `key2`, `key3`),
KEY (`key1`, `key2`, `key3`)
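
A sketch of how this change could be applied through the same MySQLdb connection is below; the name of the old secondary index (key3) is an assumption, since MySQL names an unnamed key after its first column, and the real name should be checked with SHOW CREATE TABLE first:

# Sketch only: this rebuilds the clustered index, so it can take a long time on a
# table that already holds a lot of data. The index name `key3` is assumed.
cur.execute(
    """
        alter table `my_table`
        drop primary key,
        add primary key (`time`, `key1`, `key2`, `key3`),
        drop key `key3`,
        add key (`key1`, `key2`, `key3`)
    """
)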

Change 2 (may have had an impact) -- Modified column order

I adjusted the table so that the order of the columns matched the order of the primary key (time, key1, key2, key3).

I don't know if this had an effect, but it might have.
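
For reference, a sketch of how the column order could be changed in place is below; MODIFY COLUMN requires repeating the full column definition, so the char(2) not null used here is an assumption that should be checked against the real table:

# Sketch: move `key2` so the column order becomes time, key1, key2, key3, value1, value2.
# The column definition must match the existing one exactly (char(2) not null is assumed).
cur.execute("alter table `my_table` modify column `key2` char(2) not null after `key1`")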

Change 3 (may have had an impact) -- Adjusted the order of columns in the CSV

I ran the following on my DataFrame:

# reindex returns a new DataFrame; it has no inplace option
df = df.reindex(columns=["value1", "value2"])

This sorted the columns to match the order in which they appear in the database. Between this and change 2, the rows could be imported exactly as they were, without needing to swap the order of columns. I don't know whether that has any impact on import performance.
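
Putting the three changes together, the revised dump-and-load might look roughly like the sketch below. It assumes the index levels are reordered so that time comes first (matching the new primary key), and that connection_params and df are the same as in the question:

import time
import MySQLdb

conn = MySQLdb.connect(local_infile=1, **connection_params)
cur = conn.cursor()
cur.execute("SET foreign_key_checks=0;")
cur.execute("SET unique_checks=0;")

# Append the load timestamp, move it to the front of the index, and sort so the
# rows arrive in the new primary key order (time, key1, key2, key3)
df["time"] = time.time()
df.set_index(["time"], append=True, inplace=True)
df = df.reorder_levels(["time", "key1", "key2", "key3"])
df = df.reindex(columns=["value1", "value2"])  # value columns in table order
df.sort_index(inplace=True)

with open("dump.csv", "w") as csv_file:
    df.to_csv(csv_file)

# The CSV columns are now: time, key1, key2, key3, value1, value2
cur.execute(
    """
        load data local infile 'dump.csv'
        into table `my_table`
        fields terminated by ','
        enclosed by '"'
        lines terminated by '\n'
        ignore 1 lines
        (`time`, `key1`, `key2`, `key3`, `value1`, `value2`)
    """
)

cur.execute("SET foreign_key_checks=1;")
cur.execute("SET unique_checks=1;")
conn.commit()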

Result

With these three changes my import is down from 2 minutes to 9 seconds! That's absolutely incredible.

I was worried about adding the extra key to the table, since additional indexes mean longer write times and more disk space, but the effect was almost negligible -- especially compared to the massive savings I got from clustering my key correctly.
