Improving MySQLdb LOAD DATA INFILE performance

Question

I have a table in InnoDB that is defined roughly as follows:

create table `my_table` (
  `time` int(10) unsigned not null,
  `key1` int(10) unsigned not null,
  `key3` char(3) not null,
  `key2` char(2) not null,
  `value1` float default null,
  `value2` float default null,
  primary key (`key1`, `key2`, `key3`, `time`),
  key (`key3`, `key2`, `key1`, `time`)
) engine=InnoDB default character set ascii
partition by range(time) (
  partition start        values less than (0),
  partition from20180101 values less than (unix_timestamp('2018-02-01')),
  partition from20180201 values less than (unix_timestamp('2018-03-01')),
  ...,
  partition future       values less than MAXVALUE
)

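Yes, the column order doesn't match the key order.

In Python, I populate a DataFrame with 500,000 rows (this is probably not the most efficient way to do it, but it serves as a sample of what the data looks like):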

import random
import pandas as pd
key2_values = ["aaa", "bbb", ..., "ttt"]  # 20 distinct values
key3_values = ["aa", "ab", "ac", ..., "az", "bb", "bc", ..., "by"]  # 50 distinct values
df = pd.DataFrame([], columns=["key1", "key2", "key3", "value2", "value1"])
idx = 0
for x in range(0, 500):
    for y in range(0, 20):
        for z in range(0, 50):
            df.loc[idx] = [x, key2_values[y], key3_values[z], random.random(), random.random()]
            idx += 1
df.set_index(["key1", "key2", "key3"], inplace=True)

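(In reality this DataFrame is populated from several API calls and a lot of math, but the end result is the same: a huge DataFrame with about 500,000 rows and keys that match the InnoDB table.)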
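As a side note on the sample data, the nested loops can be replaced by a single pass over the key combinations. The following is only a sketch: build_rows is a hypothetical helper, and the key values are placeholders, because the real lists are elided in the question.

```python
import random
from itertools import product


def build_rows(n_key1, key2_values, key3_values, rng=random):
    """Return one (key1, key2, key3, value2, value1) tuple per key combination."""
    return [
        (x, k2, k3, rng.random(), rng.random())
        for x, k2, k3 in product(range(n_key1), key2_values, key3_values)
    ]


# Hypothetical placeholder key values; the real lists are elided in the question.
key2_values = [f"k{i:02d}" for i in range(20)]  # 20 distinct 3-character values
key3_values = [f"{i:02d}" for i in range(50)]   # 50 distinct 2-character values

# Same iteration order as the nested loops: key1 outermost, key3 innermost.
rows = build_rows(500, key2_values, key3_values)
```

The resulting list can be handed to pd.DataFrame(rows, columns=["key1", "key2", "key3", "value2", "value1"]) in one call, which avoids growing the DataFrame one row at a time.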

To import this DataFrame into the table, I am currently doing the following:

import time
import MySQLdb
conn = MySQLdb.connect(local_infile=1, **connection_params)
cur = conn.cursor()
# Disable data integrity checks -- I know the data is good
cur.execute("SET foreign_key_checks=0;")
cur.execute("SET unique_checks=0;")
# Append current time to the DataFrame
df["time"] = time.time()
df.set_index(["time"], append=True, inplace=True)
# Sort data in primary key order
df.sort_index(inplace=True)
# Dump the data to a CSV
with open("dump.csv", "w") as csv:
    df.to_csv(csv)
# Load the data
cur.execute(
    """
        load data local infile 'dump.csv'
        into table `my_table`
        fields terminated by ','
        enclosed by '"'
        lines terminated by '\n'
        ignore 1 lines
        (`key1`, `key2`, `key3`, `time`, `value2`, `value1`)
    """
)
# Clean up
cur.execute("SET foreign_key_checks=1;")
cur.execute("SET unique_checks=1;")
conn.commit()

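Overall this works fine. I can import 500,000 rows in 2 minutes. I'd like to make it faster if possible.

Are there any tricks I'm missing, or any changes I could make, to get this down to 30-45 seconds?

Some notes:

I don't know whether reordering the columns in the DataFrame would affect performance. Currently the column order in the DataFrame does not match the database.
I don't know whether changing the column order in the database to match the primary key order would affect performance (currently "time" comes first, even though it is the fourth column of the index).
Changing the database configuration may be difficult, because I don't have direct access to the database server. I'm stuck with whatever hardware and configuration options already exist. Any performance improvement has to come from my Python code.
I can change the table definition (including the partitioning), but I'd like to avoid that if possible, because there is already a large amount of historical data and copying it to another table would take a long time. Losing this data is an option, but I'd rather avoid it.
I can't use set sql_log_bin=0; because I don't have the SUPER privilege on the database.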

Answer 1

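I made three changes, and I did not stop to measure the effect between each one, so I can't be 100% sure of the exact impact of each change, but I am fairly sure which one had the biggest impact.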
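For anyone repeating this experiment, a small timing helper would make it possible to measure each change separately. This is a minimal sketch and not part of my original script; timed and the results dictionary are hypothetical names.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


results = {}
with timed("example step", results):
    # Stand-in workload; in practice this would wrap the CSV dump
    # or the cur.execute(...) call that runs LOAD DATA.
    total = sum(range(100_000))
```

Wrapping the dump and the load in separate timed blocks, before and after each schema change, would show which change accounts for the savings.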

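Change 1 (definitely had the biggest impact) - modified primary key

Looking at how my script runs, you can see that all 500k rows in a bulk insert have exactly the same time value: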

# Append current time to the DataFrame
df["time"] = time.time()

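InnoDB stores rows in primary key order, so making time the leftmost column of the primary key means that all the rows I insert are clustered together instead of being scattered across the table.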
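The effect can be illustrated with a toy model in plain Python. It only simulates the sort order of the keys and is not a model of InnoDB internals; all values and helper names are made up for illustration.

```python
from itertools import product


def batch_positions(existing, batch, key):
    """Positions the batch rows occupy after merging with existing rows in key order."""
    merged = sorted(existing + batch, key=key)
    batch_rows = set(batch)
    return [i for i, row in enumerate(merged) if row in batch_rows]


def is_contiguous(positions):
    """True when the positions form one unbroken run."""
    return positions == list(range(positions[0], positions[0] + len(positions)))


# Rows are (time, key1, key2, key3); all values here are made up.
keys = list(product(range(3), ["aaa", "bbb"], ["aa", "ab"]))
existing = [(t, *k) for t in (100, 200) for k in keys]  # two earlier imports
batch = [(300, *k) for k in keys]                       # one new import


def old_pk(row):
    """Original primary key order: (key1, key2, key3, time)."""
    return (row[1], row[2], row[3], row[0])


def new_pk(row):
    """Modified primary key order: (time, key1, key2, key3)."""
    return row


old_positions = batch_positions(existing, batch, old_pk)
new_positions = batch_positions(existing, batch, new_pk)
```

With the original key order the new rows land at every third position, interleaved with the historical rows for each key combination. With time first, they form a single run at the end of the table.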

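The problem, of course, is that this makes the index useless for my most common query: returning every "time" for a given combination of key1, key2 and key3 (for example: SELECT * FROM my_table WHERE key1 = ... AND key2 = ... AND key3 = ...).

To fix this, I had to add another key: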

PRIMARY KEY (`time`, `key1`, `key2`, `key3`),
KEY (`key1`, `key2`, `key3`)
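The reason the extra key is needed can be sketched with a simplified model of the leftmost-prefix rule: a composite index only helps with equality filters on its leading columns. The usable_prefix function below is a hypothetical illustration, not how the MySQL optimizer is implemented.

```python
def usable_prefix(index_columns, equality_columns):
    """Count the leading index columns that are constrained by equality filters."""
    wanted = set(equality_columns)
    count = 0
    for column in index_columns:
        if column not in wanted:
            break  # the prefix ends at the first unconstrained column
        count += 1
    return count


query_columns = ["key1", "key2", "key3"]  # WHERE key1 = ... AND key2 = ... AND key3 = ...

new_primary = ("time", "key1", "key2", "key3")
secondary = ("key1", "key2", "key3")

primary_prefix = usable_prefix(new_primary, query_columns)
secondary_prefix = usable_prefix(secondary, query_columns)
```

Under this model the new primary key offers no usable prefix for the query, because time is not filtered, while the secondary key covers all three filtered columns.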

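Change 2 (may have had an impact) - modified column order

I altered the table so that the column order matches the primary key order (time, key1, key2, key3).

I don't know whether this had any effect, but it may have.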

Change 3 (may have had an impact) - adjusted the column order in the CSV

I ran the following on the DataFrame:

# reindex has no inplace option, so reassign the result
df = df.reindex(columns=["value1", "value2"])

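This ordered the columns to match the order in which they appear in the database. Between this change and change 2, rows can be imported as-is, without the column order having to be swapped. I don't know whether this has any effect on import performance.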
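One way to keep the CSV and the LOAD DATA column list from drifting apart is to derive both from a single tuple of column names. This is a sketch using only the standard library rather than pandas; TABLE_COLUMNS, write_rows_in_table_order and load_statement are hypothetical names, and the column order assumes the table after change 2.

```python
import csv
import io

# Assumed table column order after change 2; the first four form the primary key.
TABLE_COLUMNS = ("time", "key1", "key2", "key3", "value1", "value2")


def write_rows_in_table_order(rows, out):
    """Write dict rows as CSV, sorted by primary key, with columns in table order."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(TABLE_COLUMNS)
    ordered = sorted(rows, key=lambda r: tuple(r[c] for c in TABLE_COLUMNS[:4]))
    for row in ordered:
        writer.writerow([row[c] for c in TABLE_COLUMNS])


def load_statement(path, table="my_table"):
    """Build a LOAD DATA statement whose column list matches the CSV header."""
    columns = ", ".join(f"`{c}`" for c in TABLE_COLUMNS)
    return (
        f"load data local infile '{path}' "
        f"into table `{table}` "
        "fields terminated by ',' enclosed by '\"' "
        "lines terminated by '\\n' ignore 1 lines "
        f"({columns})"
    )


# Made-up sample rows, deliberately out of primary key order.
sample = [
    {"time": 5, "key1": 2, "key2": "aaa", "key3": "aa", "value1": 0.5, "value2": 0.25},
    {"time": 5, "key1": 1, "key2": "aaa", "key3": "aa", "value1": 0.75, "value2": 0.125},
]
buffer = io.StringIO()
write_rows_in_table_order(sample, buffer)
statement = load_statement("dump.csv")
```

The statement string can then be passed to cur.execute() in place of the hand-written one, assuming the file path is trusted.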

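Result

With these three changes, my import time dropped from 2 minutes to 9 seconds! Absolutely incredible.

I was worried about adding an extra key to the table, because an extra index means longer write times and more disk space, but the effect has been almost negligible, especially compared with the savings from clustering the keys correctly.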

Answer 1: accepted, 2018-03-16 21:53:13
