
Creating tables in MySQL based on the names of the columns in another table

I have a table with ~133M rows and 16 columns. I want to create 14 tables in another database on the same server, one for each of columns 3-16 (columns 1 and 2 are `id` and `timestamp`, which will also be in the final 14 tables but won't have tables of their own), where each table has the name of the original column. Is this possible to do exclusively with an SQL script? It seems logical to me that this would be the preferred, and fastest, way to do it.

Currently, I have a Python script that "works" by parsing the CSV dump of the original table (testing with 50 rows), creating the new tables, and inserting the associated values, but it is very slow (I estimated almost 1 year to transfer all 133M rows, which is obviously not acceptable). This is my first time using SQL in any capacity, and I'm certain my code can be sped up, but I'm not sure how because of my unfamiliarity with SQL. The big SQL string command in the middle was copied from some other code in our codebase. I've tried using transactions, as seen below, but it didn't seem to have any significant effect on the speed.

import re
import mysql.connector
import time

# option flags
debug = False  # prints out information during runtime
timing = True  # times the execution time of the program

# save start time for timing. won't be used later if timing is false
start_time = time.time()

# open file for reading
path = 'test_vaisala_sql.csv'
file = open(path, 'r')

# read in column values
column_str = file.readline().strip()
columns = re.split(',vaisala_|,', column_str)  # parse columns with regex to remove commas and the vaisala_ prefix
if debug:
    print(columns)

# open connection to MySQL server
cnx = mysql.connector.connect(user='root', password='<redacted>',
                              host='127.0.0.1',
                              database='measurements')
cursor = cnx.cursor()

# create the table in the MySQL database if it doesn't already exist
for i in range(2, len(columns)):
    table_name = 'vaisala2_' + columns[i]
    sql_command = "CREATE TABLE IF NOT EXISTS " + \
                  table_name + "(`id` BIGINT(20) NOT NULL AUTO_INCREMENT, " \
                               "`timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, " \
                               "`milliseconds` BIGINT(20) NOT NULL DEFAULT '0', " \
                               "`value` varchar(255) DEFAULT NULL, " \
                               "PRIMARY KEY (`id`), " \
                               "UNIQUE KEY `milliseconds` (`milliseconds`)" \
                               "COMMENT 'Eliminates duplicate millisecond values', " \
                               "KEY `timestamp` (`timestamp`)) " \
                               "ENGINE=InnoDB DEFAULT CHARSET=utf8;"

    if debug:
        print("Creating table", table_name, "in database")

    cursor.execute(sql_command)

# read in rest of lines in CSV file
for line in file.readlines():
    cursor.execute("START TRANSACTION;")
    line = line.strip()
    values = re.split(',"|",|,', line)  # regex split along commas, or commas and quotes
    if debug:
        print(values)

    # iterate over each data column. Starts at 2 to skip `id` and `timestamp`
    for i in range(2, len(columns)):
        table_name = "vaisala2_" + columns[i]
        timestamp = values[1]

        # translate timestamp back to epoch time
        try:
            pattern = '%Y-%m-%d %H:%M:%S'
            epoch = int(time.mktime(time.strptime(timestamp, pattern)))
            milliseconds = epoch * 1000  # convert seconds to ms
        except ValueError:  # errors default to 0
            milliseconds = 0

        value = values[i]

        # generate SQL command to insert data into destination table
        sql_command = "INSERT IGNORE INTO {} VALUES (NULL,'{}',{},'{}');".format(table_name, timestamp,
                                                                                 milliseconds, value)
        if debug:
            print(sql_command)

        cursor.execute(sql_command)
cnx.commit()  # commits changes in destination MySQL server

# print total execution time
if timing:
    print("Completed in %s seconds" % (time.time() - start_time))

This doesn't need to be incredibly optimized; it's perfectly acceptable if the machine has to run for a few days in order to do it. But 1 year is far too long.

You can create a table from a SELECT like:

CREATE TABLE <other database name>.<column name>
             AS
             SELECT <column name>
                    FROM <original database name>.<table name>;

(Replace the <...> with your actual object names, or extend it with other columns or a WHERE clause or ...)

That will also insert the data from the query into the new table. And it's probably the fastest way.
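For example, a minimal sketch under assumed names (a source table `measurements.vaisala` with a data column `temperature`, and a destination database `measurements2`; substitute your real object names):

CREATE TABLE measurements2.vaisala2_temperature AS
SELECT `id`, `timestamp`, `temperature` AS `value`
FROM measurements.vaisala;

One caveat: CREATE TABLE ... AS SELECT copies the column data and types but not indexes, the PRIMARY KEY, or AUTO_INCREMENT attributes. If you need the keys from your CREATE TABLE statement, create the table first and then fill it with INSERT ... SELECT instead.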

You could use dynamic SQL and information from the catalog (namely information_schema.columns) to generate the CREATE statements, or write them by hand, which is annoying but acceptable for 14 columns I guess.
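A sketch of the catalog approach, again with the assumed names measurements.vaisala and measurements2 from above; this query generates one CREATE TABLE ... AS SELECT statement per data column, which you can then copy out and execute:

SELECT CONCAT('CREATE TABLE measurements2.vaisala2_', column_name,
              ' AS SELECT `id`, `timestamp`, `', column_name,
              '` AS `value` FROM measurements.vaisala;') AS ddl
FROM information_schema.columns
WHERE table_schema = 'measurements'
  AND table_name = 'vaisala'
  AND ordinal_position > 2;  -- skip `id` and `timestamp`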

When using scripts to talk to databases, you want to minimise the number of messages sent, as each message adds a round trip of network delay to your execution time. Currently it looks as if you are sending (by your approximation) 133 million separate messages, and thus paying that round-trip cost 133 million times. A simple optimisation would be to parse your CSV and split the data into the tables first (either in memory or saved to disk), and only then send the data to the new DB in bulk.
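For example (with made-up rows for the hypothetical vaisala2_temperature table from above), batching many rows into one multi-row INSERT sends a single message instead of three:

INSERT IGNORE INTO vaisala2_temperature (`timestamp`, `milliseconds`, `value`) VALUES
    ('2019-01-01 00:00:00', 1546300800000, '21.5'),
    ('2019-01-01 00:00:01', 1546300801000, '21.6'),
    ('2019-01-01 00:00:02', 1546300802000, '21.7');

In your Python script, mysql.connector's cursor.executemany() performs this rewriting for INSERT statements automatically, so you could collect the parsed rows per table into lists and flush them in batches of a few thousand.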

As you hinted, it's much quicker to write an SQL script to redistribute the data.
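Under the same assumed names, the whole transfer for one column could be a single statement that also reproduces your script's millisecond computation (UNIX_TIMESTAMP() returns epoch seconds for a TIMESTAMP value), assuming the destination table was created beforehand with the CREATE TABLE statement from the question:

INSERT IGNORE INTO measurements2.vaisala2_temperature (`timestamp`, `milliseconds`, `value`)
SELECT `timestamp`,
       UNIX_TIMESTAMP(`timestamp`) * 1000,
       `temperature`
FROM measurements.vaisala;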
