Why does populating my table take so long?

I'm loading a csv file into my database via a web form.

The order of the raw data is consistent within each csv file, but it changes from file to file depending on the source, so I have a preview form that shows five rows and lets you assign each column via a drop-down list of valid column names in the table.

Then I use the cgi form to build an INSERT statement, and parse the csv file line-by-line to populate the table.

But it is running EXTREMELY slow. I'm concurrently populating two tables, one with 961402 rows (7 columns with values) and the other with 1835538 rows (1 column with values), and each has been running for at least half an hour. I'm only seeing something like 100 new rows per second.

Can you see anything here that would slow me down?

NOTE: I know there is some ugly code in here; it was one of the first Python cgi scripts I wrote while figuring this language out.

    for item in form:
        field = form.getvalue(item)
        field = cgi.escape(field)
        if field == 'null':
            pass
        elif item == 'csvfile':
            pass
        elif item == 'campaign':
            pass
        elif item == 'numfields':
            pass
        else:
            # remember the table column name and which csv column feeds it
            colname = str(colname) + ", " + str(item)
            colnum.append(field)
    assert(numfields > 0)
    placeholders = (numfields-1) * "%s, " + "%s"
    query = ("insert into %s (%s)" % (table, colname.lstrip(",")))
    with open(fname, 'rb') as f:
        reader = csv.reader(f)
        try:
            record = 0
            errors = 0
            for row in reader:
                try:
                    record = record + 1
                    # build the VALUES clause by hand for this one row
                    data = ''
                    for value in colnum:
                        col = int(value)
                        rawrow = row[col]
                        saferow = rawrow.replace("'", "-")
                        saferow = saferow.replace("-", "")
                        data = str(data) + ", '" + saferow + "'"
                    dataset = data.lstrip(',')
                    insert = query + (" values (%s)" % dataset)
                    cur.execute(insert)
                    con.commit()  # commits once per inserted row
                    print ".",
                except IndexError, e:
                    print "Row:%d file %s, %s<br>" % (reader.line_num, fname.lstrip("./files/"), e)
                    errors = errors + 1
                except csv.Error, e:
                    print "Row:%s file %s, line %d: %s<br>" % (record, fname, reader.line_num, e)
                    errors = errors + 1
                except mdb.Error, e:
                    print "Row:%s Error %d: %s<br>" % (record, e.args[0], e.args[1])
                    errors = errors + 1
                except:
                    t, v, tb = sys.exc_info()
                    print "Row:%s %s<br>" % (record, v)
                    errors = errors + 1
        except csv.Error, e:
            print "except executed<br>"
            sys.exit('file %s, line %d: %s' % (fname, reader.line_num, e))
    print "Successfully loaded %s into Campaign %s, <br>" % (fname.lstrip("./files/"), table)
    print record - errors, "new records.<br>"
    print errors, "errors.<br>"

EDIT/UPDATE: Using LOAD DATA LOCAL INFILE worked like a charm; I loaded 600K records in less than a minute.

The new code is cleaner, too.

    else:
            colnum.append([field, item])  # pair: (csv column index, table column name)

    # sort by csv column position so the column list matches the file's field order
    sortlist = sorted(colnum, key=itemgetter(0))  # itemgetter comes from the operator module
    cols = ''
    for colname in sortlist:
        cols = cols + "%s, " % colname[1]
    cur.execute("LOAD DATA LOCAL INFILE '%s' IGNORE INTO TABLE %s FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' (%s)" % (fname, table, cols.rstrip(', ')))
    con.commit()

The only catch is that I have to do a smidge more work preparing my csv files to ensure data integrity; otherwise, it works like a charm.
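That prep step can stay in Python. A minimal sketch, assuming the cleanup is just normalizing fields into a temporary copy before LOAD DATA runs (the clean_fname name and the strip() cleanup are hypothetical examples, not the actual prep):

    import csv

    clean_fname = fname + ".clean"  # hypothetical temp copy of the upload
    with open(fname, 'rb') as src, open(clean_fname, 'wb') as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            # strip stray whitespace so LOAD DATA sees consistent fields
            writer.writerow([field.strip() for field in row])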

INSERT INTO, done one row at a time, is pretty slow, considering that some SQL databases, such as MySQL, support either putting a batch of rows in a single INSERT command or LOAD DATA statements that read CSV files quickly into the server.
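As a rough illustration, here is a minimal sketch of the batched alternative using MySQLdb (the driver the question imports as mdb); the connection parameters and the campaign table with columns a and b are hypothetical stand-ins:

    import MySQLdb as mdb

    con = mdb.connect(host="localhost", user="user", passwd="secret", db="mydb")
    cur = con.cursor()

    # collect all rows first, then hand them to the driver in one call;
    # MySQLdb can rewrite a simple VALUES insert into one multi-row statement
    rows = [("x1", "y1"), ("x2", "y2"), ("x3", "y3")]
    cur.executemany("INSERT INTO campaign (a, b) VALUES (%s, %s)", rows)
    con.commit()  # one commit for the whole batch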

See also: https://dba.stackexchange.com/questions/16809/why-is-load-data-infile-faster-than-normal-insert-statements

Some quick pseudocode. Do this:

for row in data_to_be_inserted:
    stmt = compose_statement("lalala")
    cursor.execute(stmt)

connection.commit()

not

for row in data_to_be_inserted:
    stmt = compose_statement("lalala")
    cursor.execute(stmt)
    connection.commit()

Your code commit()s once per line of input. That slows it down significantly.
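Applied to the question's loop, the fix is just hoisting con.commit() out of the per-row block. A minimal sketch reusing the cur, con, and fname names from above, with a hypothetical two-column campaign table:

    import csv

    with open(fname, 'rb') as f:
        reader = csv.reader(f)
        for row in reader:
            # parameterized query: the driver quotes the values itself,
            # which also replaces the manual replace("'", ...) escaping
            cur.execute("INSERT INTO campaign (a, b) VALUES (%s, %s)",
                        (row[0], row[1]))
    con.commit()  # one commit for the whole file, not one per row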
