简体   繁体   中英

Help with MySQL LOAD DATA INFILE

I want to load a CSV file that looks like this:

Acct. No.,1-15 Days,16-30 Days,31-60 Days,61-90 Days,91-120 Days,Beyond 120 Days
2314134101,898.89,8372.16,5584.23,7744.41,9846.54,2896.25
2414134128,5457.61,7488.26,9594.02,6234.78,273.7,2356.13
2513918869,2059.59,7578.59,9395.51,7159.15,5827.48,3041.62
1687950783,4846.85,8364.22,9892.55,7213.45,8815.33,7603.4
2764856043,5250.11,9946.49,8042.03,6058.64,9194.78,8296.2
2865446086,596.22,7670.04,8564.08,3263.85,9662.46,7027.22
,4725.99,1336.24,9356.03,1572.81,4942.11,6088.94
,8248.47,956.81,8713.06,2589.14,5316.68,1543.67
,538.22,1473.91,3292.09,6843.89,2687.07,9808.05
,9885.85,2730.72,6876,8024.47,1196.87,1655.29

But if you notice, some of the fields are incomplete. I'm thinking MySQL will just skip the row where the first column is missing. When I run the command:

LOAD DATA LOCAL INFILE 'test-long.csv' REPLACE INTO TABLE accounts
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n'
    IGNORE 1 LINES
    (cf_535, cf_580, cf_568, cf_569, cf_571, cf_572);

And the MySQL output is:

Query OK, 41898 rows affected, 20948 warnings (0.78 sec)
Records: 20949  Deleted: 20949  Skipped: 0  Warnings: 20948

The number of lines is only 20,949 but MySQL reports it as 41,898 rows affected. Why so? Also, nothing really changed in the table. I also couldn't see what the warnings generated is all about. I wanted to use the LOAD DATA INFILE because it takes python half a second to update each row which translates to 2.77 hours for a file with 20,000+ records.

UPDATE: Modified the code to set auto-commit to 'False' and added a db.commit() statement:

# Tell MySQLdb to turn off auto-commit
db.autocommit(False) 

# Set count to 1
count = 1
while count < len(contents):
    if contents[count][0] != '':
        cursor.execute("""
            UPDATE accounts SET cf_580 = %s, cf_568 = %s, cf_569 = %s, cf_571 = %s, cf_572 = %s
            WHERE cf_535 = %s""" % (contents[count][1], contents[count][2], contents[count][3], contents[count][4], contents[count][5], contents[count][0]))
    count += 1

try:
    db.commit()
except:
    db.rollback()

You have basically 3 issues here. In reverse order

  1. Are you doing your Python inserts in individual statements? You probably want to surround them all with a begin transaction/commit. 20,000 commits could easily take hours.
  2. Your import statement defines 6 fields, but the CSV has 7 fields. That would explain the double row count: every line of input results in 2 rows in the database, the 2nd one with fields 2-6 null.
  3. Incomplete rows will be inserted with null or default values for the missing columns. This may not be what you want with those malformed rows.

If your python program can't perform fast enough even with a single transaction, you should at least have the python program edit/clean the data file before importing. If Acct. No. is the primary key, as seems reasonable, inserting rows with blank will either cause the whole import to fail, or if auto number is on, cause bogus data to be imported.

如果在LOAD DATA中使用REPLACE关键字,则“已删除:”之后的数字表示实际替换了多少行

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM