
Python MySQLdb file load truncating rows, works fine when loading file from another mysql client

I'm getting data loss when doing a csv import using the Python MySQLdb module. The crazy thing is that I can load the exact same csv using other MySQL clients and it works fine.

  • It works perfectly fine when running the exact same command with the exact same csv from sequel pro mysql client
  • It works perfectly fine when running the exact same command with the exact same csv from the mysql command line
  • It doesn't work (some rows are truncated) when loading through a Python script using the MySQLdb module.

It's truncating about 10 rows off of my 7019 row csv.

The command I'm calling: LOAD DATA LOCAL INFILE '/path/to/load.txt' REPLACE INTO TABLE tble_name FIELDS TERMINATED BY ","

When the above command is run using the native mysql client on Linux or the Sequel Pro MySQL client on Mac, it works fine and I get 7019 rows imported.

When the above command is run using Python's MySQLdb module, such as:

dest_cursor.execute( '''LOAD DATA LOCAL INFILE '/path/to/load.txt' REPLACE INTO TABLE tble_name FIELDS TERMINATED BY ","''' )
dest_db.commit()
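For completeness, dest_cursor and dest_db above come from a connection roughly like this minimal sketch (the host, credentials, and database name are placeholders, not the real ones; local_infile=1 is the MySQLdb connect option that permits LOAD DATA LOCAL in the first place):

import MySQLdb

# Placeholder connection details; local_infile=1 allows LOAD DATA LOCAL INFILE,
# which some MySQLdb/server combinations reject by default.
dest_db = MySQLdb.connect(host='localhost', user='user', passwd='secret',
                          db='dbname', local_infile=1)
dest_cursor = dest_db.cursor()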

Almost all rows are imported, but I get thrown a slew of warnings: Warning: (1265L, "Data truncated for column '<various_column_names>' at row <various_rows>")

When the warnings pop up, each states at row <row_num>, but I'm not seeing that correlate to the row in the csv (I think it's the row it's trying to create in the target table, not the row in the csv), so I can't use that to help troubleshoot.

And sure enough, when it's done, my target table is missing some rows.

Unfortunately, with over 7,000 rows in the csv it's hard to tell exactly which line it's choking on for further analysis.
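One way to get more visibility (a sketch, not something from the original script): immediately after the LOAD DATA statement, run SHOW WARNINGS on the same cursor and print every warning, so the level, code, message, and reported row number are all captured in one place:

# Run immediately after dest_cursor.execute(LOAD DATA ...), before any other statement.
dest_cursor.execute("SHOW WARNINGS")
for level, code, message in dest_cursor.fetchall():
    print level, code, message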

There are many rows containing nulls and/or empty spaces, but they import fine.

The fact that I can import the entire csv using other MySQL clients makes me feel that the MySQLdb module is not configured right or something.

This is Python 2.7. Any help is appreciated. Any ideas on how to get better visibility into which line it's choking on would also be helpful.

To further help, I would ask you the following.

Error Checking

  • After your import using any of your three ways, are there any results from running SELECT @@GLOBAL.SQL_WARNINGS; after each run? (If so, this should show you the errors, as it might be silently failing.)
  • What is your SQL_MODE? SELECT @@GLOBAL.SQL_MODE;
  • For one, check the file and make sure you have an even number of " characters.
  • Check the data for extra " or , characters, or anything else that may get caught in translation between bash/Python/MySQL; a quick scan for this is sketched after this list.
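A hedged example of what such a scan might look like (the path and expected field count are placeholders, not from the original post):

# Placeholder path and expected field count; adjust for the real csv.
EXPECTED_FIELDS = 3

with open('/path/to/load.txt') as f:
    for lineno, line in enumerate(f, start=1):
        # Unbalanced quotes usually mean a quoting problem on this line.
        if line.count('"') % 2 != 0:
            print 'line %d: odd number of double quotes' % lineno
        # A raw comma count also trips on legitimate commas inside quoted fields.
        if line.count(',') + 1 != EXPECTED_FIELDS:
            print 'line %d: unexpected field count' % lineno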

Data Request

  • Can you provide the data for the 1st row that was missing?
  • Can you provide the exact script you are using?

Versions

  • You said you're using Python 2.7.
  • What version of the MySQL client/server? SELECT @@GLOBAL.VERSION;
  • What version of MySQLdb? (A quick way to print these from Python is sketched after this list.)
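A hedged sketch for printing the versions from the existing connection (dest_db is the connection object from the question; the rest are standard MySQLdb calls):

import MySQLdb

print 'MySQLdb version:', MySQLdb.__version__
print 'MySQL client library version:', MySQLdb.get_client_info()
print 'MySQL server version:', dest_db.get_server_info()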

Internationalization

  • Are you dealing with internationalization (汉语 Hànyǔ or русский etc. languages)?
  • What is the database/schema collation?

Query:

SELECT DISTINCT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM INFORMATION_SCHEMA.SCHEMATA
WHERE (
SCHEMA_NAME <> 'sys' AND
SCHEMA_NAME <> 'mysql' AND
SCHEMA_NAME <> 'information_schema' AND
SCHEMA_NAME <> '.mysqlworkbench' AND
SCHEMA_NAME <> 'performance_schema'
);
  • What is the Table collation?

Query:

SELECT DISTINCT ENGINE, TABLE_COLLATION FROM INFORMATION_SCHEMA.TABLES
WHERE (
TABLE_SCHEMA <> 'sys' AND
TABLE_SCHEMA <> 'mysql' AND
TABLE_SCHEMA <> 'information_schema' AND
TABLE_SCHEMA <> '.mysqlworkbench' AND
TABLE_SCHEMA <> 'performance_schema'
);
  • What is the column collation?

Query:

SELECT DISTINCT CHARACTER_SET_NAME, COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS
WHERE (
TABLE_SCHEMA <> 'sys' AND
TABLE_SCHEMA <> 'mysql' AND
TABLE_SCHEMA <> 'information_schema' AND
TABLE_SCHEMA <> '.mysqlworkbench' AND
TABLE_SCHEMA <> 'performance_schema'
);
Check the Database

Lastly, check the connection collation/character_set:

SHOW VARIABLES 
WHERE VARIABLE_NAME LIKE 'CHARACTER\_SET\_%' OR 
VARIABLE_NAME LIKE 'COLLATION%';
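If the connection character set turns out to be the problem, here is a hedged sketch of pinning it from the Python side (charset and use_unicode are standard MySQLdb connect options; the credentials are placeholders) and re-running the check above:

import MySQLdb

# Placeholder credentials; charset/use_unicode make the connection negotiate
# utf8 explicitly instead of relying on server/client defaults.
dest_db = MySQLdb.connect(host='localhost', user='user', passwd='secret',
                          db='dbname', charset='utf8', use_unicode=True,
                          local_infile=1)
dest_cursor = dest_db.cursor()
dest_cursor.execute("SHOW VARIABLES WHERE VARIABLE_NAME LIKE 'character\\_set\\_%' "
                    "OR VARIABLE_NAME LIKE 'collation%'")
for name, value in dest_cursor.fetchall():
    print name, '=', value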

If the first two ways work without error, then I'm leaning toward the following:

Other Plausible Concerns

I am not ruling out other problems either.

All in all, there is a lot to look at, and I require more information to further assist.

Please update your question when you have more information and I will do the same for my answer to help you resolve your error.

Hope this helps and all goes well!

Update:

Your Error

Warning: (1265L, "Data truncated for column

This leads me to believe it is the double quotes around your field terminations. Check to make sure your data does NOT have commas inside the fields that errored out. Embedded commas will cause your data to shift when running from the command line; the GUI is "smart enough", so to speak, to deal with this, but the command line is literal!
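If embedded commas are the cause, one standard way to handle them (a hedged sketch, not from the original answer) is to quote those fields in the csv and declare the quoting in the load statement with OPTIONALLY ENCLOSED BY:

# Same cursor as in the question; OPTIONALLY ENCLOSED BY '"' lets quoted
# fields contain literal commas without shifting the columns.
sql = ('LOAD DATA LOCAL INFILE \'/path/to/load.txt\' '
       'REPLACE INTO TABLE tble_name '
       'FIELDS TERMINATED BY "," '
       'OPTIONALLY ENCLOSED BY \'"\'')
dest_cursor.execute(sql)
dest_db.commit()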

This is an embarrassing one but maybe I can help someone in the future making horrible mistakes like I have.

I spent a lot of time analyzing fields, checking for special characters, etc and it turned out I was simply causing the problem myself.

  1. I had spaces in the csv and was NOT using a forced ENCLOSED BY in the load statement. This meant I was adding a space character to some fields, causing an overflow. The data looked like value1, value2, value3 when it should have been value1,value2,value3. Removing those spaces, putting quotes around the fields, and enforcing ENCLOSED BY in my statement fixed this. I assume the clients that were working were sanitizing the data behind the scenes or something; I really don't know for sure why it worked elsewhere with the same csv, but that got me through the first set of hurdles.

  2. Then, after getting through that, the last line in the csv was choking and stating Row doesn't contain data for all columns. It turns out I didn't close() the file after creating it and before attempting to load it, so there was some sort of lock on the file. Once I added the close() statement and fixed the spacing issue, all the data is loading now. A sketch of the combined fix follows this list.
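A hedged sketch of the two fixes together (the row data, paths, and connection details are placeholders): write the csv with csv.writer and QUOTE_ALL so every field is quoted with no stray spaces, use a with block so the file is flushed and closed before loading, then load with ENCLOSED BY:

import csv
import MySQLdb

rows = [('value1', 'value2', 'value3')]  # placeholder data

# The with block guarantees the file is flushed and closed before LOAD DATA runs.
with open('/path/to/load.txt', 'wb') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)  # quote every field, no padding spaces
    writer.writerows(rows)

dest_db = MySQLdb.connect(host='localhost', user='user', passwd='secret',
                          db='dbname', local_infile=1)
dest_cursor = dest_db.cursor()
dest_cursor.execute('LOAD DATA LOCAL INFILE \'/path/to/load.txt\' '
                    'REPLACE INTO TABLE tble_name '
                    'FIELDS TERMINATED BY "," ENCLOSED BY \'"\'')
dest_db.commit()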

Sorry for anyone that spent any measure of time looking into this issue for me.
