
pandas to_sql truncates my data

I was using df.to_sql(con=con_mysql, name='testdata', if_exists='replace', flavor='mysql') to export a data frame into mysql. However, I discovered that columns with long string content (such as url) are truncated to 63 characters. I received the following warning from ipython notebook when I exported:

/usr/local/lib/python2.7/site-packages/pandas/io/sql.py:248: Warning: Data truncated for column 'url' at row 3
  cur.executemany(insert_query, data)

There were other warnings in the same style for different rows.

Is there anything I can tweak to export the full data properly? I could set up the correct data schema in mysql and then export to that, but I am hoping a tweak can just make it work straight from python.
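For reference, that manual-schema route would look something like the sketch below (the CREATE TABLE statement and the VARCHAR(2048) size are just illustrative assumptions, not tested code):

# sketch of the manual-schema fallback: create the table yourself,
# then let pandas append into it instead of creating it
cur = con_mysql.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS testdata (
        url VARCHAR(2048)
        -- plus the other columns with appropriate types
    )
""")
con_mysql.commit()

df.to_sql(con=con_mysql, name='testdata', if_exists='append', flavor='mysql')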

If you are using pandas 0.13.1 or older, this limit of 63 characters is indeed hardcoded, because of this line in the code: https://github.com/pydata/pandas/blob/v0.13.1/pandas/io/sql.py#L278
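You can check which case applies to you by printing the pandas version:

import pandas as pd
print(pd.__version__)  # 0.13.1 or older -> the hardcoded VARCHAR (63) limit applies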

As a workaround, you could monkeypatch that function get_sqltype:

import numpy as np
from datetime import datetime, date

from pandas.io import sql

def get_sqltype(pytype, flavor):
    sqltype = {'mysql': 'VARCHAR (63)',    # <-- change this value to something sufficiently larger
               'sqlite': 'TEXT'}

    if issubclass(pytype, np.floating):
        sqltype['mysql'] = 'FLOAT'
        sqltype['sqlite'] = 'REAL'
    if issubclass(pytype, np.integer):
        sqltype['mysql'] = 'BIGINT'
        sqltype['sqlite'] = 'INTEGER'
    if issubclass(pytype, np.datetime64) or pytype is datetime:
        sqltype['mysql'] = 'DATETIME'
        sqltype['sqlite'] = 'TIMESTAMP'
    if pytype is date:
        sqltype['mysql'] = 'DATE'
        sqltype['sqlite'] = 'TIMESTAMP'
    if issubclass(pytype, np.bool_):
        sqltype['sqlite'] = 'INTEGER'

    return sqltype[flavor]

sql.get_sqltype = get_sqltype

Then just using your original code should work:

df.to_sql(con=con_mysql, name='testdata', if_exists='replace', flavor='mysql')

Starting from pandas 0.14, the sql module uses sqlalchemy under the hood, and strings are converted to the sqlalchemy TEXT type, which is in turn converted to the mysql TEXT type (and not VARCHAR). This also allows you to store strings longer than 63 characters:

engine = sqlalchemy.create_engine('mysql://scott:tiger@localhost/foo')
df.to_sql('testdata', engine, if_exists='replace')
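If you want to double-check the generated column types, sqlalchemy's inspector can show them (a quick sketch, reusing the engine and table name from above):

insp = sqlalchemy.inspect(engine)
for col in insp.get_columns('testdata'):
    print(col['name'], col['type'])  # string columns should now show up as TEXT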

The issue only remains if you still use a DBAPI connection instead of a sqlalchemy engine, but that option is deprecated and it is recommended to provide a sqlalchemy engine to to_sql.

Inspired by @joris's answer, I decided to hard code the change into pandas' source and re-compile.

cd /usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/io
sudo pico sql.py

I changed line 871

'mysql': 'VARCHAR (63)',

to

'mysql': 'VARCHAR (255)',

and then recompiled just that file:

sudo python -m py_compile sql.py

I restarted my script and the to_sql() function wrote a table. (I expected that the recompile would have broken pandas, but it seems not to have.)

Here is my script to write a dataframe to mysql, for reference:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sqlalchemy 
from sqlalchemy import create_engine
df = pd.read_csv('10k.csv')
## ... dataframe munging
df = df.where(pd.notnull(df), None)  # workaround for NaN bug: store NULL instead of NaN
engine = create_engine('mysql://user:password@localhost:3306/dbname')
con = engine.connect().connection    # raw DBAPI connection (the legacy interface)
df.to_sql("issues", con, 'mysql', if_exists='replace', index=True, index_label=None)
