
Python/MySQLdb import from CSV with data encapsulation

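I have been using WAMP to ingest some CSV logs, and I would like to move to a more automated process by scripting some of the routine operations I need to perform.

Until now I have used the direct CSV import feature in phpMyAdmin, which handles the dialect and nuances of the CSV.

I have written an uploader in Python that uses MySQLdb to parse the log files, but because the logs contain some awkward characters, I find I have to do a lot of input cleaning that I would rather not do.

For example, the logs are output from a directory-scanning program, and I have no control over the folder naming conventions people use. I have this folder: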

"C:\user\NZ Business Roundtable_Download_13Feb2013, 400 Access"

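and the , character is read as a new field delimiter (it is CSV, after all). What I really want is for everything inside the double quotes to be treated as text: "......"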

I have seen similar problems with the ' character, and I am sure there will be more.
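As a minimal sketch of the behaviour being asked for (Python 3, standard library only): with its default settings, `csv.reader` keeps a comma that sits inside double quotes as part of the field. The record below is abridged from the log sample, and the names `line` and `row` are made up for this illustration.

```python
import csv
import io

# One abridged record: the folder name contains a comma inside double quotes.
line = '"1","C:\\jay\\NZ Business Roundtable_Download_13Feb2013, 400 Access","Folder"\n'

# csv.reader defaults to delimiter=',' and quotechar='"',
# so the quoted comma stays inside a single field.
row = next(csv.reader(io.StringIO(line)))

print(len(row))
print(row[1])
```

If the reader splits the path into two fields instead, the parsing step is probably not going through the `csv` module (for example, a plain `str.split(',')`).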

I found this: http://www.tech-recipes.com/rx/2345/import_csv_file_directly_into_mysql/ , which shows how I could write a Python script that behaves like the phpMyAdmin load routine. It mainly relies on this snippet:

load data local infile 'uniq.csv' into table tblUniq fields terminated by ','
enclosed by '"'
lines terminated by '\n'
(uniqName, uniqCity, uniqComments)

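However, I have already scripted a fair amount of processing and changes to the table, and I would like to keep that work, so I was wondering whether there is a way to "tell" MySQL that I want to use "" as the text encapsulation. The main thing I want to keep is that a new table is given a specific name when it is created, and that name is used throughout the rest of the processing.

An example of my table-maker script: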
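One way the two requirements might be combined is to build the LOAD DATA statement in Python so that it still targets a run-time table name. The sketch below only builds the statement text; it is not tested against a server. The helper name `build_load_statement` and the table name `scan_2013` are hypothetical. It assumes the file path is passed separately as a query parameter, and that `LOCAL INFILE` is enabled on both the client and the server.

```python
def build_load_statement(table, columns):
    """Build a LOAD DATA statement for a table whose name is chosen at run time."""
    # Identifiers cannot be sent as query parameters, so refuse anything
    # that could break out of the backtick quoting.
    for name in [table] + list(columns):
        if "`" in name:
            raise ValueError("backtick not allowed in identifier: %r" % name)
    column_list = ", ".join("`{}`".format(c) for c in columns)
    return (
        "LOAD DATA LOCAL INFILE %s INTO TABLE `atl`.`{table}` "
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' "
        "LINES TERMINATED BY '\\n' "
        "IGNORE 1 LINES ({cols})"
    ).format(table=table, cols=column_list)


# Hypothetical table name; the columns must follow the order used in the file.
statement = build_load_statement("scan_2013", ["ID", "PARENT_ID", "URI"])
print(statement)
```

The statement would then be run with something like `self.cur.execute(statement, (csv_path,))`. `OPTIONALLY ENCLOSED BY` is used because the log leaves some empty fields unquoted, and `IGNORE 1 LINES` skips the header row.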

def make_table(self):
    query ="DROP TABLE IF EXISTS `atl`.`{}`".format(self.table)
    self.cur.execute(query)
    query = "CREATE TABLE IF NOT EXISTS `atl`.`{}` (`PK` INT NOT NULL AUTO_INCREMENT PRIMARY KEY, `ID` varchar(10), `PARENT_ID` varchar(10), `URI` varchar(284), \
        `FILE_PATH` varchar(230), `NAME` varchar(125), `METHOD` varchar(9), `STATUS` varchar(4), `SIZE` varchar(9), \
        `TYPE` varchar(9), `EXT` varchar(11), `LAST_MODIFIED` varchar(19), `EXTENSION_MISMATCH` varchar(20), `MD5_HASH` varchar(32), \
        `FORMAT_COUNT` varchar(2), `PUID` varchar(9), `MIME_TYPE` varchar(71), `FORMAT_NAME` varchar(59), `FORMAT_VERSION` varchar(7), \
        `delete_flag` tinyint, `delete_reason` VARCHAR(80), `move_flag` TINYINT, `move_reason` VARCHAR(80), \
        `ext_change_flag` TINYINT, `ext_change_reason` VARCHAR(80), `ext_change_value` VARCHAR(4), `fname_change_flag` TINYINT, `fname_change_reason` VARCHAR(80),\
        `fname_change_value` VARCHAR(80))".format(self.table)
    self.cur.execute(query)
    self.mydb.commit()

An example of my ingest script:

 def ingest_row(self, row):
    query = "insert"
    # Prepare SQL query to INSERT a record into the database.
    query = "INSERT INTO `atl`.`{0}` (`ID`, `PARENT_ID`, `URI`, `FILE_PATH`, `NAME`, `METHOD`, `STATUS`, `SIZE`, `TYPE`, `EXT`, \
        `EXTENSION_MISMATCH`, `LAST_MODIFIED`, `MD5_HASH`, `FORMAT_COUNT`, `PUID`, `MIME_TYPE`, `FORMAT_NAME`,  `FORMAT_VERSION`) \
        VALUES ('{1}','{2}','{3}','{4}','{5}','{6}','{7}','{8}','{9}','{10}','{11}','{12}','{13}','{14}','{15}','{16}','{17}','{18}')".format(self.table, row[0], row[1], row[2], row[3], row[4], \
         row[5], row[6], row[7], row[8], row[9], row[10], row[11], row[12], row[13], row[14], row[15], row[16], row[17])
    try:
        self.cur.execute(query)
        self.mydb.commit()
    except:
        print query
        quit()

Log sample:

"ID","PARENT_ID","URI","FILE_PATH","NAME","METHOD","STATUS","SIZE","TYPE","EXT","LAST_MODIFIED","EXTENSION_MISMATCH","MD5_HASH","FORMAT_COUNT","PUID","MIME_TYPE","FORMAT_NAME","FORMAT_VERSION"
"1","","file:/C:/jay/NZ%20Business%20Roundtable_Download_13Feb2013,%20400%20Access/","C:\jay\NZ Business Roundtable_Download_13Feb2013, 400 Access","NZ Business Roundtable_Download_13Feb2013, 400 Access",,"Done","","Folder",,"2013-06-28T11:31:36","false",,"",,"","",""
"2","1","file:/C:/jay/NZ%20Business%20Roundtable_Download_13Feb2013,%20400%20Access/1993/","C:\jay\NZ Business Roundtable_Download_13Feb2013, 400 Access\1993","1993",,"Done","","Folder",,"2013-06-28T11:31:36","false",,"",,"","",""

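Never use string formatting, concatenation, etc. to build SQL queries!

The DB-API requires all drivers to support parameterized queries, and the parameters should be passed to the cursor's execute method. For MySQLdb, which supports format-style parameters, it looks like this: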

cursor.execute('insert into sometable values (%s, %s)', ('spam', 'eggs'))

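The supplied parameters are escaped correctly by the library, so it does not matter whether your strings contain characters that need escaping.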
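To see this without a MySQL server, here is a small sketch that uses the standard-library `sqlite3` module as a stand-in. Note the swap: `sqlite3` uses `?` placeholders where MySQLdb uses `%s`. The table `demo` and the sample values are invented for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE demo (name TEXT, note TEXT)")

# Values containing a comma, a single quote and double quotes.
awkward = ("Roundtable_Download_13Feb2013, 400 Access", "it's \"quoted\"")

# sqlite3 uses '?' placeholders; MySQLdb would use '%s' in the same position.
cur.execute("INSERT INTO demo VALUES (?, ?)", awkward)
conn.commit()

# The values come back exactly as they went in; no manual cleaning was needed.
stored = cur.execute("SELECT name, note FROM demo").fetchone()
print(stored == awkward)
```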

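The only exception in your particular case is the table name, because escaping it as a parameter would produce illegal SQL.

You should use SQL prepared statements. Mixing data and SQL code with format opens the door to SQL injection (which is almost always number one in the top 25 software bugs/security issues).


For example, here is your data: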
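Since the table name still has to be formatted into the query text, one possible safeguard is to check it against a strict pattern first. This is a sketch only: the helper `quote_table_name` and the allowed character set are assumptions chosen for this example, not something MySQLdb provides.

```python
import re

# Letters, digits and underscores only; MySQL limits identifiers to 64 characters.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_]{1,64}")


def quote_table_name(name):
    """Return `name` wrapped in backticks, or raise if it is not a plain identifier."""
    if not _SAFE_NAME.fullmatch(name):
        raise ValueError("unsafe table name: %r" % name)
    return "`{}`".format(name)


# Only the identifier is formatted in; the values remain %s placeholders
# that are filled in by cursor.execute.
query = "INSERT INTO `atl`.{} (`ID`, `NAME`) VALUES (%s, %s)".format(
    quote_table_name("mytable"))
print(query)
```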

>>> log = """\
... "ID","PARENT_ID","URI","FILE_PATH","NAME","METHOD","STATUS","SIZE","TYPE","EXT","LAST_MODIFIED","EXTENSION_MISMATCH","MD5_HASH","FORMAT_COUNT","PUID","MIME_TYPE","FORMAT_NAME","FORMAT_VERSION"
... "1","","file:/C:/jay/NZ%20Business%20Roundtable_Download_13Feb2013,%20400%20Access/","C:\jay\NZ Business Roundtable_Download_13Feb2013, 400 Access","NZ Business Roundtable_Download_13Feb2013, 400 Access",,"Done","","Folder",,"2013-06-28T11:31:36","false",,"",,"","",""
... "2","1","file:/C:/jay/NZ%20Business%20Roundtable_Download_13Feb2013,%20400%20Access/1993/","C:\jay\NZ Business Roundtable_Download_13Feb2013, 400 Access\1993","1993",,"Done","","Folder",,"2013-06-28T11:31:36","false",,"",,"","",""
... """

I don't have the file, so let me fake one:

>>> import StringIO
>>> logfile = StringIO.StringIO(log)

Then let's build the query:

>>> import csv
>>> csvreader = csv.reader(logfile)
>>> fields = csvreader.next()
>>> 
>>> table = 'mytable'
>>> 
>>> fields_fmt = ', '.join([ '`%s`' % f for f in fields ])
>>> values_fmt = ', '.join(['%s'] * len(fields))
>>> query = "INSERT INTO `atl`.`{0}` ({1}) VALUES ({2})".format(
... #        self.table, fields_fmt, values_fmt)
...         table, fields_fmt, values_fmt)
>>> query
'INSERT INTO `atl`.`mytable` (`ID`, `PARENT_ID`, `URI`, `FILE_PATH`, `NAME`, `METHOD`, `STATUS`, `SIZE`, `TYPE`, `EXT`, `LAST_MODIFIED`, `EXTENSION_MISMATCH`, `MD5_HASH`, `FORMAT_COUNT`, `PUID`, `MIME_TYPE`, `FORMAT_NAME`, `FORMAT_VERSION`) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'
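The session above is Python 2 (`StringIO.StringIO`, `csvreader.next()`). For readers on Python 3, a roughly equivalent sketch follows. It uses a shortened three-column log invented for brevity; `io.StringIO` and the built-in `next()` replace the Python 2 spellings.

```python
import csv
import io

# Shortened stand-in for the real log: a header row plus one record.
log = (
    '"ID","PARENT_ID","NAME"\n'
    '"1","","NZ Business Roundtable_Download_13Feb2013, 400 Access"\n'
)

reader = csv.reader(io.StringIO(log))
fields = next(reader)  # Python 3 spelling of csvreader.next()

table = "mytable"
fields_fmt = ", ".join("`%s`" % f for f in fields)
values_fmt = ", ".join(["%s"] * len(fields))
query = "INSERT INTO `atl`.`{0}` ({1}) VALUES ({2})".format(
    table, fields_fmt, values_fmt)

rows = list(reader)  # remaining records, each a list of strings
print(query)
```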

Then, if you rework ingest_row like this:

def ingest_row(self, row):
    try:
        self.cur.execute(query, row)
        self.mydb.commit()
    except:
        print query
        quit()

Then you can import the data:

for row in csvreader:
    self.ingest_row(row)  # ingest_row is a method, so call it on the instance
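Committing after every row can be slow for large logs. A possible variation, sketched below, sends all rows through the DB-API `executemany` method and commits once, rolling back if anything fails. The function `ingest_rows` is hypothetical, and `sqlite3` (with `?` placeholders) stands in for MySQLdb so the example can run without a server.

```python
import sqlite3


def ingest_rows(conn, query, rows):
    """Insert all rows with one executemany call and a single commit."""
    cur = conn.cursor()
    try:
        cur.executemany(query, rows)
        conn.commit()
    except Exception:
        # Leave the table untouched if any row fails, then re-raise.
        conn.rollback()
        raise


# Stand-in database; with MySQLdb the placeholders would be %s instead of ?.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id TEXT, name TEXT)")
ingest_rows(conn, "INSERT INTO demo VALUES (?, ?)",
            [("1", "Roundtable, 400 Access"), ("2", "1993")])
count = conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
print(count)
```

Re-raising the exception rather than calling `quit()` keeps the traceback, which usually says more than printing the query alone.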
