Python 3 - How to fetch data from a SQL database, process it row by row, and append it to a pandas DataFrame?

Question

I have a MySQL database table with the following columns:

+--------------+--------------+------+-----+---------+----------------+
| Field        | Type         | Null | Key | Default | Extra          |
+--------------+--------------+------+-----+---------+----------------+
| id           | int unsigned | NO   | PRI | NULL    | auto_increment |
| artist       | text         | YES  |     | NULL    |                |
| title        | text         | YES  |     | NULL    |                |
| album        | text         | YES  |     | NULL    |                |
| duration     | text         | YES  |     | NULL    |                |
| artistlink   | text         | YES  |     | NULL    |                |
| songlink     | text         | YES  |     | NULL    |                |
| albumlink    | text         | YES  |     | NULL    |                |
| instrumental | tinyint(1)   | NO   |     | 0       |                |
| downloaded   | tinyint(1)   | NO   |     | 0       |                |
| filepath     | text         | YES  |     | NULL    |                |
| language     | json         | YES  |     | NULL    |                |
| genre        | json         | YES  |     | NULL    |                |
| style        | json         | YES  |     | NULL    |                |
| artistgender | text         | YES  |     | NULL    |                |
+--------------+--------------+------+-----+---------+----------------+

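I need to fetch the data from it, process it, and add it to a pandas DataFrame.

I know how to fetch data from a SQL database, and I have already implemented a way to get the data into a DataFrame, but it is extremely slow (about 30 seconds), whereas the same operation with a plain list of namedtuples is very fast (under 3 seconds).

Specifically, filepath defaults to NULL unless the file has been downloaded (currently no songs are downloaded). When Python fetches filepath the value is None, and I need it to become ''.

Because MySQL has no native BOOLEAN type (the flag columns are tinyint(1)), I need to convert the received int to bool.

The language, genre, and style fields are tags stored as JSON lists, and they are all currently NULL. When Python fetches them they arrive as strings, so I need json.loads to turn them into list objects, unless they are None, in which case I need to append an empty list.

Here is my inefficient solution to the problem: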

import json
import mysql.connector
from pandas import *

fields = {
    "artist": str(),
    "album": str(),
    "title": str(),
    "id": int(),
    "duration": str(),
    "instrumental": bool(),
    "downloaded": bool(),
    "filepath": str(),
    "language": list(),
    "genre": list(),
    "style": list(),
    "artistgender": str(),
    "artistlink": str(),
    "albumlink": str(),
    "songlink": str(),
}

conn = mysql.connector.connect(
    user="Estranger", password=PWD, host="127.0.0.1", port=3306, database="Music"
)
cursor = conn.cursor()

def proper(x):
    return x[0].upper() + x[1:]

def fetchdata():
    cursor.execute("select {} from songs".format(', '.join(list(fields))))
    data = cursor.fetchall()
    dataframes = list()
    for item in data:
        entry = list(map(proper, item[0:3]))
        entry += [item[3]]
        for j in range(4, 7):
            cell = item[j]
            if isinstance(cell, int):
                entry.append(bool(cell))
            elif isinstance(cell, str):
                entry.append(cell)
        if item[7] is not None:
            entry.append(item[7])
        else:
            entry.append('')
        for j in range(8, 11):
            entry.append(json.loads(item[j])) if item[j] is not None else entry.append([])
        entry.append(item[11])
        entry += item[12:15]
        df = DataFrame(fields, index=[])
        row = Series(entry, index = df.columns)
        df = df.append(row, ignore_index=True)
        dataframes.append(df)
    songs = concat(dataframes, axis=0, ignore_index=True)
    songs.sort_values(['artist', 'album', 'title'], inplace=True)
    return songs

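The database currently holds 4464 songs, and the code takes about 30 seconds to finish.

I sorted my SQL database by artist and title, but I need the QTreeWidget entries sorted by artist, album, and title. MySQL sorts data differently from Python, and I prefer Python's ordering.

In my tests, df.loc and df = df.append() are slow and pd.concat is fast, but I really don't know how to create a DataFrame with only one row, or how to pass a flat list to a DataFrame instead of a dictionary. I would also like to know whether there is a faster way than pd.concat, and whether the operations in the for loop can be vectorized.

How can I improve my code?

I figured out how to create a DataFrame from a list of lists while specifying the column names, and it is very fast, but I still don't know how to specify the data types elegantly without the code raising errors...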

def fetchdata():
    cursor.execute("select {} from songs".format(', '.join(list(fields))))
    data = cursor.fetchall()
    for i, item in enumerate(data):
        entry = list(map(proper, item[0:3]))
        entry += [item[3]]
        for j in range(4, 7):
            cell = item[j]
            if isinstance(cell, int):
                entry.append(bool(cell))
            elif isinstance(cell, str):
                entry.append(cell)
        if item[7] is not None:
            entry.append(item[7])
        else:
            entry.append('')
        for j in range(8, 11):
            entry.append(json.loads(item[j])) if item[j] is not None else entry.append([])
        entry.append(item[11])
        entry += item[12:15]
        data[i] = entry
    songs = DataFrame(data, columns=list(fields), index=range(len(data)))
    songs.sort_values(['artist', 'album', 'title'], inplace=True)
    return songs

I still need the type conversions. They are already very fast, but they don't look elegant.
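
One possible way to declare the column types after the frame is built, as a rough sketch only. The `fields` dict here is a shortened, hypothetical version of the one in the question, and the rows are made up. Only the int and bool columns are cast, since a Python list has no pandas dtype and the text columns can stay as object.

```python
import pandas as pd

# Shortened, hypothetical version of the question's `fields` dict.
fields = {"title": str(), "id": int(), "downloaded": bool(), "genre": list()}

# Made-up rows in the same column order; `downloaded` is still a raw 0/1 int.
rows = [["Alpha", 1, 0, []], ["Beta", 2, 1, ["rock"]]]
songs = pd.DataFrame(rows, columns=list(fields))

# Derive the target type of each scalar column from its default value.
# bool is a subclass of int, so one isinstance check covers both types.
scalar_types = {
    name: type(default)
    for name, default in fields.items()
    if isinstance(default, int)
}

# DataFrame.astype accepts a {column: dtype} mapping and leaves other columns alone.
songs = songs.astype(scalar_types)
print(songs.dtypes)
```

The mapping keeps the type information in one place (the `fields` dict), so adding a column does not require touching the conversion code.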

Answer 1

You can create a conversion function for each column:

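funcs = [
    str.capitalize,  # artist (note: also lower-cases the rest, unlike proper() in the question)
    str.capitalize,  # album
    str.capitalize,  # title
    int,             # id
    str,             # duration
    bool,            # instrumental
    bool,            # downloaded
    lambda v: v if v is not None else '',              # filepath
    lambda v: json.loads(v) if v is not None else [],  # language
    lambda v: json.loads(v) if v is not None else [],  # genre
    lambda v: json.loads(v) if v is not None else [],  # style
    str,             # artistgender (note: str(None) yields the text 'None')
    str,             # artistlink
    str,             # albumlink
    str,             # songlink
]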

Now you can apply the function that converts each field's value:

for i, item in enumerate(data):
    row = [func(field) for field, func in zip(item, funcs)]
    data[i] = row
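
To make the per-column converter idea concrete, here is a minimal end-to-end sketch that runs without a database. The sample rows are invented stand-ins for `cursor.fetchall()` output, and `text_or_empty`, `json_list`, and `rows_to_frame` are hypothetical helper names, not part of the answer. It uses the question's `proper` instead of `str.capitalize` and maps NULL text to '' instead of calling `str` on it, which is a deliberate variation on the list above.

```python
import json
import pandas as pd

# Column order taken from the question's `fields` dict.
COLUMNS = ["artist", "album", "title", "id", "duration", "instrumental",
           "downloaded", "filepath", "language", "genre", "style",
           "artistgender", "artistlink", "albumlink", "songlink"]

def proper(text):
    """Upper-case only the first character (assumes the value is not NULL)."""
    return text[:1].upper() + text[1:]

def text_or_empty(value):
    """Map SQL NULL (None) to an empty string."""
    return "" if value is None else value

def json_list(value):
    """Decode a JSON list, treating SQL NULL as an empty list."""
    return [] if value is None else json.loads(value)

# One converter per column, in the same order as COLUMNS.
CONVERTERS = [proper, proper, proper, int, text_or_empty, bool, bool,
              text_or_empty, json_list, json_list, json_list,
              text_or_empty, text_or_empty, text_or_empty, text_or_empty]

def rows_to_frame(rows):
    """Convert every cell, then build the DataFrame in a single call."""
    converted = [[convert(cell) for convert, cell in zip(CONVERTERS, row)]
                 for row in rows]
    frame = pd.DataFrame(converted, columns=COLUMNS)
    return frame.sort_values(["artist", "album", "title"], ignore_index=True)

# Made-up rows shaped like cursor.fetchall() output.
sample_rows = [
    ("the Example Band", "second album", "zeta", 2, "3:05", 0, 0,
     None, None, '["rock"]', None, None, None, None, None),
    ("another Artist", "first album", "alpha", 1, "4:10", 1, 0,
     None, '["en"]', None, None, "female", None, None, None),
]
songs = rows_to_frame(sample_rows)
print(songs[["artist", "id", "instrumental", "filepath", "genre"]])
```

Building the frame once from a list of lists avoids the per-row DataFrame construction that made the original version slow.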

Answer 2

For the first part of the question, here is an example that reads from a generic table named "history":

    import pymysql

    # Open the database connection with keyword arguments
    # (recent PyMySQL releases no longer accept positional ones)
    connection = pymysql.connect(
        host="localhost", user="root", password="123456", database="blue"
    )
    try:
        # Create a cursor, run the query, and fetch every row
        with connection.cursor() as cursor:
            cursor.execute("SELECT * FROM history")
            data = cursor.fetchall()
        if data:
            print("Last row uploaded", list(data[-1]))
    except pymysql.MySQLError as exc:
        print("Error: unable to fetch data:", exc)
    finally:
        # Disconnect from the server
        connection.close()

Answer 3

You can simply fetch the data from the table and create a DataFrame using pandas.

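import pandas as pd
import pymysql

# Fill in your own connection details; 3306 is the MySQL default port
conn = pymysql.connect(
    host="", user="", password="", database="", port=3306, connect_timeout=10
)
try:
    with conn.cursor() as cursor:
        cursor.execute("SELECT * FROM schema.table_name;")
        # cursor.description carries the column names of the result set
        columns = [col[0] for col in cursor.description]
        data = pd.DataFrame(list(cursor.fetchall()), columns=columns)
finally:
    conn.close()

# You can go ahead and create a CSV from this DataFrame;
# with no path argument, to_csv returns the CSV text as a string
csv_gen = data.to_csv(index=False)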
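
As a further sketch of the same idea, pandas can run the query itself through `pd.read_sql_query`, and the clean-up steps from the question can then be done column-wise instead of row by row. To keep the example runnable, an in-memory `sqlite3` database stands in for MySQL, and the table holds only a few of the question's columns with made-up rows. Both are assumptions for illustration. For MySQL, pandas documents a SQLAlchemy connectable as the supported input.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database as a stand-in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE songs ("
    "id INTEGER PRIMARY KEY, artist TEXT, title TEXT, "
    "downloaded INTEGER NOT NULL DEFAULT 0, filepath TEXT)"
)
# Made-up rows; downloaded defaults to 0 and filepath to NULL.
conn.executemany(
    "INSERT INTO songs (artist, title) VALUES (?, ?)",
    [("b artist", "zeta"), ("a artist", "alpha")],
)

# Let pandas run the query and build the DataFrame, column names included.
songs = pd.read_sql_query(
    "SELECT id, artist, title, downloaded, filepath FROM songs", conn
)
conn.close()

# Column-wise conversions instead of a Python loop over rows.
songs["downloaded"] = songs["downloaded"].astype(bool)  # 0/1 -> False/True
songs["filepath"] = songs["filepath"].fillna("")        # NULL -> ''
songs = songs.sort_values(["artist", "title"], ignore_index=True)
print(songs)
```

The JSON tag columns are left out here because decoding them still needs a per-value call to `json.loads`.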
    
    
 

3 answers

Answer 1
1 Accepted 2021-07-29 12:47:11

Answer 2
0 2022-03-30 17:14:05

Answer 3
0 2022-11-15 03:13:42
