What is the most efficient way to convert a MySQL result set to a NumPy array?

Question

I'm using MySQLdb and Python. I have some basic queries such as this:

c=db.cursor()
c.execute("SELECT id, rating from video")
results = c.fetchall()

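I need "results" to be a NumPy array, and I want to be economical with memory consumption. Copying the data row by row seems incredibly inefficient (it would need double the memory). Is there a better way to convert MySQLdb query results into the NumPy array format?

The reason I want the NumPy array format is that I want to be able to slice and dice the data easily, and plain Python does not seem very friendly to multidimensional arrays in that regard.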

e.g. b = a[a[:,2]==1]

Thanks!

Answer 1

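This solution uses Keith's fromiter technique, but handles the two-dimensional table structure of SQL results more intuitively. It also improves on Doug's method by avoiding all the reshaping and flattening in Python data types. Using a structured array, we can read almost directly from the MySQL result into NumPy, cutting out Python data types almost entirely. I say "almost" because the result of fetchall still produces Python tuples.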

There is one caveat, but it's not a big one: you must know the data types of your columns and the number of rows in advance.

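Knowing the column types should be straightforward, since you presumably know what the query is. Otherwise you can always use curs.description and a map of the MySQLdb.FIELD_TYPE.* constants.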
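As a rough sketch of that mapping: under DB-API 2.0, curs.description is a sequence of 7-item tuples with the column name first and the type code second. The numeric codes below are intended to mirror MySQLdb.constants.FIELD_TYPE (TINY, SHORT, LONG, FLOAT, DOUBLE, LONGLONG, INT24) and are hardcoded so the example runs without a database. The helper name dtype_from_description and the choice of NumPy types are assumptions for illustration; unsigned columns and NULL values are not handled.

```python
import numpy as np

# Numeric type codes intended to mirror MySQLdb.constants.FIELD_TYPE.
# Hardcoded here so the sketch runs without a database connection.
FIELD_TYPE_TO_NUMPY = {
    1: 'i1',   # TINY
    2: 'i2',   # SHORT
    3: 'i4',   # LONG (MySQL INT)
    4: 'f4',   # FLOAT
    5: 'f8',   # DOUBLE
    8: 'i8',   # LONGLONG (MySQL BIGINT)
    9: 'i4',   # INT24 (MySQL MEDIUMINT)
}


def dtype_from_description(description):
    """Build a structured dtype from cursor.description (hypothetical helper)."""
    fields = []
    for column in description:
        name, type_code = column[0], column[1]
        if type_code not in FIELD_TYPE_TO_NUMPY:
            raise TypeError(
                "no NumPy mapping for column %r (type code %r)" % (name, type_code)
            )
        fields.append((name, FIELD_TYPE_TO_NUMPY[type_code]))
    return np.dtype(fields)


# Stand-in for curs.description after "SELECT id, rating FROM video"
fake_description = (
    ('id', 3, None, None, None, None, 0),
    ('rating', 3, None, None, None, None, 1),
)
video_dtype = dtype_from_description(fake_description)
```

The resulting dtype can be passed as the dtype argument of numpy.fromiter, and it names the fields after the columns instead of the default 'f0', 'f1'.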

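Knowing the number of rows means you have to use a client-side cursor (which is the default). I don't know enough about the internals of MySQLdb and the MySQL client libraries, but my understanding is that the entire result is fetched into client-side memory when using a client-side cursor, although I suspect some buffering and caching is actually involved. This would mean using double memory for the result, once for the cursor copy and once for the array copy, so it's probably a good idea to close the cursor as soon as possible to free up the memory if the result set is large.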

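Strictly speaking, you don't have to provide the number of rows in advance, but doing so means the array memory is allocated once up front rather than being resized repeatedly as more rows come in from the iterator, which should give a large performance boost.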
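To illustrate the effect of count, here is a small sketch that uses an in-memory list of tuples as a stand-in for the rows a cursor would yield (the names rows and row_source are made up for the example). It does not measure performance; it only shows that both calls produce the same array and that count must not exceed the number of rows available.

```python
import numpy as np

# Stand-in for the rows a cursor would yield: (id, rating) tuples.
rows = [(1, 5), (2, 3), (3, 4)]


def row_source():
    # A generator has no len(), so fromiter cannot know the size in advance.
    for row in rows:
        yield row


# Without count, the array is grown as items arrive.
grown = np.fromiter(row_source(), dtype='i4,i4')

# With count, the array is allocated once at its final size.
preallocated = np.fromiter(row_source(), dtype='i4,i4', count=len(rows))

# A count larger than the number of available rows raises ValueError.
try:
    np.fromiter(row_source(), dtype='i4,i4', count=len(rows) + 1)
    too_short_raised = False
except ValueError:
    too_short_raised = True
```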

With that said, some code:

import MySQLdb
import numpy

conn = MySQLdb.connect(host='localhost', user='bob', passwd='mypasswd', db='bigdb')
curs = conn.cursor() #Use a client side cursor so you can access curs.rowcount
numrows = curs.execute("SELECT id, rating FROM video")

#curs.fetchall() supplies the rows, as per Keith's answer
#count=numrows means advance allocation
#dtype='i4,i4' means two columns, both 4 byte (32 bit) integers
A = numpy.fromiter(curs.fetchall(), count=numrows, dtype=('i4,i4'))

print(A) #output entire array
ids = A['f0'] #ids = an array of the first column
              #(strictly speaking it's a field not column)
ratings = A['f1'] #ratings is an array of the second column

For how to specify the column data types and column names, see the NumPy documentation for dtype and for structured arrays.
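As a small sketch of named fields, the following uses a list of tuples in place of curs.fetchall() and gives the two fields the names id and rating instead of the default f0 and f1. The sample values are invented. It also shows the kind of boolean-mask selection the question asks for.

```python
import numpy as np

# Stand-in for curs.fetchall(): (id, rating) tuples with made-up values.
rows = [(10, 5), (11, 3), (12, 5)]

# Name the fields explicitly rather than relying on 'f0' and 'f1'.
video_dtype = np.dtype([('id', 'i4'), ('rating', 'i4')])
A = np.fromiter(rows, dtype=video_dtype, count=len(rows))

# Select the records whose rating is 5, then pull out their ids.
top_rated = A[A['rating'] == 5]
top_ids = top_rated['id']
```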

Answer 2

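The fetchall method actually returns an iterable, and NumPy has the fromiter function to initialize an array from an iterable. So, depending on what data is in the table, you could easily combine the two, or use an adapter generator.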
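One possible shape for such an adapter generator is sketched below. It assumes each row is an (id, rating) tuple, that only the rating column is wanted, and that SQL NULL arrives as None; the function name ratings_as_floats and the sample rows are invented for the example.

```python
import numpy as np


def ratings_as_floats(rows):
    """Adapter generator: yield one float per row, mapping NULL (None) to NaN."""
    for _id, rating in rows:
        yield float('nan') if rating is None else float(rating)


# Stand-in for curs.fetchall()
rows = ((1, 4), (2, None), (3, 5))
ratings = np.fromiter(ratings_as_floats(rows), dtype=float, count=len(rows))
```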

Answer 3

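NumPy's fromiter method seems best here (as in Keith's answer, which preceded this one).

Using fromiter to recast a result set, returned by a call to a MySQLdb cursor method, to a NumPy array is simple, but there are a couple of details perhaps worth mentioning.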

import itertools

import numpy as NP
import MySQLdb as SQL

cxn = SQL.connect('localhost', 'some_user', 'their_password', 'db_name')
c = cxn.cursor()
c.execute('SELECT id, rating FROM video')

# fetchall() returns a nested tuple (one tuple for each table row)
results = c.fetchall()

# 'num_rows' is needed to reshape the 1D NumPy array returned by 'fromiter',
# in other words, to restore the original dimensions of the result set
num_rows = int(c.rowcount)

# flatten the nested tuple into a single stream of values
x = itertools.chain.from_iterable(results)

# D is a 1D NumPy array (the iterable is passed positionally)
D = NP.fromiter(x, dtype=float, count=-1)

# 'restore' the original dimensions of the result set:
D = D.reshape(num_rows, -1)

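Note that fromiter returns a 1D NumPy array. (This makes sense, of course, because you can use fromiter to return just a portion of a single MySQL table row by passing a value for count.)

Still, you have to restore the 2D shape, hence the earlier read of the cursor's rowcount attribute and the subsequent call to reshape in the final line.

Finally, the default value of the count parameter is -1, which simply consumes the entire iterable.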
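The behaviour of count and reshape can be checked without a database. The sketch below uses a nested tuple with made-up values in place of c.fetchall(); the helper flat is only there to produce a fresh flattened iterator for each call.

```python
import itertools

import numpy as np

# Stand-in for c.fetchall(): one (id, rating) tuple per table row.
results = ((1, 4), (2, 3), (3, 5))
num_rows = len(results)


def flat():
    # Fresh flattened iterator over all values, row by row.
    return itertools.chain.from_iterable(results)


everything = np.fromiter(flat(), dtype=float)           # count defaults to -1
first_row = np.fromiter(flat(), dtype=float, count=2)   # stops after two values

# Restore the 2D shape: num_rows rows, column count inferred by -1.
table = everything.reshape(num_rows, -1)

# Rows whose second column equals 5, in the style the question asks for.
fives = table[table[:, 1] == 5]
```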

Answer 1
23 2013-08-15 17:26:50

Answer 2
15 accepted 2011-08-15 05:49:01

Answer 3
6 2011-08-15 06:51:21