Using a psycopg2 converter to retrieve bytea data from PostgreSQL

I want to store NumPy arrays in a PostgreSQL database in binary (bytea) form. I can get this to work fine in test #1 (see below), but I don't want to manipulate the data arrays before every insert and after every select; I want to use psycopg2's adapters and converters.

Here's what I have at the moment:

import numpy as np
import psycopg2, psycopg2.extras

def my_adapter(spectrum):
    return psycopg2.Binary(spectrum)

def my_converter(my_buffer, cursor):
    return np.frombuffer(my_buffer)

class MyBinaryTest():

    # Connection info
    user = 'postgres'
    password = 'XXXXXXXXXX'
    host = 'localhost'
    database = 'test_binary'

    def __init__(self):
        # Switched on by setup_adapters_and_converters()
        self.use_adapters = False

    def set_up(self):

        # Set up
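        connection = psycopg2.connect(host=self.host, user=self.user, password=self.password)
        # CREATE/DROP DATABASE cannot run inside a transaction block
        connection.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

        cursor = connection.cursor()
        try: # Clear out any old test database
            cursor.execute('drop database %s' % (self.database, ))
        except psycopg2.Error:
            pass # Nothing to drop on a first run

        cursor.execute('create database %s' % (self.database, ))
        connection.close()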

        # Connect directly to the database and set up our table
        self.connection = psycopg2.connect(host=self.host, user=self.user, password=self.password, database=self.database)
        self.cursor = self.connection.cursor(cursor_factory=psycopg2.extras.DictCursor)

        self.cursor.execute('''CREATE TABLE spectrum (
            "sid" integer not null primary key,
            "data" bytea not null
            );

            CREATE SEQUENCE spectrum_id;
            ALTER TABLE spectrum
                ALTER COLUMN sid
                    SET DEFAULT NEXTVAL('spectrum_id');
        ''')
        self.connection.commit()

    def perform_test_one(self):

        # Lets do a test

        shape = (2, 100)
        data = np.random.random(shape)

        # Binary up the data
        send_data = psycopg2.Binary(data)

        self.cursor.execute('insert into spectrum (data) values (%s) returning sid;', [send_data])

        # Retrieve the data we just inserted
        query = self.cursor.execute('select * from spectrum')
        result = self.cursor.fetchall()

        print "Type of data retrieved:", type(result[0]['data'])

        # Convert it back to a numpy array of the same shape
        retrieved_data = np.frombuffer(result[0]['data']).reshape(*shape)

        # Ensure there was no problem
        assert np.all(retrieved_data == data)
        print "Everything went swimmingly in test one!"

        return True

    def perform_test_two(self):

        if not self.use_adapters: return False

        # Lets do a test

        shape = (2, 100)
        data = np.random.random(shape)

        # No changes made to the data, as the adapter should take care of it (and it does)

        self.cursor.execute('insert into spectrum (data) values (%s) returning sid;', [data])

        # Retrieve the data we just inserted
        query = self.cursor.execute('select * from spectrum')
        result = self.cursor.fetchall()

        # No need to change the type of data, as the converter should take care of it
        # (But, we never make it here)

        retrieved_data = result[0]['data']

        # Ensure there was no problem
        assert np.all(retrieved_data == data.flatten())
        print "Everything went swimmingly in test two!"

        return True

    def setup_adapters_and_converters(self):

        # Set up test adapters
        psycopg2.extensions.register_adapter(np.ndarray, my_adapter)

        # Register our converter
        self.cursor.execute("select null::bytea;")
        my_oid = self.cursor.description[0][1]

        obj = psycopg2.extensions.new_type((my_oid, ), "numpy_array", my_converter)
        psycopg2.extensions.register_type(obj, self.connection)


        self.use_adapters = True

    def tear_down(self):

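        # Tear down: close our own session first, since a database
        # with open connections cannot be dropped
        self.cursor.close()
        self.connection.close()

        connection = psycopg2.connect(host=self.host, user=self.user, password=self.password)
        connection.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

        cursor = connection.cursor()
        cursor.execute('drop database %s' % (self.database, ))
        connection.close()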

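test = MyBinaryTest()
test.set_up()
test.perform_test_one()
test.setup_adapters_and_converters()
test.perform_test_two()
test.tear_down()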

Now, test #1 works fine. When I take the code from test 1 and set up a psycopg2 adapter and converter, it does not work (test 2). This is because the data fed to the converter is no longer a buffer; it is PostgreSQL's string representation of bytea. The output is as follows:

In [1]: run -i test_binary.py
Type of data retrieved: <type 'buffer'>
Everything went swimmingly in test one!
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (273, 0))

ValueError                                Traceback (most recent call last)

/Users/andycasey/thesis/scope/scope/test_binary.py in <module>()
    155 test.perform_test_one()
    156 test.setup_adapters_and_converters()
--> 157 test.perform_test_two()
    158 test.tear_down()

/Users/andycasey/thesis/scope/scope/test_binary.py in perform_test_two(self)
    101         # Retrieve the data we just inserted

    102         query = self.cursor.execute('select * from spectrum')
--> 103         result = self.cursor.fetchall()
    105         # No need to change the type of data, as the converter should take care of it

/Library/Python/2.6/site-packages/psycopg2/extras.pyc in fetchall(self)
     81     def fetchall(self):
     82         if self._prefetch:
---> 83             res = _cursor.fetchall(self)
     84         if self._query_executed:
     85             self._build_index()

/Users/andycasey/thesis/scope/scope/test_binary.py in my_converter(my_buffer, cursor)
      8 def my_converter(my_buffer, cursor):
----> 9     return np.frombuffer(my_buffer)

ValueError: buffer size must be a multiple of element size
WARNING: Failure executing file: <test_binary.py>

In [2]: %debug
> /Users/andycasey/thesis/scope/scope/test_binary.py(9)my_converter()
      8 def my_converter(my_buffer, cursor):
----> 9     return np.frombuffer(my_buffer)

ipdb> my_buffer

Does anyone know how I can either (a) de-serialize the string representation coming back to me in my_converter so that I return a NumPy array each time, or (b) force PostgreSQL/psycopg2 to send the buffer representation (which I can use) to the converter instead of the string representation?

Thanks!

I'm on OS X 10.6.8 with Python 2.6.1 (r261:67515), PostgreSQL 9.0.3 and psycopg2 2.4 (dt dec pq3 ext).

The format you see in the debugger is easy to parse: it is the PostgreSQL hex binary format (http://www.postgresql.org/docs/9.1/static/datatype-binary.html). psycopg can parse that format and return a buffer containing the data; you can use that buffer to obtain an array. Instead of writing a typecaster from scratch, write one that invokes the original function and post-processes its result. Sorry, but I can't remember its name right now and I'm writing from a mobile: you may get further help from the mailing list.
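For reference, the hex format mentioned above can also be decoded by hand. The following is only a sketch of option (a) from the question: decode_hex_bytea is a hypothetical helper name, and the wire text is simulated locally rather than fetched from a server. In practice psycopg2's own typecaster is the better choice, since it takes care of this parsing for you.

```python
import binascii

import numpy as np


def decode_hex_bytea(text):
    """Decode PostgreSQL's hex bytea text form ('\\x' + hex digits) to bytes."""
    if not text.startswith("\\x"):
        raise ValueError("value is not in hex bytea format")
    return binascii.unhexlify(text[2:])


# Simulate the text the server would send for a small float64 array
original = np.array([1.0, 2.5, -3.0])
wire_text = "\\x" + binascii.hexlify(original.tobytes()).decode("ascii")

# Decode the text back to raw bytes, then view those bytes as an array
restored = np.frombuffer(decode_hex_bytea(wire_text), dtype=original.dtype)
print(restored)
```

This also shows why np.frombuffer failed in the question: the converter was handed the hex text, which is about twice as long as the raw data plus a two-character prefix, so its size was not a multiple of 8 bytes.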

Edit: complete solution.

The default bytea typecaster (the object that can parse the postgres binary representation and return a buffer object from it) is psycopg2.BINARY. We can use it to create a typecaster that converts to an array instead:

In [1]: import psycopg2

In [2]: import numpy as np

In [3]: a = np.eye(3)

In [4]: a
array([[ 1.,  0.,  0.],
      [ 0.,  1.,  0.],
      [ 0.,  0.,  1.]])

In [5]: cnn = psycopg2.connect('')

# The adapter: converts from python to postgres
# note: this only works on numpy version whose arrays 
# support the buffer protocol,
# e.g. it works on 1.5.1 but not on 1.0.4 on my tests.

In [12]: def adapt_array(a):
  ....:     return psycopg2.Binary(a)

In [13]: psycopg2.extensions.register_adapter(np.ndarray, adapt_array)

# The typecaster: from postgres to python

In [21]: def typecast_array(data, cur):
  ....:     if data is None: return None
  ....:     buf = psycopg2.BINARY(data, cur)
  ....:     return np.frombuffer(buf)

In [24]: ARRAY = psycopg2.extensions.new_type(psycopg2.BINARY.values,
'ARRAY', typecast_array)

In [25]: psycopg2.extensions.register_type(ARRAY)

# Now it works "as expected"

In [26]: cur = cnn.cursor()

In [27]: cur.execute("select %s", (a,))

In [28]: cur.fetchone()[0]
Out[28]: array([ 1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.])

As you know, np.frombuffer(a) loses the array shape, so you will have to figure out a way to preserve it.
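One possible way to preserve the shape (and the dtype, which np.frombuffer otherwise assumes to be float64) is to prepend a small header to the raw bytes. This is a sketch only: pack_array and unpack_array are hypothetical helper names, the header layout is an arbitrary choice, and it is intended for plain numeric dtypes.

```python
import json
import struct

import numpy as np


def pack_array(arr):
    """Serialize arr as: 4-byte header length, JSON header, raw data."""
    arr = np.ascontiguousarray(arr)
    header = json.dumps({"dtype": arr.dtype.str, "shape": arr.shape}).encode("ascii")
    return struct.pack("<I", len(header)) + header + arr.tobytes()


def unpack_array(payload):
    """Inverse of pack_array: rebuild the array with its dtype and shape."""
    payload = bytes(payload)  # also accepts buffer/memoryview objects
    (header_len,) = struct.unpack_from("<I", payload, 0)
    header = json.loads(payload[4:4 + header_len].decode("ascii"))
    data = np.frombuffer(payload, dtype=np.dtype(header["dtype"]),
                         offset=4 + header_len)
    return data.reshape(header["shape"])


a = np.eye(3)
b = unpack_array(pack_array(a))
print(b.shape, b.dtype)
```

With helpers like these, the adapter would wrap pack_array(a) in psycopg2.Binary, and the typecaster would call unpack_array on the buffer returned by psycopg2.BINARY. Note that the array returned by np.frombuffer is read-only; copy it if you need to modify it.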

For numpy arrays, one can avoid the buffer strategy and its disadvantages, such as the loss of shape and data type. Following a stackoverflow question about storing a numpy array in sqlite3, one can easily adapt the approach for postgres.

import io
import psycopg2 as psql
import numpy as np

# converts from python to postgres
def _adapt_array(arr):
    out = io.BytesIO()
    np.save(out, arr)
    # getvalue() returns the whole buffer regardless of the stream position
    return psql.Binary(out.getvalue())

# converts from postgres to python
def _typecast_array(value, cur):
    if value is None:
        return None

    data = psql.BINARY(value, cur)
    bdata = io.BytesIO(data)
    return np.load(bdata)

con = psql.connect('')

psql.extensions.register_adapter(np.ndarray, _adapt_array)
t_array = psql.extensions.new_type(psql.BINARY.values, "numpy", _typecast_array)
# the typecaster has no effect until it is registered
psql.extensions.register_type(t_array)

cur = con.cursor()

Now one can create and fill a table (with the array a defined as in the previous answer):

cur.execute("create table test (data BYTEA)")
cur.execute("insert into test values(%s)", (a,))

And restore the numpy object:

cur.execute("select * from test")
cur.fetchone()[0]

Result:

array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])
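The serialization step can be checked on its own, without a database. The sketch below (array_to_bytes and bytes_to_array are hypothetical names) round-trips an array through np.save and np.load in memory and confirms that shape and dtype survive. It also shows why the stream position matters: calling read() straight after np.save returns nothing, because the position is at the end of the buffer.

```python
import io

import numpy as np


def array_to_bytes(arr):
    """Serialize an array to the .npy format in memory."""
    out = io.BytesIO()
    np.save(out, arr)
    return out.getvalue()  # whole buffer, independent of stream position


def bytes_to_array(raw):
    """Rebuild an array from .npy bytes."""
    return np.load(io.BytesIO(raw))


a = np.arange(6, dtype=np.int32).reshape(2, 3)
b = bytes_to_array(array_to_bytes(a))
print(b.shape, b.dtype)

# Pitfall: after writing, the position is at the end of the stream
stream = io.BytesIO()
np.save(stream, a)
unread = stream.read()   # empty, nothing is left after the current position
stream.seek(0)
payload = stream.read()  # the full .npy payload
```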

I tried both of these answers and couldn't get them to work until I modified Daniel's code to use np.savetxt and np.loadtxt and changed his typecaster to use

bdata = BytesIO(data[1:-1])

so the two functions now look like

def _adapt_array(arr):
    out = BytesIO()
    np.savetxt(out, arr, fmt='%.2f')
    return pg2.Binary(out.read())

def _typecast_array(value, cur):
    if value is None:
       return None
    data = pg2.BINARY(value, cur)
    bdata = BytesIO(data[1:-1])
    return np.loadtxt(bdata)

pg2.extensions.register_adapter(np.ndarray, _adapt_array)
t_array = pg2.extensions.new_type(pg2.BINARY.values, 'numpy', _typecast_array)

The error I was getting was could not convert string to float: '[473.07'. I suspect this fix will only work for flat arrays, but that's how my data was structured, so it worked for me.
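One caveat with the text-based approach is worth checking before relying on it: np.savetxt with fmt='%.2f' keeps only two decimal places, so the values that come back are rounded. The following is a small in-memory sketch with made-up sample values and no database involved:

```python
import io

import numpy as np

arr = np.array([[473.0712, 1.2345], [0.004, 2.0]])

# Write the array as text with two decimal places, as in the answer above
out = io.BytesIO()
np.savetxt(out, arr, fmt="%.2f")
text_payload = out.getvalue()

# Read it back from the same bytes
restored = np.loadtxt(io.BytesIO(text_payload))

print(restored.shape)
print(np.allclose(restored, arr))               # exact values are not recovered
print(np.allclose(restored, np.round(arr, 2)))  # only the rounded values are
```

If full precision matters, a binary format such as the one written by np.save avoids the rounding.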
