
Relational data structure in Python

I'm looking for a SQL-relational-table-like data structure in Python, or some hints for implementing one if none already exists. Conceptually, the data structure is a set of objects (any objects) that supports efficient lookups/filtering (possibly using SQL-like indexing).

For example, let's say my objects all have properties A, B, and C, which I need to filter by, so I specify that the data should be indexed by them. The objects may contain lots of other members, which are not used for filtering. The data structure should support operations equivalent to SELECT <obj> from <DATASTRUCTURE> where A=100 (same for B and C). It should also be possible to filter by more than one field (where A=100 and B='bar').

The requirements are:

  1. Should support a large number of items (~200K). The items must be the objects themselves, and not some flattened version of them (which rules out sqlite and likely pandas).
  2. Insertion should be fast and should avoid reallocation of memory (which pretty much rules out pandas).
  3. Should support simple filtering (like the example above), which must be more efficient than O(len(DATA)), i.e. avoid "full table scans".

Does such a data structure exist?


Please don't suggest using sqlite. I'd need to repeatedly convert object->row and row->object, which is time-consuming and cumbersome since my objects are not necessarily flat-ish.

Also, please don't suggest using pandas, because repeated insertion of rows is too slow, as it may require frequent reallocation.

So long as you don't have any duplicates on (a,b,c), you could subclass dict, enter your objects keyed by the tuple (a,b,c), and define your filter method (probably a generator) to return all entries that match your criteria.

class mydict(dict):
    def filter(self, a=None, b=None, c=None):
        # Keys are (a, b, c) tuples; None means "match any value".
        for key, obj in self.items():
            if ((a is None or key[0] == a) and
                    (b is None or key[1] == b) and
                    (c is None or key[2] == c)):
                yield obj

That is an ugly and very inefficient example (it still visits every key), but you get the idea. I'm sure there is a better way to implement it, in itertools or something.
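As a rough sketch of the tuple-keyed idea, the matching can be written once for keys of any length by treating None as a wildcard. The names `filter_by_key`, `table`, and the sample data are made up for illustration. Note that this still scans every key, so it does not meet the "no full table scan" requirement.

```python
def filter_by_key(table, pattern):
    """Yield values whose tuple key matches `pattern`; None matches anything."""
    for key, obj in table.items():
        # Compare field by field; a None in the pattern is a wildcard.
        if all(want is None or want == got for want, got in zip(pattern, key)):
            yield obj


# Hypothetical sample data keyed by (a, b, c).
table = {
    (0, 1, "one"): "first",
    (0, 2, "two"): "second",
    (5, 1, "one"): "third",
}

matches_a0 = list(filter_by_key(table, (0, None, None)))
matches_b1_one = list(filter_by_key(table, (None, 1, "one")))
```

Comparing against None explicitly (instead of testing truthiness) means a value such as 0 or an empty string can still be used as a filter.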

Edit:

I kept thinking about this. I toyed around with it some last night and came up with storing the objects in a list and keeping, for each desired key field, a dictionary that maps a field value to the list positions of the objects having that value. Objects are retrieved by taking the intersection of the position sets for all specified criteria. Like this:

objs = []
aindex = {}
bindex = {}
cindex = {}

def insertobj(a,b,c,obj):
    idx = len(objs)
    objs.append(obj)
    # Record the object's list position under each of its key values.
    aindex.setdefault(a, []).append(idx)
    bindex.setdefault(b, []).append(idx)
    cindex.setdefault(c, []).append(idx)

def filterobjs(a=None,b=None,c=None):
    # None means "do not filter on this field"; 0 and '' are valid values.
    result = None
    for index, value in ((aindex, a), (bindex, b), (cindex, c)):
        if value is None:
            continue
        hits = set(index.get(value, ()))  # unknown value -> no matches
        result = hits if result is None else result & hits
    if result is None:
        result = range(len(objs))  # no criteria given: return everything
    for idx in sorted(result):
        yield objs[idx]

class testobj(object):
    def __init__(self,a,b,c):
        self.a = a
        self.b = b
        self.c = c

    def show(self):
        print ('a=%i\tb=%i\tc=%s'%(self.a,self.b,self.c))

if __name__ == '__main__':
    for a in range(20):
        for b in range(5):
            for c in ['one','two','three','four']:
                insertobj(a,b,c,testobj(a,b,c))

    for obj in filterobjs(a=5):
        obj.show()
    print()
    for obj in filterobjs(b=3):
        obj.show()
    print()
    for obj in filterobjs(a=8,c='one'):
        obj.show()

It should be reasonably quick: although the objects are in a list, they are accessed directly by position, and the "searching" is done on hashed dicts.
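For what it's worth, the same idea can be wrapped in a small class so that the indexed fields are not hard-coded. This is only a sketch under a few assumptions: the class name `IndexedCollection`, its `insert`/`filter` methods, and the `Item` test class are invented here, indexed values must be hashable, and objects are assumed not to change their indexed attributes after insertion.

```python
from collections import defaultdict


class IndexedCollection:
    """Hypothetical container: keeps the objects themselves plus one
    hash index (value -> set of list positions) per indexed attribute."""

    def __init__(self, *fields):
        self._objs = []
        self._indexes = {f: defaultdict(set) for f in fields}

    def insert(self, obj):
        pos = len(self._objs)
        self._objs.append(obj)  # amortized O(1) append
        for field, index in self._indexes.items():
            index[getattr(obj, field)].add(pos)

    def filter(self, **criteria):
        if not criteria:
            yield from self._objs
            return
        # .get avoids creating empty entries for values never inserted.
        hits = [self._indexes[f].get(v, set()) for f, v in criteria.items()]
        hits.sort(key=len)  # start from the smallest candidate set
        for pos in sorted(hits[0].intersection(*hits[1:])):
            yield self._objs[pos]


class Item:
    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c


coll = IndexedCollection("a", "b", "c")
for a in range(20):
    for b in range(5):
        for c in ["one", "two", "three", "four"]:
            coll.insert(Item(a, b, c))

found = list(coll.filter(a=8, c="one"))
```

Storing sets in the indexes avoids rebuilding a set from a list on every query, and the cost of a lookup depends on the sizes of the matching sets rather than on the total number of stored objects. Filtering on a field that was not declared as indexed raises KeyError in this sketch.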
