Pytables slow on query for non-matching string

I'm relatively new to Python, and I'm using PyTables to store some genomic annotations in HDF5 for faster querying. I find that querying for a string that does not exist in the table is slow, but I'm unsure how to optimize it for better performance.

One of the tables is shown below:

In [5]: t
Out[5]: 
/gene/annotation (Table(315202,), fletcher32, blosc(5)) ''
  description := {
  "name": StringCol(itemsize=36, shape=(), dflt='', pos=0),
  "track": StringCol(itemsize=12, shape=(), dflt='', pos=1),
  "etype": StringCol(itemsize=12, shape=(), dflt='', pos=2),
  "event": StringCol(itemsize=36, shape=(), dflt='', pos=3)}
  byteorder := 'irrelevant'
  chunkshape := (1365,)
  autoindex := True
  colindexes := {
    "name": Index(9, full, shuffle, zlib(1)).is_csi=True}

When a condition matches something in the table, timeit reports a time in the microsecond range.

In [6]: timeit [x for x in t.where("name == 'record_exists_in_table'")]
10000 loops, best of 3: 109 µs per loop

However, when I search for a string that does not exist in the table, the time is in the millisecond range.

In [8]: timeit [x for x in t.where("name == 'no_such_record'")]
10 loops, best of 3: 56 ms per loop

Any advice that points me in the right direction would be greatly appreciated!

I've exhausted my search on the web and have yet to find anything that resolves the issue. So I've decided to use SeqIO.index_db() in Biopython to create a separate index, and then check that the name exists in that index before executing the PyTables query. It's not exactly the pretty solution I was looking for, but it will do. It has substantially improved performance for non-matching conditions.

In [6]: timeit [x for x in t.where("name == 'not_found_in_table'")]
10 loops, best of 3: 51.6 ms per loop

In [9]: timeit [x for x in t.search_by_gene('not_found_in_table')]
10000 loops, best of 3: 29.5 µs per loop
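
The post does not show how search_by_gene is implemented, so the following is only a minimal sketch of the same "check before you query" idea. It swaps the SeqIO.index_db() index for a plain in-memory Python set, which is a different technique from the one described above. The class name NameGuard and the run_query callable are hypothetical names introduced for illustration.

```python
class NameGuard:
    """Remember which names exist and skip queries that cannot match.

    Hypothetical helper, not part of PyTables or Biopython.
    """

    def __init__(self, names, run_query):
        # names: iterable of keys present in the table
        # run_query: callable that performs the real (expensive) lookup
        self._known = {self._as_text(n) for n in names}
        self._run_query = run_query

    @staticmethod
    def _as_text(value):
        # String columns may come back as bytes under Python 3,
        # so normalize everything to str before storing it.
        if isinstance(value, bytes):
            return value.decode("utf-8")
        return str(value)

    def search(self, name):
        # Cheap set membership test first; only run the query on a hit.
        if name not in self._known:
            return []
        return list(self._run_query(name))
```

With the table above, this could presumably be wired up as NameGuard(t.col("name"), lambda n: t.where("name == key", condvars={"key": n.encode()})). Both t.col and the condvars argument of where are PyTables calls, but that wiring is an untested assumption here. The set has to be rebuilt whenever rows are added or removed.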
