PyTables is slow when querying a non-matching string

Question

I'm relatively new to Python and I'm using PyTables to store some genomic annotations in HDF5 for faster querying. I find that querying a non-matching string in the table is slow, and I'm unsure how to optimize it for better performance.

One of the tables is shown below:

In [5]: t
Out[5]: 
/gene/annotation (Table(315202,), fletcher32, blosc(5)) ''
  description := {
  "name": StringCol(itemsize=36, shape=(), dflt='', pos=0),
  "track": StringCol(itemsize=12, shape=(), dflt='', pos=1),
  "etype": StringCol(itemsize=12, shape=(), dflt='', pos=2),
  "event": StringCol(itemsize=36, shape=(), dflt='', pos=3)}
  byteorder := 'irrelevant'
  chunkshape := (1365,)
  autoindex := True
  colindexes := {
    "name": Index(9, full, shuffle, zlib(1)).is_csi=True}

When a condition matches something in the table, timeit reports a time in the microseconds.

In [6]: timeit [x for x in t.where("name == 'record_exists_in_table'")]
10000 loops, best of 3: 109 µs per loop

However, when I search for a string that does not exist in the table, the time is in the milliseconds.

In [8]: timeit [x for x in t.where("name == 'no_such_record'")]
10 loops, best of 3: 56 ms per loop

Any advice that points me in the right direction would be greatly appreciated!

Answer 1

I've exhausted my search on the web and have yet to find anything that resolves the issue. So I've decided to use SeqIO.index_db() in Biopython to create a separate index, then check that a condition will be found before executing a PyTables query. Not exactly the pretty solution I was looking for, but this will do. It has substantially improved performance for non-matching conditions.

In [6]: timeit [x for x in t.where("name == 'not_found_in_table'")]
10 loops, best of 3: 51.6 ms per loop

In [9]: timeit [x for x in t.search_by_gene('not_found_in_table')]
10000 loops, best of 3: 29.5 µs per loop
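
For anyone who wants the same "check before you query" guard without Biopython, here is a rough sketch that swaps the SeqIO.index_db() index for a plain in-memory Python set. It is not the search_by_gene method timed above. The helper names build_name_set and read_if_present are made up for illustration, and the sketch assumes Python 3 (where PyTables string columns hold bytes) and that the whole "name" column fits comfortably in memory.

```python
def build_name_set(table, column="name"):
    """Read one column into memory and return its distinct values as a set.

    Hypothetical helper: `table` is assumed to be an open PyTables Table.
    """
    # Table.col() loads the entire column, so this is a one-off up-front cost.
    return set(table.col(column))


def read_if_present(table, known_names, name, column="name"):
    """Return matching rows, skipping the PyTables query on a known miss."""
    # Assumption: Python 3, where string columns store bytes, so encode str input.
    key = name.encode("utf-8") if isinstance(name, str) else name
    if key not in known_names:
        return []  # known miss: the table is never queried
    # condvars binds `target` inside the condition, so no manual quoting is needed.
    return table.read_where(f"{column} == target", condvars={"target": key})
```

Usage would look something like known = build_name_set(t) once after opening the file, then read_if_present(t, known, 'no_such_record') per lookup. Note that a miss returns an empty list while a hit returns whatever read_where returns (a NumPy structured array in PyTables), and the set has to be rebuilt if rows are added to the table.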

Answer 1
0 2014-08-21 14:11:00
