
Convert ascii encoding to int and back again in python (quickly)

I have a file format (fastq format) that encodes a string of integers as a string where each integer is represented by an ascii code with an offset. Unfortunately, there are two encodings in common use, one with an offset of 33 and the other with an offset of 64. I typically have several hundred million strings of length 80-150 to convert from one offset to the other. The simplest code that I could come up with for doing this type of thing is:

def phred64ToStdqual(qualin):
    return(''.join([chr(ord(x)-31) for x in qualin]))
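
For reference, the constant 31 above is just the difference between the two offsets (64 - 33). A minimal sketch of that arithmetic follows; the names `decode`, `encode`, and the sample string `'hgf@'` are made up for illustration and are not part of the fastq tooling:

```python
PHRED64_OFFSET = 64
PHRED33_OFFSET = 33
SHIFT = PHRED64_OFFSET - PHRED33_OFFSET  # 31, the constant used above

def decode(qual, offset):
    # Turn each character into its integer quality score.
    return [ord(ch) - offset for ch in qual]

def encode(scores, offset):
    # Turn integer quality scores back into characters.
    return ''.join(chr(s + offset) for s in scores)

# Hypothetical offset-64 quality string.
scores = decode('hgf@', PHRED64_OFFSET)
converted = encode(scores, PHRED33_OFFSET)
```

Decoding and re-encoding gives the same result as subtracting 31 from each character code directly, which is what the one-liner does.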

This works just fine, but it is not particularly fast. For 1 million strings, it takes about 4 seconds on my machine. If I change to using a couple of dicts to do the translation, I can get this down to about 2 seconds.

ctoi = {}
itoc = {}
for i in xrange(127):
    itoc[i]=chr(i)
    ctoi[chr(i)]=i

def phred64ToStdqual2(qualin):
    return(''.join([itoc[ctoi[x]-31] for x in qualin]))

If I blindly run under cython, I get it down to just under 1 second.
It seems like at the C-level, this is simply a cast to int, subtract, and then cast to char. I haven't written this up, but I'm guessing it is quite a bit faster. Any hints, including how to better code this in python or even a cython version to do this, would be quite helpful.

Thanks,

Sean

If you look at the code for urllib.quote, there is something that is similar to what you're doing. It looks like:

_map = {}
def phred64ToStdqual2(qualin):
    if not _map:
        for i in range(31, 127):
            _map[chr(i)] = chr(i - 31)
    return ''.join(map(_map.__getitem__, qualin))

Note that the above function also works when the mappings are not the same length (in urllib.quote, you have to map '%' -> '%25').
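
To illustrate the unequal-length case, here is a minimal sketch of the same dict-lookup idea. The `_escape` table and the `escape` function are hypothetical names chosen for this example; this is not urllib's actual implementation:

```python
# Hypothetical table: one character in, possibly several characters out.
_escape = {'%': '%25', ' ': '%20'}

def escape(text):
    # Characters without an entry are passed through unchanged.
    return ''.join(_escape.get(ch, ch) for ch in text)
```

A per-character dict lookup like this can grow or shrink the string, which a fixed translation table cannot do.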

But actually, since every translation is the same length, python has a function that does just this very quickly: maketrans and translate. You probably won't get much faster than:

import string
_trans = None
def phred64ToStdqual4(qualin):
    global _trans
    if not _trans:
        _trans = string.maketrans(''.join(chr(i) for i in range(31, 127)), ''.join(chr(i) for i in range(127 - 31)))
    return qualin.translate(_trans)
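
The code above targets Python 2, where `string.maketrans` exists. On Python 3 the same idea should carry over with `str.maketrans` (for text) or `bytes.maketrans` (for raw bytes). The following is a sketch under that assumption; the function names and the choice to map only codes 64 through 126 are mine, not from the original code:

```python
# Assumption: valid offset-64 quality characters are codes 64..126.
_SRC = range(64, 127)

# Text version: build the table once at import time.
_STR_TABLE = str.maketrans(
    ''.join(chr(i) for i in _SRC),
    ''.join(chr(i - 31) for i in _SRC),
)

# Bytes version: useful when the file is read in binary mode.
_BYTES_TABLE = bytes.maketrans(
    bytes(_SRC),
    bytes(i - 31 for i in _SRC),
)

def phred64_to_phred33(qual):
    # Characters outside the table are left unchanged.
    return qual.translate(_STR_TABLE)

def phred64_to_phred33_bytes(qual):
    return qual.translate(_BYTES_TABLE)
```

Building the table at module level avoids the `global` check on every call; whether that matters for speed would need to be measured.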
