
Better way to parse and search a string?

I have been trying to speed up a basic Python function that takes a line of text and checks it for a substring. The Python program is as follows:

import time

def fun(line):
    l = line.split(" ", 10)
    if 'TTAGGG' in l[9]:
        pass  # Do nothing

line = "FCC2CCMACXX:4:1105:10758:14389# 81 chrM 1 32 10S90M = 16151 16062 CATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTTTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCTTAGGGGATAGCATTG bbb^Wcbbbbccbbbcbccbba]WQG^bbcdcb_^_c_^`ccdddeeeeeffggggiiiiihiiiiihiiihihiiiihghhiihgfgfgeeeeebbb NM:i:1 AS:i:85 XS:i:65 RG:Z:1_DB31"

time0 = time.time()
for i in range(10000):
    fun(line)
print time.time() - time0

I wanted to see whether some of Rust's high-level features could gain some performance, but the code runs considerably slower. The Rust conversion is:

extern crate regex;
extern crate time;
use regex::Regex;

fn main() {
    let line = "FCC2CCMACXX:4:1105:10758:14389# 81 chrM 1 32 10S90M = 16151 16062 CATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTTTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCTTAGGGGATAGCATTG bbb^Wcbbbbccbbbcbccbba]WQG^bbcdcb_^_c_^`ccdddeeeeeffggggiiiiihiiiiihiiihihiiiihghhiihgfgfgeeeeebbb NM:i:1 AS:i:85 XS:i:65 RG:Z:1_DB31";    
    let substring: &str = "TTAGGG";
    let time0: f64 = time::precise_time_s();

    for _ in 0..10000 {
        fun(line, substring);
    }

    let time1: f64 = time::precise_time_s();
    let elapsed: f64 = time1 - time0;
    println!("{}", elapsed);
}


fn fun(line: &str, substring: &str) {
    let l: Vec<&str> = line.split(" ")
                .enumerate()
                .filter(|&(i, _)| i==9)
                .map(|(_, e) | e)
                .collect();

    let re = Regex::new(substring).unwrap();    
    if re.is_match(&l[0]) {
        // Do nothing
    }
}

On my machine, Python times this at 0.0065s versus Rust's 1.3946s.

Checking some basic timings, the line.split() part of the code takes around 1s, and the regex step around 0.4s. Can this really be right, or is there an issue with how I am timing it?

As a baseline, I ran your Python program with Python 2.7.6. Over 10 runs, it had a mean time of 12.2ms with a standard deviation of 443μs. I don't know how you got the very good time of 6.5ms.

Running your Rust code with Rust 1.4.0-dev (febdc3b20), without optimizations, I got a mean of 958ms and a standard deviation of 33ms.

Running your code with optimizations (cargo run --release), I got a mean of 34.6ms and a standard deviation of 495μs. Always do benchmarking in release mode.
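As a rough sketch of a timing harness that needs no external crate: the code below swaps time::precise_time_s for std::time::Instant and wraps the result in std::hint::black_box. Both are standard-library items in current Rust and were not available in the compiler versions discussed in this thread. The name time_runs is hypothetical. black_box is a hint that discourages the optimizer from deleting a loop whose result is never used, which is a risk when the benchmarked body is "do nothing".

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `check` `iterations` times and returns the elapsed wall-clock time
/// together with the number of runs that returned true.
/// `time_runs` is a hypothetical helper name, not part of the original code.
fn time_runs<F: FnMut() -> bool>(iterations: u32, mut check: F) -> (Duration, u32) {
    let start = Instant::now();
    let mut hits = 0u32;
    for _ in 0..iterations {
        // black_box asks the optimizer to treat the result as used.
        if black_box(check()) {
            hits += 1;
        }
    }
    (start.elapsed(), hits)
}
```

Calling time_runs(10000, || ...) with the check under test as the closure and printing the returned Duration with {:?} gives a figure comparable to the ones quoted here, provided the program is built with cargo run --release.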

There are further optimizations you can do:

Compiling the regex once, outside of the timing loop:

fn main() {
    // ...
    let substring = "TTAGGG";
    let re = Regex::new(substring).unwrap();

    // ...

    for _ in 0..10000 {
        fun(line, &re);
    }

    // ...
}

fn fun(line: &str, re: &Regex) {
    // ...
}

Produces a mean of 10.4ms with a standard deviation of 678μs.

Switching to a substring match:

fn fun(line: &str, substring: &str) {
    // ...

    if l[0].contains(substring) {
        // Do nothing
    }
}

Has a mean of 8.7ms and a standard deviation of 334μs.

And finally, if you look at just the one result instead of collecting everything into a vector:

fn fun(line: &str, substring: &str) {
    let col = line.split(" ").nth(9);

    if col.map(|c| c.contains(substring)).unwrap_or(false) {
        // Do nothing
    }
}

Has a mean of 6.30ms and a standard deviation of 114μs.
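Because fun discards its result, it is hard to confirm that the optimized version still finds the match. A minimal sketch of a variant that returns the answer instead is shown below. The name tenth_field_contains is hypothetical, and map_or(false, ...) is used here as an equivalent spelling of map(...).unwrap_or(false).

```rust
/// Returns true when the tenth space-separated field (index 9) exists
/// and contains `substring`; returns false when the line is too short.
/// Hypothetical helper, shown only to make the check observable.
fn tenth_field_contains(line: &str, substring: &str) -> bool {
    line.split(' ')
        .nth(9) // the iterator is lazy, so later fields are never split
        .map_or(false, |field| field.contains(substring))
}
```

A line with fewer than ten fields yields false rather than panicking, unlike indexing l[0] on an empty vector.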

A direct translation of the Python would be:

extern crate time;

fn fun(line: &str) {
    let mut l = line.split(" ");
    if l.nth(9).unwrap().contains("TTAGGG") {
        // do nothing
    }
}

fn main() {
    let line = "FCC2CCMACXX:4:1105:10758:14389# 81 chrM 1 32 10S90M = 16151 16062 CATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTTTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCTTAGGGGATAGCATTG bbb^Wcbbbbccbbbcbccbba]WQG^bbcdcb_^_c_^`ccdddeeeeeffggggiiiiihiiiiihiiihihiiiihghhiihgfgfgeeeeebbb NM:i:1 AS:i:85 XS:i:65 RG:Z:1_DB31";

    let time0 = time::precise_time_s();
    for _ in 0..10000 {
        fun(line);
    }
    println!("{}", time::precise_time_s() - time0);
}

Using cargo run --release on stable (1.2.0), I get about 0.0267s, compared to about 0.0240s for Python (CPython 2.7.10). Given that Python's in operator on strings is just a C routine, this is reasonable.

Impressively, on beta (1.3.0) and nightly (1.4.0) this decreases to just about 0.0122s, or about twice the speed of CPython!
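One detail of the translation: Python's line.split(" ", 10) performs at most 10 splits and so returns at most 11 pieces, with the rest of the line left unsplit in the last one. If that exact behaviour is wanted in Rust, the closest standard-library call is splitn(11, ' '). The sketch below assumes that is the goal, and the name split_like_python is hypothetical.

```rust
/// Mirrors Python's `line.split(" ", 10)`: at most 11 pieces, with the
/// unsplit remainder of the line kept in the final piece.
/// Hypothetical helper name, for illustration only.
fn split_like_python(line: &str) -> Vec<&str> {
    line.splitn(11, ' ').collect()
}
```

For the lookup in this question the cap is not needed, since split(" ").nth(9) already stops after the tenth field, but splitn is the closer match when the collected list itself is required.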


 