
Making Python code faster for processing 24 million records

I am trying to process a pandas DataFrame by applying a function to one of its columns.

The function is:

def separate_string(sentence):
    string_even = ""
    if sentence is not None:
        l = list(sentence)
        list_even = list()
        index = 0    
        for letter in l:
            if index % 2 != 0:
               if abs(ord(letter)-3) < 1114111:
                    list_even.append((chr(abs(ord(letter)-3))))
               string_even = "".join(list_even)
            index += 1
    return(str(string_even))

It is applied to the pandas DataFrame like this:

df['re'] = df.col1.apply(separate_string)

I am running this on a PC with 64 GB of RAM and a 2.19 GHz 7 processor. Why does the code never complete?

If I were you, I'd try Cythonizing your Python code. Essentially, that would compile it to C code that would (hopefully) run orders of magnitude faster.

I think this does what you want. You might have to explicitly return None if you need that rather than an empty string.

A number of things have been removed: the unneeded casts, the manual maintenance of an index, and the test that code points are less than 1114111, which they always will be.

def separate_string(sentence):
    return "".join(chr(abs(ord(letter) -3)) for letter in sentence[1::2])
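
To see why those pieces could be dropped, here is a small sketch (the sample string is arbitrary and not from the question). It checks that the slice `[1::2]` selects the same characters as the manual `index % 2 != 0` test, and that the code point bound can never be exceeded:

```python
import sys

sentence = "abcdef"

# Slicing with [1::2] selects the characters at odd indices (1, 3, 5, ...),
# which is what the original `index % 2 != 0` test did by hand.
odd_chars = sentence[1::2]
manual = "".join(ch for i, ch in enumerate(sentence) if i % 2 != 0)
print(odd_chars == manual)

# sys.maxunicode is the largest valid code point (1114111), so subtracting 3
# and taking the absolute value always stays below 1114111.
print(sys.maxunicode)
print(abs(sys.maxunicode - 3) < 1114111)
```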

We can use timeit to see if we have improved things:

import timeit

setup_orig = '''
test = "This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever."
def separate_string(sentence):
    string_even = ""
    if sentence is not None:
        l = list(sentence)
        list_even = list()
        index = 0    
        for letter in l:
            if index % 2 != 0:
               if abs(ord(letter)-3) < 1114111:
                    list_even.append((chr(abs(ord(letter)-3))))
               string_even = "".join(list_even)
            index += 1
    return(str(string_even))
'''

setup_new = '''
test = "This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever."
def separate_string(sentence):
    return "".join(chr(abs(ord(letter) -3)) for letter in sentence[1::2])
'''

print(timeit.timeit('separate_string(test)', setup=setup_orig, number=100_000))
print(timeit.timeit('separate_string(test)', setup=setup_new, number=100_000))

On my laptop that gives results like:

5.33
0.95

So it seems like it might be worth exploring as part of your solution.
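
One caveat when swapping this into the `apply` call: the original function returns an empty string for None, whereas the one-liner raises a TypeError on anything that cannot be sliced. Below is a minimal sketch of a guarded variant. It assumes that every non-string value should map to an empty string, which may or may not be what you need. A plain list stands in for `df.col1` so the example runs without pandas, and `float("nan")` mimics how pandas often stores missing values.

```python
def separate_string(sentence):
    # Assumption: anything that is not a str (None, float NaN, ...) maps to "".
    if not isinstance(sentence, str):
        return ""
    # Take the characters at odd indices and shift each code point down by 3.
    return "".join(chr(abs(ord(letter) - 3)) for letter in sentence[1::2])


# Stand-in for df.col1 (illustrative data, not from the question).
col1 = ["abcdef", None, float("nan"), "hello world"]
result = [separate_string(value) for value in col1]
print(result)

# With a real DataFrame, the call from the question stays the same:
#     df['re'] = df.col1.apply(separate_string)
```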
