Making Python code faster for processing 24 million records

I am trying to process a pandas DataFrame by applying a function to one of its columns.

The function is:

def separate_string(sentence):
    string_even = ""
    if sentence is not None:
        l = list(sentence)
        list_even = list()
        index = 0    
        for letter in l:
            if index % 2 != 0:
               if abs(ord(letter)-3) < 1114111:
                    list_even.append((chr(abs(ord(letter)-3))))
               string_even = "".join(list_even)
            index += 1
    return(str(string_even))

Pandas dataframe:

df['re'] = df.col1.apply(separate_string)

I am running this on a PC with 64 GB of RAM and a 2.19 GHz 7 processor. Why does the code never complete?

If I were you, I'd try Cythonizing your Python code. That compiles it to C, which would (hopefully) run orders of magnitude faster.

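I think this does what you want. Note that, unlike your version, it raises a TypeError when given None, so add an explicit check (returning None or an empty string, whichever you need) if the column can contain missing values.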

I removed several things: the unneeded casts, the manual index bookkeeping, the join that was re-run inside the loop, and the test that the shifted code points are less than 1114111, since they always will be.

def separate_string(sentence):
    return "".join(chr(abs(ord(letter) -3)) for letter in sentence[1::2])
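
If the column can contain None, a guarded variant might look like the sketch below. The name separate_string_safe and the choice to return an empty string for None (which is what the original function did) are my assumptions, not something the question specifies.

```python
def separate_string(sentence):
    # Take every second character (indices 1, 3, 5, ...) and shift its code point down by 3.
    return "".join(chr(abs(ord(letter) - 3)) for letter in sentence[1::2])


def separate_string_safe(sentence):
    # Hypothetical wrapper: return "" for None, as the original function did.
    if sentence is None:
        return ""
    return separate_string(sentence)


print(separate_string_safe("abcdef"))    # prints _ac
print(repr(separate_string_safe(None)))  # prints ''
```

This only covers None. Missing values that pandas stores as NaN are floats, so they would need their own check.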

We can use timeit to check whether this improves things:

import timeit

setup_orig = '''
test = "This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever."
def separate_string(sentence):
    string_even = ""
    if sentence is not None:
        l = list(sentence)
        list_even = list()
        index = 0    
        for letter in l:
            if index % 2 != 0:
               if abs(ord(letter)-3) < 1114111:
                    list_even.append((chr(abs(ord(letter)-3))))
               string_even = "".join(list_even)
            index += 1
    return(str(string_even))
'''

setup_new = '''
test = "This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever."
def separate_string(sentence):
    return "".join(chr(abs(ord(letter) -3)) for letter in sentence[1::2])
'''

print(timeit.timeit('separate_string(test)', setup=setup_orig, number=100_000))
print(timeit.timeit('separate_string(test)', setup=setup_new, number=100_000))

On my laptop that gives results like this (total seconds for 100,000 calls, original first):

5.33
0.95

That is roughly a 5.6x speedup on this input, so it seems worth exploring as part of your solution.
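
Before running the rewrite over all 24 million rows, it may be worth checking that it returns the same strings as the original. Here is a minimal sketch of such a check. The names separate_string_orig and separate_string_new and the sample strings are mine, and the original is reproduced with only its indentation normalised.

```python
def separate_string_orig(sentence):
    # The question's function, with only the indentation normalised.
    string_even = ""
    if sentence is not None:
        list_even = []
        index = 0
        for letter in list(sentence):
            if index % 2 != 0:
                if abs(ord(letter) - 3) < 1114111:
                    list_even.append(chr(abs(ord(letter) - 3)))
                # Re-joining here on every odd index is repeated work the rewrite avoids.
                string_even = "".join(list_even)
            index += 1
    return str(string_even)


def separate_string_new(sentence):
    # The one-line rewrite: slice the odd indices, shift, join once.
    return "".join(chr(abs(ord(letter) - 3)) for letter in sentence[1::2])


# Illustrative samples, including empty, single-character and non-ASCII input.
samples = ["", "a", "ab", "abcdef", "This eBook is for the use of anyone anywhere",
           "\x00\x01\x02", "héllo wörld"]
mismatches = [s for s in samples if separate_string_orig(s) != separate_string_new(s)]
print(mismatches)  # prints []
```

An empty list means the two versions agree on every sample. Adding a few real values from col1 to the samples would make the check more convincing.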
