Shuffle string data in Python

I have a column with 10 million strings. The characters in each string need to be rearranged in a fixed way.

Original string: AAA01188P001

Shuffled string: 188A1A0AP001

Right now I have a for loop that takes each string and repositions every letter, but this takes hours to complete. Is there a quicker way to achieve this result?

This is the for loop:

for i in range(0, len(OrderProduct)):
    s = list(OrderProduct['OrderProductId'][i])
    a = s[1]
    s[1] = s[7]
    s[7] = a 
    a = s[3]
    s[3] = s[6]
    s[6] = a 
    a = s[2]
    s[2] = s[3]
    s[3] = a 
    a = s[5]
    s[5] = s[0]
    s[0] = a 
    OrderProduct['OrderProductId'][i] = ''.join(s)

I ran a few performance tests using different methods.

Here are the results I got for 1,000,000 shuffles (time in seconds):

188A1AA0P001 usefString 0.518183742
188A1AA0P001 useMap     1.415851829
188A1AA0P001 useConcat  0.5654986979999999
188A1AA0P001 useFormat  0.800639699
188A1AA0P001 useJoin    0.5488918539999998

Based on this, an f-string with hard-coded slices seems to be the fastest.

Here is the code I used to test:

def usefString(s): return f"{s[5:8]}{s[0]}{s[4]}{s[1:4]}{s[8:]}"

posMap = [5,6,7,0,4,1,2,3,8,9,10,11]
def useMap(s): return "".join(map(lambda i:s[i], posMap))

def useConcat(s): return s[5:8]+s[0]+s[4]+s[1:4]+s[8:]

def useFormat(s): return '{}{}{}{}{}'.format(s[5:8],s[0],s[4],s[1:4],s[8:])

def useJoin(s): return "".join([s[5:8],s[0],s[4],s[1:4],s[8:]])

from timeit import timeit
count = 1000000
s = "AAA01188P001"

t = timeit(lambda:usefString(s),number=count)
print(usefString(s),"usefString",t)

t = timeit(lambda:useMap(s),number=count)
print(useMap(s),"useMap",t)

t = timeit(lambda:useConcat(s),number=count)
print(useConcat(s),"useConcat",t)

t = timeit(lambda:useFormat(s),number=count)
print(useFormat(s),"useFormat",t)

t = timeit(lambda:useJoin(s),number=count)
print(useJoin(s),"useJoin",t)
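Note that the slice layout benchmarked above yields 188A1AA0P001 for the sample input, while the question's loop yields 188A1A0AP001, so the slices would need a small adjustment to match the question exactly. The following is a minimal sketch, not part of the original benchmark, that derives the index order from the question's four swaps and checks it against a hard-coded f-string. The names index_map_from_swaps, QUESTION_SWAPS, INDEX_MAP, rearrange and rearrange_sliced are made up for illustration, and the sketch assumes every string has 12 characters.

```python
def index_map_from_swaps(swaps, length):
    """Return a list m where m[k] is the original index of the
    character that ends up at position k after applying the swaps."""
    positions = list(range(length))
    for i, j in swaps:
        positions[i], positions[j] = positions[j], positions[i]
    return positions

# The four swaps performed by the loop in the question, in order.
QUESTION_SWAPS = [(1, 7), (3, 6), (2, 3), (5, 0)]
INDEX_MAP = index_map_from_swaps(QUESTION_SWAPS, 12)

def rearrange(s):
    # Generic version: pick characters in the derived order.
    return "".join([s[i] for i in INDEX_MAP])

def rearrange_sliced(s):
    # Hard-coded version of the same order, in the f-string style
    # that was fastest in the timings above.
    return f"{s[5]}{s[7]}{s[6]}{s[2]}{s[4]}{s[0]}{s[3]}{s[1]}{s[8:]}"

print(INDEX_MAP)
print(rearrange("AAA01188P001"), rearrange_sliced("AAA01188P001"))
```

Deriving the order once and then hard-coding it keeps the fast path simple, while the generic version serves as a check that the hard-coded indices were copied correctly.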

Performance (added by @jezrael):

import pandas as pd

N = 1000000
OrderProduct = pd.DataFrame({'OrderProductId':['AAA01188P001'] * N})

In [331]: %timeit [f'{s[5:8]}{s[0]}{s[4]}{s[1:4]}{s[8:]}' for s in OrderProduct['OrderProductId']]
527 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [332]: %timeit [s[5:8]+s[0]+s[4]+s[1:4]+s[8:] for s in OrderProduct['OrderProductId']]
610 ms ± 18.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [333]: %timeit ['{}{}{}{}{}'.format(s[5:8],s[0],s[4],s[1:4],s[8:]) for s in OrderProduct['OrderProductId']]
954 ms ± 76.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [334]: %timeit ["".join([s[5:8],s[0],s[4],s[1:4],s[8:]]) for s in OrderProduct['OrderProductId']]
594 ms ± 10.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
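The timings above only build the list of rearranged strings. As a hedged sketch of how the result might be written back, the list can be assigned to the column in a single step instead of setting OrderProduct['OrderProductId'][i] row by row as the question's loop does. The helper name rearrange and the two-row sample frame are assumptions for illustration; the index order used is the one that reproduces the question's swaps.

```python
import pandas as pd

def rearrange(s):
    # Index order that reproduces the question's swaps
    # (5, 7, 6, 2, 4, 0, 3, 1), followed by the untouched tail.
    return f"{s[5]}{s[7]}{s[6]}{s[2]}{s[4]}{s[0]}{s[3]}{s[1]}{s[8:]}"

# Small stand-in for the 10-million-row column.
OrderProduct = pd.DataFrame({"OrderProductId": ["AAA01188P001", "BCD23456Q789"]})

# Build the whole column in one pass and assign it once.
OrderProduct["OrderProductId"] = [rearrange(s) for s in OrderProduct["OrderProductId"]]

print(OrderProduct)
```

Assigning the full list once avoids the per-row indexing and assignment overhead of the original loop.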

Can you just reconstruct the string with slices, if that logic is consistent?

s = OrderProduct['OrderProductId'][i]
# index order matches the swaps in the question; s[8:] keeps the unchanged tail
new_s = s[5]+s[7]+s[6]+s[2]+s[4]+s[0]+s[3]+s[1]+s[8:]

or as a format string:

new_s = '{}{}{}{}{}{}{}'.format(s[5],s[7]...)

Edit: +1 for Dave's suggestion of using ''.join() on a list instead of concatenation.

If you just want to shuffle the strings (no particular logic), you can do that in several ways:

Using string_utils:

import string_utils
print(string_utils.shuffle("random_string"))

Using built-in methods:

import random
str_var = list("shuffle_this_string")
random.shuffle(str_var)  # shuffles the list in place
print(''.join(str_var))

Using numpy:

import numpy
str_var = list("shuffle_this_string")
numpy.random.shuffle(str_var)  # shuffles the list in place
print(''.join(str_var))
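For the random case, a variant that avoids the intermediate in-place shuffle is random.sample, which returns a new list of the requested length. This is a minimal sketch rather than part of the original answer; the function name shuffled and the seed value are assumptions chosen for illustration.

```python
import random

def shuffled(s, rng=random):
    # random.sample returns len(s) characters drawn without
    # replacement, i.e. a random permutation of the string.
    return "".join(rng.sample(s, len(s)))

# A seeded generator makes the result repeatable between runs.
rng = random.Random(42)
result = shuffled("shuffle_this_string", rng)
print(result)
```

Passing a dedicated random.Random instance keeps the shuffle reproducible without touching the global random state.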

But if you need to do so with a certain logic (e.g. put each element in a specific position), you can do this:

s = 'some_string'
# s[i] indexes the string directly; there is no need to build list(s) for every character
s = ''.join([s[i] for i in [1,6,2,7,9,4,0,8,5,10,3]])
print(s)

Output:

otmrn_sisge
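The same fixed reordering can also be expressed with operator.itemgetter from the standard library, which builds a callable once and reuses it for every string. This is a sketch under the assumption that all strings have the same length as the index list; the names ORDER, pick and reorder are made up for illustration.

```python
from operator import itemgetter

# Position k of the result takes the character at ORDER[k] of the input.
ORDER = [1, 6, 2, 7, 9, 4, 0, 8, 5, 10, 3]

# Guard against typos: ORDER must use every index exactly once.
assert sorted(ORDER) == list(range(len(ORDER)))

# itemgetter with several indices returns a tuple of the selected items.
pick = itemgetter(*ORDER)

def reorder(s):
    if len(s) != len(ORDER):
        raise ValueError("string length does not match the index list")
    return "".join(pick(s))

print(reorder("some_string"))
```

Checking that the index list is a true permutation up front catches a duplicated or missing position before it silently corrupts millions of strings.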

If this still takes too long, you can use multiprocessing, like this:

from multiprocessing import Pool

def shuffle_str(s):
    # do the shuffling here and return the result,
    # e.g. the rearrangement from the question:
    return s[5]+s[7]+s[6]+s[2]+s[4]+s[0]+s[3]+s[1]+s[8:]

if __name__ == '__main__':
    list_of_strings = [...]  # placeholder: the strings to shuffle
    # 4 is the number of workers; usually set to the number of CPU cores
    with Pool(4) as p:
        list_of_results = p.map(shuffle_str, list_of_strings)
