
Is there any faster way in Python to split strings into sublists in a list with 1 million elements?

I'm trying to help my friend clean an order-list dataframe with one million elements.

[screenshot of the dataframe: an order_id column and a product_name column containing list-like strings]

You can see that the product_name column should be a list, but the values are strings. So I want to split them into sublists.

Here's my code:

order_ls = raw_df['product_name'].tolist()
cln_order_ls = list()
for i in order_ls:
    i = i.replace('[', '')
    i = i.replace(']', '')
    i = i.replace('\'', '')
    cln_order_ls.append(i)

new_cln_order_ls = list()
for i in cln_order_ls:
    new_cln_order_ls.append(i.split(', '))

But the 'split' part takes a lot of time to process. I'm wondering, is there a faster way to deal with it?

Thanks~

EDIT

(I did not like my last answer; it was too confusing, so I reordered it and tested it a little more systematically.)

Long story short:

For speed, just use:

def str_to_list(s):
    return s[1:-1].replace('\'', '').split(', ')


df['product_name'].apply(str_to_list).to_list()
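For example, on a single value like the ones in the question (a minimal sketch):

str_to_list("['C1', 'C2', 'None']")
# -> ['C1', 'C2', 'None']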

Long story long:

Let's dissect your code:

order_ls = raw_df['product_name'].tolist()
cln_order_ls = list()
for i in order_ls:
    i = i.replace('[', '')
    i = i.replace(']', '')
    i = i.replace('\'', '')
    cln_order_ls.append(i)

new_cln_order_ls = list()
for i in cln_order_ls:
    new_cln_order_ls.append(i.split(', '))

What you would really like is a function, say str_to_list(), which converts your input str to a list.

For some reason, you do it in multiple steps, but this is really not necessary. What you have so far can be rewritten as:

def str_to_list_OP(s):
    return s.replace('[', '').replace(']', '').replace('\'', '').split(', ')

If you can assume that [ and ] are always the first and last characters of your string, you can simplify this to:

def str_to_list(s):
    return s[1:-1].replace('\'', '').split(', ')

which should also be faster.
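To check this on your own machine, a quick micro-benchmark sketch (the sample string is an assumption; substitute one of your real values):

import timeit

s = "['C1', 'C2', 'None']"
# time both variants on the same input
print(timeit.timeit(lambda: str_to_list_OP(s), number=100_000))
print(timeit.timeit(lambda: str_to_list(s), number=100_000))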

Alternative approaches would use regular expressions, e.g.:

import re

# compile the pattern once, instead of on every call
_strip_regex = re.compile(r'[\[\]\']')

def str_to_list_regex(s):
    return _strip_regex.sub('', s).split(', ')

Note that all approaches so far use split(). This is quite a fast implementation that approaches C speed, and hardly any pure-Python construct would beat it.

All these methods are quite unsafe, as they do not properly take escaping into account; e.g., all of the above would fail on the following valid Python literal:

['ciao', "pippo", 'foo, bar']

More robust alternatives in this scenario would be:

  1. ast.literal_eval(), which works for any valid Python literal
  2. json.loads(), which requires valid JSON strings, so it is not really an option here (the sample strings use single quotes, which are not valid JSON)
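To make the failure mode concrete, a minimal sketch using the string above:

import ast

s = "['ciao', \"pippo\", 'foo, bar']"
str_to_list(s)       # ['ciao', '"pippo"', 'foo', 'bar'] -- 'foo, bar' gets split, double quotes survive
ast.literal_eval(s)  # ['ciao', 'pippo', 'foo, bar'] -- parsed correctly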

The speed for these solutions is compared here:

[benchmark plot 1: runtime vs. input size for str_to_list_OP, str_to_list, str_to_list_regex and ast.literal_eval]

As you can see, safety comes at the price of speed.

These graphs are generated using these scripts, with the following code:

import ast  # needed for ast.literal_eval in funcs below


def gen_input(n):
    return str([str(x) for x in range(n)])


def equal_output(a, b):
    return a == b


input_sizes = (5, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000)  
funcs = str_to_list_OP, str_to_list, str_to_list_regex, ast.literal_eval 


runtimes, input_sizes, labels, results = benchmark(
    funcs, gen_input=gen_input, equal_output=equal_output,
    input_sizes=input_sizes)

Now let's concentrate on the looping. What you do is explicit looping, and we know that Python is typically not terribly fast at that. However, looping in a comprehension can be faster, because it can generate more optimized code. Another approach would be to use a vectorized expression built from Pandas primitives, either using apply() or .str chaining.

The following timings were obtained, indicating that comprehensions are the fastest for smaller inputs, although the vectorized solution (using apply()) catches up and eventually surpasses the comprehension:

[benchmark plot 2: runtime vs. input size for func_OP, func_QuangHoang, func_apply_df and func_compr]

The following test functions were used:

import pandas as pd


def str_to_list(s):
    return s[1:-1].replace('\'', '').split(', ')


def func_OP(df):
    order_ls = df['product_name'].tolist()
    cln_order_ls = list()
    for i in order_ls:
        i = i.replace('[', '')
        i = i.replace(']', '')
        i = i.replace('\'', '')
        cln_order_ls.append(i)
    new_cln_order_ls = list()
    for i in cln_order_ls:
        new_cln_order_ls.append(i.split(', '))
    return new_cln_order_ls


def func_QuangHoang(df):
    return df['product_name'].str[1:-1].str.replace('\'','').str.split(', ').to_list()


def func_apply_df(df):
    return df['product_name'].apply(str_to_list).to_list()


def func_compr(df):
    return [str_to_list(s) for s in df['product_name']]

with the following test code:

def gen_input(n):
    return pd.DataFrame(
        columns=('order_id', 'product_name'),
        data=[[i, "['ciao', 'pippo', 'foo', 'bar', 'baz']"] for i in range(n)])


def equal_output(a, b):
    return a == b


input_sizes = (5, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000)  
funcs = func_OP, func_QuangHoang, func_apply_df, func_compr 


runtimes, input_sizes, labels, results = benchmark(
    funcs, gen_input=gen_input, equal_output=equal_output,
    input_sizes=input_sizes)

again using the same base scripts as before.

How about:

(df['product_name']
   .str[1:-1]
   .str.replace('\'','')
   .str.split(', ')
)
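If you want a plain list of lists rather than a Series (assuming the raw_df from the question), you can chain .to_list() at the end:

(raw_df['product_name']
   .str[1:-1]
   .str.replace('\'','')
   .str.split(', ')
   .to_list()
)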

Try this:

import ast

raw_df['product_name'] = raw_df['product_name'].apply(ast.literal_eval)

Like anky_91, I was curious about the list comp, so I gave it a try. I run the list comp directly on the ndarray, to save the time spent calling tolist:

n = raw_df['product_name'].values
[x[1:-1].replace('\'', '').split(', ') for x in n]

Sample data:

In [1488]: raw_df.values
Out[1488]:
array([["['C1', 'None', 'None']"],
       ["['C1', 'C2', 'None']"],
       ["['C1', 'C1', 'None']"],
       ["['C1', 'C2', 'C3']"]], dtype=object)


In [1491]: %%timeit
      ...: n = raw_df['product_name'].values
      ...: [x[1:-1].replace('\'', '').split(', ') for x in n]
      ...:
16.2 µs ± 614 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [1494]: %timeit my_func_2b(raw_df)
36.1 µs ± 489 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [1493]: %timeit my_func_2(raw_df)
39.1 µs ± 2.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [1492]: %timeit raw_df['product_name'].str[1:-1].str.replace('\'','').str.split(', ').tolist()
765 µs ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So, the list comp with chained replace and split is the fastest; its speed is about twice that of the next one. However, the time saved actually comes from using the ndarray without calling tolist. If I add tolist, the difference is not significant.
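As a rough way to see the tolist effect yourself (a sketch, assuming the same raw_df as above):

%timeit [x[1:-1].replace('\'', '').split(', ') for x in raw_df['product_name'].values]
%timeit [x[1:-1].replace('\'', '').split(', ') for x in raw_df['product_name'].tolist()]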
