
Subset a list in Python on a pre-defined string

I have some extremely large lists of character strings I need to parse. I need to break them into smaller lists based on a pre-defined character string, and I figured out a way to do it, but I worry that this will not be performant on my real data. Is there a better way to do this?

My goal is to turn this list:

['a', 'b', 'string_to_split_on', 'c', 'd', 'e', 'f', 'g', 'string_to_split_on', 'h', 'i', 'j', 'k', 'string_to_split_on']

Into this list:

[['a', 'b'], ['c', 'd', 'e', 'f', 'g'], ['h', 'i', 'j', 'k']]

What I tried:

# List that replicates my data.  `string_to_split_on` is a fixed character string I want to break my list up on 
my_list = ['a', 'b', 'string_to_split_on', 'c', 'd', 'e', 'f', 'g', 'string_to_split_on', 'h', 'i', 'j', 'k', 'string_to_split_on']

# Inspect List
print(my_list)

# Create empty lists to store data in
new_list = []
good_letters = []

# Iterate over each string in the list
for i in my_list:

    # If the string is the separator, append data to new_list, reset `good_letters` and move to the next string
    if i == 'string_to_split_on':
        new_list.append(good_letters)
        good_letters = []
        continue

    # Append letter to the list of good letters
    else:
        good_letters.append(i)



# I just like printing things this way because it's easy to read
for item in new_list:
    print(item)
    print('-'*100)


### Output
['a', 'b', 'string_to_split_on', 'c', 'd', 'e', 'f', 'g', 'string_to_split_on', 'h', 'i', 'j', 'k', 'string_to_split_on']
['a', 'b']
----------------------------------------------------------------------------------------------------
['c', 'd', 'e', 'f', 'g']
----------------------------------------------------------------------------------------------------
['h', 'i', 'j', 'k']
----------------------------------------------------------------------------------------------------
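One thing to watch in the loop above: it only saves a group when it reaches a separator, so any items after the last separator are silently dropped. The sample list happens to end with the separator, which hides this. Below is a minimal sketch of the same loop wrapped in a function that also keeps a trailing group. The name `split_on` and the shortened sample data are made up for illustration.

```python
def split_on(items, separator):
    """Split `items` into sublists at each occurrence of `separator`."""
    groups = []
    current = []
    for item in items:
        if item == separator:
            # Close the current group and start a new one.
            groups.append(current)
            current = []
        else:
            current.append(item)
    # Keep whatever follows the last separator instead of discarding it.
    if current:
        groups.append(current)
    return groups


data = ['a', 'b', 'string_to_split_on', 'c', 'd', 'string_to_split_on', 'e']
print(split_on(data, 'string_to_split_on'))
# [['a', 'b'], ['c', 'd'], ['e']]
```

As in the original loop, two separators in a row produce an empty sublist. Whether that is wanted depends on the data.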

You can also use one line of code, provided no element contains whitespace or the separator text as a substring:

original_list = ['a', 'b', 'string_to_split_on', 'c', 'd', 'e', 'f', 'g', 'string_to_split_on', 'h', 'i', 'j', 'k', 'string_to_split_on']
split_string = 'string_to_split_on'

new_list = [sublist.split() for sublist in ' '.join(original_list).split(split_string) if sublist]
print(new_list)
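The join-and-split one-liner treats the list as text, so it can only be trusted when elements contain no whitespace. Here is a small sketch of what happens otherwise. The sample values are invented for illustration.

```python
items = ['new york', 'string_to_split_on', 'los angeles', 'string_to_split_on']
sep = 'string_to_split_on'

# Same one-liner as above, applied to elements that contain spaces.
joined_result = [chunk.split() for chunk in ' '.join(items).split(sep) if chunk]
print(joined_result)
# [['new', 'york'], ['los', 'angeles']]
```

Each original element has been broken into two, because `str.split()` cannot tell the spaces added by `' '.join` from the spaces that were already inside an element.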

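For a large data set, itertools.groupby is likely a better fit, because it compares whole elements and avoids building one big intermediate string (this reuses original_list and split_string from above):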

import itertools

new_list = [list(j) for k, j in itertools.groupby(original_list, lambda x: x != split_string) if k]
print(new_list)

[['a', 'b'], ['c', 'd', 'e', 'f', 'g'], ['h', 'i', 'j', 'k']]
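Whether groupby is actually faster than the plain loop depends on the data and the Python version, so it is worth measuring. Below is a rough timing sketch using the standard `timeit` module. The helper names `loop_split`, `groupby_split` and `make_data`, and the data sizes, are assumptions chosen for illustration. No timings are shown because they vary by machine.

```python
import itertools
import timeit


def loop_split(items, separator):
    # The explicit loop from the question.
    groups, current = [], []
    for item in items:
        if item == separator:
            groups.append(current)
            current = []
        else:
            current.append(item)
    return groups


def groupby_split(items, separator):
    # The itertools.groupby approach from the answer.
    return [list(group)
            for is_data, group in itertools.groupby(items, lambda x: x != separator)
            if is_data]


def make_data(n_chunks, chunk_size, separator):
    # Synthetic input: n_chunks runs of 'x', each followed by the separator.
    data = []
    for _ in range(n_chunks):
        data.extend(['x'] * chunk_size)
        data.append(separator)
    return data


sep = 'string_to_split_on'
data = make_data(2000, 9, sep)

# Both approaches agree on this input.
same = loop_split(data, sep) == groupby_split(data, sep)
print(same)

for func in (loop_split, groupby_split):
    seconds = timeit.timeit(lambda: func(data, sep), number=5)
    print(f'{func.__name__}: {seconds:.4f} s for 5 runs')
```

The two versions differ on edge cases. With consecutive separators, the loop emits an empty sublist while groupby emits nothing. The groupby version also keeps items after the last separator, which the loop drops.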

