
Splitting list into sublists by given separator in python

I'm trying to build n-grams that don't cross a period symbol. str.split() only works on strings, and list[index] only accepts an index. Is there a way to split a list by giving it an element to split on? Here is a snippet of my current function:

text = ["split","this","stuff",".","my","dear"]

def generate_ngram(rawlist, ngram_order):
    """
    Input: List of words or characters, ngram-order ["this", "is", "an", "example"], 2
    Output: Set of tuples of words or characters {("this", "is"),("is","an"),...}
    """

    list_of_tuples = []
    for i in range(0, len(rawlist) - ngram_order + 1):
        ngram_order_index = i + ngram_order    
        generated_ngram = rawlist[i : ngram_order_index]

        #if "." in generated_ngram:
            #generated_ngram . . . 

        generated_tuple = tuple(generated_ngram)  
        list_of_tuples.append(generated_tuple)

    return set(list_of_tuples)

generate_ngram(text,3)

currently returns:

{('.', 'my', 'dear'),
 ('stuff', '.', 'my'),
 ('split', 'this', 'stuff'),
 ('this', 'stuff', '.')}

but it should ideally return:

{('split', 'this', 'stuff'),
 ('this', 'stuff', '.')}

Any idea on how to achieve this? Thanks for your help!

I'm not sure if this is exactly what you need, but this function generates n-grams in which a stop word (in this case the period) may appear only in the last position:

STOPWORDS = {"."}

def generate_ngram(rawlist, ngram_order):
    # All ngrams
    ngrams = zip(*(rawlist[i:] for i in range(ngram_order)))
    # Generate only those ngrams that do not contain stop words before the end
    return (ngram for ngram in ngrams if not any(w in STOPWORDS for w in ngram[:-1]))

text = ["split", "this", "stuff", ".", "my", "dear"]
print(*generate_ngram(text, 3), sep="\n")
# ('split', 'this', 'stuff')
# ('this', 'stuff', '.')
print(*generate_ngram(text, 2), sep="\n")
# ('split', 'this')
# ('this', 'stuff')
# ('stuff', '.')
# ('my', 'dear')

Note that this function returns a generator. You can convert it to a list by wrapping it in list(...), or you can iterate over it directly.
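One thing worth keeping in mind with the generator version: it can only be consumed once. The sketch below repeats the function from above so it runs on its own, collects the result into a set (as the function in the question did), and shows that a second pass over the same generator yields nothing. The names `gen`, `as_set` and `leftover` are just illustrative.

```python
STOPWORDS = {"."}

def generate_ngram(rawlist, ngram_order):
    # Same generator as above, repeated so this snippet is self-contained
    ngrams = zip(*(rawlist[i:] for i in range(ngram_order)))
    return (ngram for ngram in ngrams if not any(w in STOPWORDS for w in ngram[:-1]))

text = ["split", "this", "stuff", ".", "my", "dear"]

gen = generate_ngram(text, 3)
as_set = set(gen)      # consumes the generator
leftover = list(gen)   # the generator is exhausted now, so this is empty

print(as_set)
print(leftover)
# []
```

If you need the n-grams more than once, store the list or set rather than the generator itself.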

EDIT: You may find the equivalent syntax below more readable.

def generate_ngram(rawlist, ngram_order):
    # Iterate over all ngrams
    for ngram in zip(*(rawlist[i:] for i in range(ngram_order))):
        # Yield only those not containing stop words before the end
        if not any(w in STOPWORDS for w in ngram[:-1]):
            yield ngram
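If you would rather split the list on the separator first, as the question title asks, here is a minimal sketch of that alternative. It is not the approach used above: the helper names `split_after` and `ngrams_per_chunk` are made up for illustration, and it assumes each period should stay at the end of its own sublist so that n-grams may end with it.

```python
def split_after(items, separator):
    """Split items into sublists, closing a sublist after each separator."""
    chunks, current = [], []
    for item in items:
        current.append(item)
        if item == separator:
            chunks.append(current)
            current = []
    if current:  # trailing items that are not followed by a separator
        chunks.append(current)
    return chunks

def ngrams_per_chunk(items, n, separator="."):
    """Build n-grams inside each sublist, so none of them crosses a separator."""
    result = set()
    for chunk in split_after(items, separator):
        for i in range(len(chunk) - n + 1):
            result.add(tuple(chunk[i:i + n]))
    return result

text = ["split", "this", "stuff", ".", "my", "dear"]
print(split_after(text, "."))
# [['split', 'this', 'stuff', '.'], ['my', 'dear']]
print(ngrams_per_chunk(text, 3) == {("split", "this", "stuff"), ("this", "stuff", ".")})
# True
```

On the sample input this gives the same n-grams as the filtering version, and the intermediate sublists are available if you need them for something else.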
