Python 2.7: Extract slice from a list based on a pattern in a Pythonic way

I have a large data set stored in a list of short strings. Hidden inside the list are slices of length 5 that match a certain pattern:

[<date>, <date>, <4 digit integer>, <string>, <$ amount>]

How can I extract these slices from my data set? They can occur at any position (so their index is not guaranteed to be a multiple of 5) and are interspersed with other data (also strings) that can match part of the pattern.

I started with something similar to:

for item in data:
    if re.search(<date pattern>, item):
        if not date1:
            date1 = item
        else:
            date2 = item
    if re.search(<4 digit integer pattern>, item):
        if date1 and date2 and not fourdigit:
            fourdigit = item
        else:
            date1 = None
            date2 = None
    ....

But this is very complicated, error-prone and not Pythonic at all.

The next approach was to extract a sliding window of 5 items from the list of data and check that all items match their pattern. If not, increment the index by 1 (i.e. slide the window by 1) and check the next slice. If the pattern matches, save the slice and increment the index by 5. Something like:

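matching_items = []
index = 0
while index <= len(data) - 5:  # <= so that the last full window is checked too
    sliceof5 = data[index:index+5]
    if slice_matches_pattern(sliceof5):
        matching_items.append(sliceof5)
        index += 5  # skip past the slice that was just collected
    else:
        index += 1  # slide the window by one item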

This works, and it is a lot easier to implement and less error-prone than the previous solution, but it doesn't seem very Pythonic either.
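The helper slice_matches_pattern is not defined above. A minimal sketch of what it might look like, assuming each of the five fields can be checked with a regular expression (the concrete patterns and the name ITEM_PATTERNS are guesses for illustration, not taken from the real data):

```python
import re

# Assumed per-item patterns, in slice order; adjust them to the real data.
# Each pattern is wrapped so that it has to match the whole item.
ITEM_PATTERNS = [re.compile('(?:' + p + r')\Z') for p in (
    r'[0-9]{4}-[0-9]{2}-[0-9]{2}',  # date
    r'[0-9]{4}-[0-9]{2}-[0-9]{2}',  # date
    r'[0-9]{4}',                    # 4 digit integer
    r'[a-zA-Z]+',                   # string
    r'\$[0-9,.]+',                  # $ amount
)]


def slice_matches_pattern(sliceof5):
    """Return True if the slice has 5 items and every item matches its pattern."""
    if len(sliceof5) != len(ITEM_PATTERNS):
        return False
    return all(p.match(item) for p, item in zip(ITEM_PATTERNS, sliceof5))
```

Anchoring each pattern with \Z keeps an item such as '12345' from passing as a 4 digit integer.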

Is it maybe possible to do this using a list comprehension? Something like:

matching_items = [sliceof5 for sliceof5 in data if slice_matches_pattern(sliceof5)]

But then, how do I make the for in the list comprehension sometimes skip forward by 1 and sometimes by 5?

Are there maybe other, Pythonic ways to achieve this?

Your second solution seems fine. I would turn it into a generator (yield the slices you find), but nothing more.
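As a rough sketch of the generator idea (the names find_slices, matches and size are made up for illustration; matches stands in for the question's slice_matches_pattern):

```python
def find_slices(data, matches, size=5):
    """Yield each non-overlapping window of `size` items accepted by `matches`."""
    index = 0
    while index <= len(data) - size:
        window = data[index:index + size]
        if matches(window):
            yield window
            index += size  # jump past the window that was just yielded
        else:
            index += 1     # slide the window by one item
```

It could then be used as matching_items = list(find_slices(data, slice_matches_pattern)), or iterated lazily if the data set is large.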

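You can probably make it run faster, though, by first checking whether the second item of the slice is a date. If it is not, no match can start at the current index or at the next one (both would need a date in that position), so you can add two to the index.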
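A sketch of that shortcut, under the assumption that dates look like 2016-12-01 (DATE_RE and find_slices_fast are hypothetical names, and matches again stands in for slice_matches_pattern):

```python
import re

DATE_RE = re.compile(r'[0-9]{4}-[0-9]{2}-[0-9]{2}\Z')  # assumed date format


def find_slices_fast(data, matches, size=5):
    """Like the plain sliding window, but skips two items when possible."""
    index = 0
    while index <= len(data) - size:
        if not DATE_RE.match(data[index + 1]):
            # data[index + 1] would have to be the second date of a slice
            # starting at index, or the first date of a slice starting at
            # index + 1. It is not a date, so neither slice can match.
            index += 2
            continue
        window = data[index:index + size]
        if matches(window):
            yield window
            index += size
        else:
            index += 1
```

Whether this is actually faster depends on how often the second item fails the date check, so it is worth timing on the real data.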

Of course, if you can turn everything into one big regular expression that matches your entire pattern, you'll do even better.

Without knowing what your data looks like, here is a regex approach that follows @hpaulj's idea.

import re

data = [
    '2016-12-01', '2016-12-02', '1234', 'spam', '$100',  # collect
    '2016-12-02', 'spam',  # discard
    '2016-12-01', '2016-12-02', '1234', 'spam', '$100',  # collect
    '1234', '2016-12-01',  # discard
    '2016-12-01', '2016-12-02', '1234', 'spam', '$100',  # collect
    '$100', '1234', '1234',  # discard
    '2016-12-01', '2016-12-02', '1234', 'spam', '$100'  # collect
]

pattern_sep_str = '||'  # change to something unique in the data

pattern_sep = re.escape(pattern_sep_str)
date_pattern = r'[0-9]{4}-[0-9]{2}-[0-9]{2}'
int_pattern = r'[0-9]{4}'
str_pattern = r'[a-zA-Z]+'
amount_pattern = r'\$[0-9,.]+'

pattern_combined = ''.join([
    '(', date_pattern, pattern_sep, date_pattern, pattern_sep,
    int_pattern, pattern_sep, str_pattern, pattern_sep,
    amount_pattern, ')'
])

results = re.findall(pattern_combined, pattern_sep_str.join(data))

print([x.split(pattern_sep_str) for x in results])

Output:

[['2016-12-01', '2016-12-02', '1234', 'spam', '$100'], ['2016-12-01', '2016-12-02', '1234', 'spam', '$100'], ['2016-12-01', '2016-12-02', '1234', 'spam', '$100'], ['2016-12-01', '2016-12-02', '1234', 'spam', '$100']]
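One caveat: the combined pattern above is not tied to item boundaries, so an item such as 'x2016-12-01' or '$100abc' could still contribute to a match. A possible variant, sketched under the same assumptions about the data, requires a separator (or the start or end of the joined text) on both sides. The names SEP, anchored and extract_records are illustrative only:

```python
import re

SEP = '||'  # assumed not to occur inside any item
sep = re.escape(SEP)
date = r'[0-9]{4}-[0-9]{2}-[0-9]{2}'
record = sep.join([date, date, r'[0-9]{4}', r'[a-zA-Z]+', r'\$[0-9,.]+'])

# A record must start at the beginning of the text or right after a
# separator, and must be followed by a separator or the end of the text.
anchored = re.compile(
    r'(?:\A|' + sep + r')(' + record + r')(?=' + sep + r'|\Z)'
)


def extract_records(data):
    """Return the 5-item groups found in `data`, a list of strings."""
    joined = SEP.join(data)
    return [m.split(SEP) for m in anchored.findall(joined)]
```

The trailing boundary is a lookahead, so the separator after one record is still available as the leading separator of the next.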


 