How to filter list based on multiple conditions?

I have the following lists:

target_list = ["FOLD/AAA.RST.TXT"]

and

mylist = 
[
  "FOLD/AAA.RST.12345.TXT",
  "FOLD/BBB.RST.12345.TXT",
  "RUNS/AAA.FGT.12345.TXT",
  "FOLD/AAA.RST.87589.TXT",
  "RUNS/AAA.RST.11111.TXT"
]

How can I filter only those records of mylist that correspond to target_list? The expected result is:

  "FOLD/AAA.RST.12345.TXT"
  "FOLD/AAA.RST.87589.TXT"

The following mask is used to filter mylist:

xxx/yyy.zzz.nnn.txt

If xxx, yyy and zzz coincide with an entry of target_list, then the record should be selected. Otherwise it should be dropped from the result.

How can I solve this task without using a for loop?

selected_list = []
for t in target_list:
   r1 = t.split("/")[0]
   a1 = t.split("/")[1].split(".")[0]
   b1 = t.split("/")[1].split(".")[1]

   for l in mylist:
      r2 = l.split("/")[0]
      a2 = l.split("/")[1].split(".")[0]
      b2 = l.split("/")[1].split(".")[1]

      if r1 == r2 and a1 == a2 and b1 == b2:
         selected_list.append(l)

Define a function to filter values:

target_list = ["FOLD/AAA.RST.TXT"]

def keep(path):
    template = get_template(path)
    return template in target_list

def get_template(path):
    front, numbers, ext = path.rsplit('.', 2)
    template = '.'.join([front, ext])
    return template

This uses str.rsplit, which splits the string on the given separator (a period in this case), starting from the right. The second argument, 2, is the maximum number of splits to perform. Two splits give us three parts: the front, the numbers, and the extension:

>>> 'FOLD/AAA.RST.12345.TXT'.rsplit('.', 2)
['FOLD/AAA.RST', '12345', 'TXT']

We assign these to front, numbers and ext.

We then build a string again using str.join:

>>> '.'.join(['FOLD/AAA.RST', 'TXT'])
'FOLD/AAA.RST.TXT'

So this is what get_template returns:

>>> get_template('FOLD/AAA.RST.12345.TXT')
'FOLD/AAA.RST.TXT'

We can use it like so:

mylist = [
    "FOLD/AAA.RST.12345.TXT",
    "FOLD/BBB.RST.12345.TXT",
    "RUNS/AAA.FGT.12345.TXT",
    "FOLD/AAA.RST.87589.TXT",
    "RUNS/AAA.RST.11111.TXT"
]

from pprint import pprint

# In Python 3, filter returns a lazy iterator, so convert it to a list before printing
pprint(list(filter(keep, mylist)))

Output:

['FOLD/AAA.RST.12345.TXT', 'FOLD/AAA.RST.87589.TXT']
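
Since the question asks for a solution without an explicit for loop, the same check can also be written as a list comprehension. This is only a sketch of an equivalent form: it reuses get_template from above and the lists from the question, and the name selected is chosen here for illustration.

```python
def get_template(path):
    # "FOLD/AAA.RST.12345.TXT" -> "FOLD/AAA.RST.TXT"
    front, _numbers, ext = path.rsplit(".", 2)
    return ".".join([front, ext])

target_list = ["FOLD/AAA.RST.TXT"]
mylist = [
    "FOLD/AAA.RST.12345.TXT",
    "FOLD/BBB.RST.12345.TXT",
    "RUNS/AAA.FGT.12345.TXT",
    "FOLD/AAA.RST.87589.TXT",
    "RUNS/AAA.RST.11111.TXT",
]

# Keep every path whose template (the path without its numeric part) is a target
selected = [path for path in mylist if get_template(path) in target_list]
print(selected)
# ['FOLD/AAA.RST.12345.TXT', 'FOLD/AAA.RST.87589.TXT']
```

Note that get_template assumes every path contains at least two periods; a path with fewer would raise a ValueError when unpacking.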

You can define a "filter-making function" that preprocesses the target list. The advantages of this are:

  • Does minimal work by caching information about target_list in a set: the total time is O(len(target_list)) + O(len(mylist)), since set lookups are O(1) on average.
  • Does not use global variables, so it is easily testable.
  • Does not use nested for loops.

def prefixes(target):
    """ 
    >>> prefixes("FOLD/AAA.RST.TXT")
    ('FOLD', 'AAA', 'RST')

    >>> prefixes("FOLD/AAA.RST.12345.TXT")
    ('FOLD', 'AAA', 'RST')
    """
    x, rest = target.split('/')
    y, z, *_ = rest.split('.')
    return x, y, z

def matcher(target_list):
    targets = set(prefixes(target) for target in target_list)
    def is_target(t):
        return prefixes(t) in targets
    return is_target

Then, you could do:

>>> list(filter(matcher(target_list), mylist))
['FOLD/AAA.RST.12345.TXT', 'FOLD/AAA.RST.87589.TXT']
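
The benefit of the set shows up once there is more than one target. Here is a small sketch of that case; the second target, "RUNS/AAA.FGT.TXT", is made up for illustration and does not appear in the question, and matcher is written with a lambda only to keep the example short.

```python
def prefixes(target):
    # "FOLD/AAA.RST.12345.TXT" -> ("FOLD", "AAA", "RST")
    x, rest = target.split("/")
    y, z, *_ = rest.split(".")
    return x, y, z

def matcher(target_list):
    # Precompute the target prefixes once; each lookup is then O(1) on average
    targets = {prefixes(target) for target in target_list}
    return lambda path: prefixes(path) in targets

# Hypothetical extended target list (the second entry is an assumption)
more_targets = ["FOLD/AAA.RST.TXT", "RUNS/AAA.FGT.TXT"]
mylist = [
    "FOLD/AAA.RST.12345.TXT",
    "FOLD/BBB.RST.12345.TXT",
    "RUNS/AAA.FGT.12345.TXT",
    "FOLD/AAA.RST.87589.TXT",
    "RUNS/AAA.RST.11111.TXT",
]

is_target = matcher(more_targets)  # build the predicate once, reuse it for every path
selected = list(filter(is_target, mylist))
print(selected)
# ['FOLD/AAA.RST.12345.TXT', 'RUNS/AAA.FGT.12345.TXT', 'FOLD/AAA.RST.87589.TXT']
```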

You can use regular expressions to define a pattern, and check if your strings match that pattern.

In this case, split the target and insert \d+ between the xxx/yyy.zzz. part and the .txt part. Use the result as the pattern.

The pattern \d+ matches one or more digits. The rest of the pattern is built from the literal values of xxx/yyy.zzz and .txt. Since the period has a special meaning in regular expressions, we have to escape it with a backslash (\).

import re

selected_list = []
for target in target_list:
    base, ext = target.rsplit(".", 1)
    # Raw strings keep the backslashes intact and avoid invalid-escape warnings
    pat = ".".join([base, r"\d+", ext]).replace(".", r"\.")
    selected_list.append([s for s in mylist if re.match(pat, s) is not None])
print(selected_list)
#[['FOLD/AAA.RST.12345.TXT', 'FOLD/AAA.RST.87589.TXT']]

If the pattern does not match, re.match returns None .
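
A variation on the same idea, offered as a sketch rather than a replacement: re.escape handles the escaping of the literal parts, and re.fullmatch requires the whole string to match, so a path with trailing characters after the extension is rejected. The helper name target_pattern is an assumption made for this example. The result here is a flat list rather than one list per target.

```python
import re

target_list = ["FOLD/AAA.RST.TXT"]
mylist = [
    "FOLD/AAA.RST.12345.TXT",
    "FOLD/BBB.RST.12345.TXT",
    "RUNS/AAA.FGT.12345.TXT",
    "FOLD/AAA.RST.87589.TXT",
    "RUNS/AAA.RST.11111.TXT",
]

def target_pattern(target):
    # Split "FOLD/AAA.RST.TXT" into "FOLD/AAA.RST" and "TXT"
    base, ext = target.rsplit(".", 1)
    # Escape the literal parts and require ".<digits>." between them
    return re.compile(re.escape(base) + r"\.\d+\." + re.escape(ext))

patterns = [target_pattern(target) for target in target_list]

# A path is selected if it fully matches the pattern of any target
selected = [s for s in mylist if any(p.fullmatch(s) for p in patterns)]
print(selected)
# ['FOLD/AAA.RST.12345.TXT', 'FOLD/AAA.RST.87589.TXT']
```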

Why not use filter with a lambda function?

import re
result = list(filter(lambda item: re.sub(r'\.[0-9]+', '', item) == target_list[0], mylist))

Some comments:

  • The approach is to exclude the numeric part from the comparison. In the lambda function, for each mylist item we remove the digits together with the separator in front of them, then compare the result against the only item in target_list, target_list[0].
  • filter keeps all items for which the lambda function returns True.
  • Wrap everything in list to convert the filter object to a list.
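
One detail worth checking in a substitution like this is whether the period in the pattern is escaped. An unescaped period matches any character, so digits inside the name part would be removed as well. A short sketch of the difference, using a made-up path (it is not taken from the question):

```python
import re

# Hypothetical path whose name part ("A12") contains digits
sample = "FOLD/A12.RST.12345.TXT"

# Unescaped: "." matches any character, so "A12" is removed along with ".12345"
loose = re.sub(r".[0-9]+", "", sample)

# Escaped: only a literal period followed by digits is removed
strict = re.sub(r"\.[0-9]+", "", sample)

print(loose)   # FOLD/.RST.TXT
print(strict)  # FOLD/A12.RST.TXT
```

For the paths in the question both patterns give the same result, because digits only occur in the numeric part.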
