Python Check if list item does (not) contain any of other list items

Question

I want to remove a list element if it contains 'illegal' characters. The legal characters are specified in multiple lists, built as shown below, where alpha holds the ASCII letters (a-z and A-Z), digit holds the digits (0-9) and punct holds the punctuation characters (more or less).

import string

alpha = list(string.ascii_letters)
digit = list(string.digits)
punct = list(string.punctuation)

This way I can specify something as an illegal character if it doesn't appear in one of these lists.

After that I have a list containing elements:

Input = ["Amuu2", "Q1BFt", "dUM€n", "o°8o1G", "mgF)`", "ZR°p", "Y9^^M", "W0PD7"]

I want to filter out the elements containing illegal characters. So this is the result I want to get (doesn't need to be ordered):

var = ["Amuu2", "Q1BFt", "mgF)`", "Y9^^M", "W0PD7"]

EDIT:

I have tried this (and many variants of it):

for InItem in Input:
    if any(AlItem in InItem for AlItem in alpha+digit+punct):
        FilInput.append(InItem)

where a new list is created with only the filtered elements. The problem is that an element gets added as soon as it contains at least one legal character. For example, "ZR°p" got added because it contains Z, R and p.

I also tried:

for InItem in Input:
    if not any(AlItem in InItem for AlItem in alpha+digit+punct):

but after that, I couldn't figure out how to remove the element. One more thing: it would be nice if the solution were reasonably fast, because it needs to run millions of times. But it needs to work first.

Answer 1

Define a set of legal characters. Then apply a list comprehension.

>>> allowed = set(string.ascii_letters + string.digits + string.punctuation)
>>> inp = ["Amuu2", "Q1BFt", "dUM€n", "o°8o1G", "mgF)`", "ZR°p", "Y9^^M", "W0PD7"]
>>> [x for x in inp if all(c in allowed for c in x)]
['Amuu2', 'Q1BFt', 'mgF)`', 'Y9^^M', 'W0PD7']
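
A possible variation, not part of the original answer: since the legal characters are already in a set, the per-character all(...) loop can probably be replaced by a single superset test, which moves the inner loop into C code. A minimal sketch, assuming the same data as above; the helper name keep_legal is made up for illustration.

```python
import string

# Build the set of legal characters once, outside any loop.
allowed = frozenset(string.ascii_letters + string.digits + string.punctuation)


def keep_legal(items):
    """Return the items made up only of allowed characters (hypothetical helper)."""
    # allowed.issuperset(s) is True when every character of s is in allowed.
    return [s for s in items if allowed.issuperset(s)]


inp = ["Amuu2", "Q1BFt", "dUM€n", "o°8o1G", "mgF)`", "ZR°p", "Y9^^M", "W0PD7"]
print(keep_legal(inp))
# ['Amuu2', 'Q1BFt', 'mgF)`', 'Y9^^M', 'W0PD7']
```

Like the all(...) version, this keeps an empty string, because an empty string contains no illegal character.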

Answer 2

You can use a list comprehension and use all to check that every character meets your criteria:

>>> [element for element in Input if all(c in alpha + digit + punct for c in element)]
['Amuu2', 'Q1BFt', 'mgF)`', 'Y9^^M', 'W0PD7']

Answer 3

Your code

As you mentioned, you append words as soon as any character is a correct one. You need to check that they are all correct:

# words is the list called Input in the question
filtered_words = []
for word in words:
    if all(char in alpha+digit+punct for char in word):
        filtered_words.append(word)

print(filtered_words)
# ['Amuu2', 'Q1BFt', 'mgF)`', 'Y9^^M', 'W0PD7']

You could also check that there's not a single character which isn't correct:

filtered_words = []
for word in words:
    if not any(char not in alpha+digit+punct for char in word):
        filtered_words.append(word)

print(filtered_words)

It's much less readable though.

For efficiency, you shouldn't concatenate the lists with alpha+digit+punct on every iteration. Do it once, before any loop. It's also a good idea to create a set out of those lists, because char in set is much faster than char in list when there are many allowed characters.

Finally, you could use a list comprehension to avoid the for loop. If you do all this, you end up with @timgeb's solution :)
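
The efficiency points above can be checked with a rough timing sketch. This is only an illustration, assuming the lists and words from the question; the three function names are made up, and the actual timings depend on the machine, so none are shown.

```python
import string
import timeit

alpha = list(string.ascii_letters)
digit = list(string.digits)
punct = list(string.punctuation)
words = ["Amuu2", "Q1BFt", "dUM€n", "o°8o1G", "mgF)`", "ZR°p", "Y9^^M", "W0PD7"]

allowed_list = alpha + digit + punct  # concatenated once, but still a list
allowed_set = set(allowed_list)       # concatenated once and turned into a set


def filter_concat_each_time(items):
    # Rebuilds the combined list for every single character that is checked.
    return [w for w in items if all(c in alpha + digit + punct for c in w)]


def filter_with_list(items):
    # Linear search through the prebuilt list for every character.
    return [w for w in items if all(c in allowed_list for c in w)]


def filter_with_set(items):
    # Hash lookup in the prebuilt set for every character.
    return [w for w in items if all(c in allowed_set for c in w)]


for func in (filter_concat_each_time, filter_with_list, filter_with_set):
    seconds = timeit.timeit(lambda: func(words), number=2000)
    print(f"{func.__name__}: {seconds:.3f} s for 2000 runs")
```

All three functions return the same list; only the cost of each membership test differs.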

Alternative with regex

You can create a regex pattern from your lists and see which words match:

# encoding: utf-8
import string
import re

alpha = list(string.ascii_letters)
digit = list(string.digits)
punct = list(string.punctuation)

words = ["Amuu2", "Q1BFt", "dUM€n", "o°8o1G", "mgF)`", "ZR°p", "Y9^^M", "W0PD7"]

allowed_pattern = re.compile(
    '^[' +
    ''.join(
        re.escape(char) for char in (
            alpha +
            digit +
            punct)) +
    ']+$')
# Resulting pattern (newer Python versions escape fewer of the punctuation characters):
# ^[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^_\`\{\|\}\~]+$

print([word for word in words if allowed_pattern.match(word)])
# ['Amuu2', 'Q1BFt', 'mgF)`', 'Y9^^M', 'W0PD7']

You could also write:

print(list(filter(allowed_pattern.match, words)))
# ['Amuu2', 'Q1BFt', 'mgF)`', 'Y9^^M', 'W0PD7']

re.compile will probably take more time than simply initializing a set, but the filtering itself might then be faster.
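
One detail worth knowing about the pattern above: $ also matches just before a trailing newline, so a word such as "Amuu2\n" would pass the ^...$ check. A hedged sketch of a variant using fullmatch, which requires the whole string to match; the names allowed_re and with_anchors are made up for this example, and the extra "Amuu2\n" entry was added to the question's data to show the difference.

```python
import re
import string

allowed_chars = string.ascii_letters + string.digits + string.punctuation

# re.escape keeps characters such as ] ^ - and \ literal inside the class.
allowed_re = re.compile('[' + re.escape(allowed_chars) + ']+')

# Same character class, written with the ^...$ anchors for comparison.
with_anchors = re.compile('^[' + re.escape(allowed_chars) + ']+$')

words = ["Amuu2", "Q1BFt", "dUM€n", "o°8o1G", "mgF)`", "ZR°p", "Y9^^M", "W0PD7",
         "Amuu2\n"]  # extra entry with a trailing newline

# fullmatch must consume the entire string, so no anchors are needed.
print([w for w in words if allowed_re.fullmatch(w)])
# ['Amuu2', 'Q1BFt', 'mgF)`', 'Y9^^M', 'W0PD7']

# The anchored pattern with match() lets the trailing newline through.
print(bool(with_anchors.match("Amuu2\n")))
# True
```

Because of the + quantifier, both regex variants reject an empty string, whereas the all(...) versions keep it.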

Answer 4

This is not an efficient solution to your problem, but it can be useful for learning how to loop over a list, over the characters of a string, and so on.

# coding=utf-8
import string

# Aux var
result = []
new_elem = ""

# lists with legal characters
alpha = list(string.ascii_letters)
digit = list(string.digits)
punct = list(string.punctuation)

# Input strings
Input = ["Amuu2", "Q1BFt", "dUM€n", "o°8o1G", "mgF)`", "ZR°p", "Y9^^M", "W0PD7"]

# Loop all elements of the list and each char of them
for elem in Input:
    ## check each char 
    for char in elem:
        if char in alpha:
            #print 'is ascii'
            new_elem += char
        elif char in digit:
            #print 'is digit'
            new_elem += char
        elif char in punct:
            #print 'is punct'
            new_elem += char
        else:
            new_elem = ""
            break
    ## Add to result list
    if new_elem != "":
        result.append(new_elem)
        new_elem = ""

print(result)

4 answers

solution1
5 2017-07-27 12:22:00

solution2
1 2017-07-27 12:21:46

solution3
1 ACCEPTED 2017-07-27 12:22:44

solution4
1 2017-07-27 12:42:05
