
Regex filter items in list to have only those items which DO contain a character that isn't a-z

I have tried so many regex combinations that I am unsure if the problem is my regex or my python coding (being fairly new to both).

I have a list called inputs :

inputs = [':boy', '_144-', '_1445', '_1445', 'alpha', 'monkey', '#sdakm', '.file', '.magic']

I want to end up with a list that contains only those items which have at least one non-alphabetic character in them (which character it is, is not known in advance).

So I want to find:

newlist = [':boy', '_144-', '_1445', '#sdakm', '.file', '.magic']

without the items that are all [a-z]. I also want to filter out any duplicate matches (of any type).

My python code is as follows:

import os, sys, re, string, codecs, cchardet, chardet

inputs = [':boy', '_144-', '_1445', '_1445', 'alpha', 'monkey', '#sdakm', '.file', '.magic']

regex = re.compile('.*[^abcdefghijklmnopqrstuvwxyz]*.*')
myset = set()
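inputs_filtered=[]
ofile = open('filtered.txt', 'w')  # assumed: the post does not show how ofile was opened; the file name is hypothetical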
for inp in inputs:
    if regex.search(inp):
        if inp not in myset:
            inputs_filtered.append(inp)
            print('adding' + inp)
            myset.add(inp)
            ofile.write(inp + '\n')
        else:
            print('removing duplicate ' + inp)
    else:
        print("IS ALL LETTERS " + inp)
print(myset)
ofile.close()

Regexes I have tried, either to filter items out or to keep them (I have tried many different approaches, including code such as the following):

[filter(lambda i: regex.search(i), inputs)]

'\".*[\W|\.|_|\_|-|\-]*.*\"

'.*[^abcdefghijklmnopqrstuvwxyz]*.*'

'\"[\w]*\",?'

'[\w]*'

Another thing to note is that myset.add() seems to be producing an empty set, yet for some odd reason inputs_filtered is being populated... I think.

Since you are using a set in your example, it appears that the order of the results does not matter. You can do this easily in two ways: one with regex and one without (why bother with regex when you don't have to?).

With regex, you just need a simple pattern: [^a-z]. With filter you can do the following:

# drop the IGNORECASE option if you only want lowercase
pat = re.compile(r'[^a-z]', re.IGNORECASE)

# using the function pat.search as your filter function
results = set(filter(pat.search, inputs))
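If the original order matters after all (the question builds a list alongside its set), one possible variation is to deduplicate with dict.fromkeys, which keeps the first occurrence of each item. This is only a sketch: it assumes the lowercase-only pattern the question asks for, and the name pat_lower is made up here.

```python
import re

inputs = [':boy', '_144-', '_1445', '_1445', 'alpha', 'monkey', '#sdakm', '.file', '.magic']

# Lowercase-only character class, as in the question (no IGNORECASE).
pat_lower = re.compile(r'[^a-z]')

# dict.fromkeys keeps insertion order and drops repeated keys,
# so the result is deduplicated but still in input order.
ordered = list(dict.fromkeys(s for s in inputs if pat_lower.search(s)))

print(ordered)
# [':boy', '_144-', '_1445', '#sdakm', '.file', '.magic']
```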

If it suits your case, the str class has a method named isalpha that returns True if the string is non-empty and contains only alphabetic characters. You can build your set using the following code:

results = { word for word in inputs if not word.isalpha() }

If you include the filterfalse function from itertools (the counterpoint of filter ), you can do the following:

from itertools import filterfalse
results = set(filterfalse(str.isalpha, inputs))
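Note that isalpha and [^a-z] are not interchangeable for every input. The sketch below compares the two tests on a few strings chosen here for illustration (they are not from the question): isalpha accepts uppercase and non-ASCII letters, and it returns False for the empty string.

```python
import re

pat = re.compile(r'[^a-z]')

# Sample strings picked to show where the two tests disagree.
samples = ['alpha', 'Alpha', 'café', 'abc1', '']

# For each sample: (kept by the regex test, kept by the isalpha test)
comparison = {s: (bool(pat.search(s)), not s.isalpha()) for s in samples}

for s, (by_regex, by_isalpha) in comparison.items():
    print(repr(s), by_regex, by_isalpha)
# 'alpha' False False
# 'Alpha' True False
# 'café' True False
# 'abc1' True True
# '' False True
```

So the isalpha variants fit when any letter counts as alphabetic; the regex fits when only lowercase ASCII a-z should count.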

You can consider writing your own function to use with filter . Here's a function that also excludes colons or spaces:

def has_valid_characters(word):
    return not (word.isalpha() or 
                ' ' in word or 
                ':' in word)
# ...
results = set(filter(has_valid_characters, inputs))

If there are a bunch of other characters you'd want to exclude, you can use a regex or use the any function as part of your filter function:

def has_valid_characters_no_regex(word):
    return not (word.isalpha() or
                any(x in word for x in ' :#-'))

pat2 = re.compile('[- :#]')

def has_valid_characters_regex(word):
    return not (word.isalpha() or
                pat2.search(word))
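As a quick check, the two helper functions above can be run against the list from the question. This sketch simply repeats their definitions so that it is self-contained; both versions should keep the same items.

```python
import re

inputs = [':boy', '_144-', '_1445', '_1445', 'alpha', 'monkey', '#sdakm', '.file', '.magic']

pat2 = re.compile('[- :#]')

def has_valid_characters_no_regex(word):
    # Reject all-letter words and words containing space, colon, hash or hyphen.
    return not (word.isalpha() or any(x in word for x in ' :#-'))

def has_valid_characters_regex(word):
    # Same rule, with the excluded characters expressed as a character class.
    return not (word.isalpha() or pat2.search(word))

no_regex = set(filter(has_valid_characters_no_regex, inputs))
with_regex = set(filter(has_valid_characters_regex, inputs))

print(sorted(no_regex))    # ['.file', '.magic', '_1445']
print(no_regex == with_regex)  # True
```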

You can use re.findall with \W:

inputs = [':boy', '_144-', '_1445', '_1445', 'alpha', 'monkey', '#sdakm', '.file', '.magic']
final_inputs = list(filter(lambda x: re.findall(r'[\W_]', x), inputs))

Output:

[':boy', '_144-', '_1445', '_1445', '#sdakm', '.file', '.magic']
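One caveat with this pattern: \W treats digits as word characters, so a string whose only non-letter characters are digits is not flagged by [\W_]. The sketch below uses a small made-up list (including 'abc1', which is not in the question) to show the difference from [^a-z].

```python
import re

# 'abc1' is a hypothetical extra item: letters plus a digit, no punctuation.
samples = ['abc1', 'alpha', '_1445', ':boy']

# [\W_] matches non-word characters and the underscore, but not digits.
by_word_class = [s for s in samples if re.search(r'[\W_]', s)]

# [^a-z] matches anything that is not a lowercase ASCII letter, digits included.
by_negated_class = [s for s in samples if re.search(r'[^a-z]', s)]

print(by_word_class)     # ['_1445', ':boy']
print(by_negated_class)  # ['abc1', '_1445', ':boy']
```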

You have a * after [^abcdefghijklmnopqrstuvwxyz], which means "match zero or more repetitions", so the pattern also matches strings that contain no such character at all. Change it to a +, which means "match one or more repetitions".

You can abbreviate [^abcdefghijklmnopqrstuvwxyz] to [^a-z].

>>> regex = re.compile('.*[^a-z]+.*')
>>> list(filter(lambda s: regex.match(s), inputs))
[':boy', '_144-', '_1445', '_1445', '#sdakm', '.file', '.magic']
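To see the difference concretely, the sketch below tries the * and + versions on an all-letter string. It also shows that the surrounding .* is only needed because match anchors at the start of the string; with search, the bare character class should give the same answer. The variable names are chosen here for illustration.

```python
import re

inputs = [':boy', '_144-', '_1445', '_1445', 'alpha', 'monkey', '#sdakm', '.file', '.magic']

star = re.compile(r'.*[^a-z]*.*')  # original pattern: the class may match zero times
plus = re.compile(r'.*[^a-z]+.*')  # corrected pattern: the class must match at least once
bare = re.compile(r'[^a-z]')       # same test without the .* padding, for use with search

print(bool(star.match('alpha')))  # True: .* alone consumes the whole string
print(bool(plus.match('alpha')))  # False: there is no non a-z character to match

# match with the padded pattern and search with the bare class agree on every item.
same = all(bool(plus.match(s)) == bool(bare.search(s)) for s in inputs)
print(same)  # True
```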

You can also try an approach without regex:

inputs = [':boy', '_144-', '_1445', '_1445', 'alpha', 'monkey', '#sdakm', '.file', '.magic']


import unicodedata
import sys

symbols=[chr(i) for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P')]


print([j for i in symbols for j in inputs if i in j])

output:

['#sdakm', '_144-', '.file', '.magic', ':boy', '_144-', '_1445', '_1445']
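In the output above, '_144-' appears twice because it contains two different punctuation characters, and the items come out in code point order of the symbol rather than in input order. A possible variation, sketched here with a hypothetical helper named has_punctuation, checks each word's own characters instead of looping over every Unicode code point, so each item is tested once and the input order is kept.

```python
import unicodedata

inputs = [':boy', '_144-', '_1445', '_1445', 'alpha', 'monkey', '#sdakm', '.file', '.magic']

def has_punctuation(word):
    # True if any character falls in a Unicode punctuation category (P*).
    return any(unicodedata.category(ch).startswith('P') for ch in word)

filtered = [word for word in inputs if has_punctuation(word)]

print(filtered)
# [':boy', '_144-', '_1445', '_1445', '#sdakm', '.file', '.magic']
```

Like the original, this only looks for punctuation, so a string such as 'abc1' (letters and digits only) would not be kept.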
