
Remove non-alphabetic characters from a list of lists and maintain structure

I'm working in Python 2.7. I want to remove the non-alphabetic characters from every string in a list of lists without changing the structure of the lists.

Starting example list of lists:

csvarticles = [['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale'],['Hypertension in the pregnant woman].', '', '2010', 'Medical'],['Arterial hypertension.', '', '1920', 'La Nouvelle']]
print (csvarticles[0])

Desired Output:

[['beta blockers', 'magic', '1980', 'presse medicale'],['hypertension in the pregnant woman', '', '2010', 'medical'],['arterial hypertension', '', '1920', 'la nouvelle']]

Code 1:

csvarticles = [[word.lower().split() for word in nodeList] for nodeList in csvarticles]

print (csvarticles[0])

Code 1 Output:

['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale']
[['[beta-blockers]'], ['magic!'], ['1980'], ['presse', 'medicale']]

Code 2:

csvarticles = [[word.lower().split() for word in nodeList if word.isalpha()] for nodeList in csvarticles]

Code 2 Output:

[]

Code 3:

articleTitle = []
for x, y in enumerate(csvarticles):
    myString = simpleWords(csvarticles[x][0])
    if myString is not '':
        myString = myString.lower()
        myString = re.sub('[\W_]+', ' ', myString, flags=re.UNICODE)
        myList = [word for word in myString.split() if len(word) > 3]
        articleTitle = ' '.join(myList)

Code 3 Output:

['beta blockers', 'magic', '1980', 'presse medicale', 'hypertension pregnant woman', '2010', 'medical', 'arterial hypertension', '1920', 'nouvelle']

Code 3 gets close but eliminates the structure of the nested lists.

You want to replace every character that is neither a space nor alphanumeric, then trim and lowercase the string. Regular expressions are efficient for this kind of replacement, and the result can be chained with str.strip.

Rebuild the nested lists with a nested list comprehension:

import re

csvarticles = [['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale'],['Hypertension in the pregnant woman].', '', '2010', 'Medical'],['Arterial hypertension.', '', '1920', 'La Nouvelle']]

result = [[re.sub(r"[^ \w]", " ", x).strip().lower() for x in y] for y in csvarticles]

print(result)

prints:

[['beta blockers', 'magic', '1980', 'presse medicale'], ['hypertension in the pregnant woman', '', '2010', 'medical'], ['arterial hypertension', '', '1920', 'la nouvelle']]

If you're using Python 3, replace lower with casefold to handle special locale characters.
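The casefold variant could look like the sketch below. It assumes Python 3 (str.casefold does not exist in 2.7), and the function name clean_rows is made up for illustration. Note that \w also matches the underscore, so underscores are kept.

```python
import re

def clean_rows(rows):
    """Return a new list of lists; the nesting of the input is preserved."""
    cleaned = []
    for row in rows:
        # Replace anything that is not a space or a word character with a space,
        # then trim and casefold. One output string per input string.
        cleaned.append([re.sub(r"[^ \w]", " ", cell).strip().casefold() for cell in row])
    return cleaned

csvarticles = [['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale'],
               ['Hypertension in the pregnant woman].', '', '2010', 'Medical'],
               ['Arterial hypertension.', '', '1920', 'La Nouvelle']]

print(clean_rows(csvarticles))
```

Because each inner list comprehension yields exactly one string per input string, empty strings stay in place and the rows keep their length.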

Use the str.isalnum() method to check whether each character is a letter or a digit, and keep only those characters plus whitespace.

Demo

csvarticles = [['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale'],['Hypertension in the pregnant woman].', '', '2010', 'Medical'],['Arterial hypertension.', '', '1920', 'La Nouvelle']]
res = []
for i in csvarticles:
    r = []
    for j in i:
        r.append("".join([k for k in j if (k.isalnum() or k.isspace())]).lower())
    res.append(r)
print(res)

Output:

[['betablockers', 'magic', '1980', 'presse medicale'], ['hypertension in the pregnant woman', '', '2010', 'medical'], ['arterial hypertension', '', '1920', 'la nouvelle']]
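This output contains 'betablockers' rather than the requested 'beta blockers', because the hyphen is dropped instead of being replaced. A possible variant, sketched here with a hypothetical helper name clean_cell, substitutes a space for each unwanted character and then collapses the extra whitespace:

```python
def clean_cell(text):
    # Swap every character that is neither alphanumeric nor whitespace for a space.
    spaced = "".join(ch if (ch.isalnum() or ch.isspace()) else " " for ch in text)
    # split() with no arguments trims the ends and collapses runs of whitespace.
    return " ".join(spaced.split()).lower()

csvarticles = [['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale'],
               ['Hypertension in the pregnant woman].', '', '2010', 'Medical'],
               ['Arterial hypertension.', '', '1920', 'La Nouvelle']]

# The nested comprehension mirrors the nested input, so the structure is kept.
res = [[clean_cell(cell) for cell in row] for row in csvarticles]
print(res)
```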

If you want to do this in a one-liner:

INPUT:

import re

output = [[k.lower() for k in [' '.join(re.findall(r'[^\]\[.!-][A-z0-9]+[^\]\[.!-]', j)) for j in i]] for i in csvarticles]

OUTPUT:

[['beta blockers', 'magic', '1980', 'presse  medicale'], ['hypertension  in  the  pregnant  woman', '', '2010', 'medical'], ['arterial  hypertension', '', '1920', 'la  nouvelle']]

REGEX:

[^\]\[.!-][A-z0-9]+[^\]\[.!-]
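Two details of this pattern are worth checking. The range A-z spans every ASCII character between Z and a, so it also matches [, \, ], ^, _ and `. In addition, a match can include a trailing space, which is why joining on ' ' produces the doubled spaces seen in the output. The sketch below is an alternative, not the answerer's code: it assumes ASCII-only input and uses a hypothetical helper name tokens_only.

```python
import re

# The ASCII range A-z covers more than letters.
print(re.findall(r"[A-z]", "[\\]^_`"))

def tokens_only(text):
    # Each run of ASCII letters or digits is one token; joining the tokens
    # on a single space avoids doubled spaces.
    return " ".join(re.findall(r"[A-Za-z0-9]+", text)).lower()

csvarticles = [['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale'],
               ['Hypertension in the pregnant woman].', '', '2010', 'Medical'],
               ['Arterial hypertension.', '', '1920', 'La Nouvelle']]

output = [[tokens_only(j) for j in i] for i in csvarticles]
print(output)
```

Accented letters would be split out by [A-Za-z0-9]; for non-ASCII text, a \w-based pattern is the safer choice.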
