I'm working in Python 2.7. I want to remove the non-alphanumeric characters from each string in a list of lists, and lowercase the result, without modifying the structure of the lists.
Starting example list of lists:
csvarticles = [['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale'],['Hypertension in the pregnant woman].', '', '2010', 'Medical'],['Arterial hypertension.', '', '1920', 'La Nouvelle']]
print (csvarticles[0])
Desired Output:
[['beta blockers', 'magic', '1980', 'presse medicale'],['hypertension in the pregnant woman', '', '2010', 'medical'],['arterial hypertension', '', '1920', 'la nouvelle']]
Code 1:
csvarticles = [[word.lower().split() for word in nodeList] for nodeList in csvarticles]
print (csvarticles[0])
Code 1 Output:
['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale']
[['[beta-blockers]'], ['magic!'], ['1980'], ['presse', 'medicale']]
Code 2:
csvarticles = [[word.lower().split() for word in nodeList if word.isalpha()] for nodeList in csvarticles]
Code 2 Output:
[]
Code 3:
articleTitle = []
for x, y in enumerate(csvarticles):
    myString = simpleWords(csvarticles[x][0])
    if myString is not '':
        myString = myString.lower()
        myString = re.sub('[\W_]+', ' ', myString, flags=re.UNICODE)
        myList = [word for word in myString.split() if len(word) > 3]
        articleTitle = ' '.join(myList)
Code 3 Output:
['beta blockers', 'magic', '1980', 'presse medicale', 'hypertension pregnant woman', '2010', 'medical', 'arterial hypertension', '1920', 'nouvelle']
Code 3 gets close but eliminates the structure of the nested lists.
You want to replace every character that is neither a space nor alphanumeric, then trim and lowercase the string. Regular expressions are efficient for this kind of replacement, and the result can be chained with str.strip.
Rebuild the nested lists with a double list comprehension:
import re
csvarticles = [['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale'],['Hypertension in the pregnant woman].', '', '2010', 'Medical'],['Arterial hypertension.', '', '1920', 'La Nouvelle']]
result = [[re.sub("[^ \w]"," ",x).strip().lower() for x in y] for y in csvarticles]
print(result)
prints:
[['beta blockers', 'magic', '1980', 'presse medicale'], ['hypertension in the pregnant woman', '', '2010', 'medical'], ['arterial hypertension', '', '1920', 'la nouvelle']]
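One case the sample data does not exercise is a run of several punctuation characters in the middle of a string, which the substitution above would turn into several consecutive spaces. A minimal sketch of one way to handle that, assuming it is acceptable to treat the underscore as a separator too (the helper name clean_cell is hypothetical, not part of the answer):

```python
import re

csvarticles = [['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale'],
               ['Hypertension in the pregnant woman].', '', '2010', 'Medical'],
               ['Arterial hypertension.', '', '1920', 'La Nouvelle']]

# Hypothetical helper: matches one or more non-word characters or underscores,
# so a whole run of punctuation and spaces becomes a single space.
SEPARATORS = re.compile(r"[\W_]+", flags=re.UNICODE)

def clean_cell(text):
    return SEPARATORS.sub(" ", text).strip().lower()

result = [[clean_cell(x) for x in y] for y in csvarticles]
print(result)
print(clean_cell("Arterial -- hypertension."))
```

The nested comprehension is unchanged, so the list-of-lists structure is preserved; only the per-string cleaning step differs.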
If you're using Python 3, replace lower with casefold to handle special locale characters.
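To illustrate the difference, here is a small sketch for Python 3.3 or later, where str.casefold is available (it does not exist in Python 2.7). The sample word is an assumption chosen for illustration and is not taken from the question's data:

```python
# Python 3 only: str.casefold applies more aggressive, comparison-oriented folding.
word = "Straße"

lowered = word.lower()      # keeps the German sharp s
folded = word.casefold()    # expands the sharp s to "ss"

print(lowered)
print(folded)
```

For plain ASCII titles like the ones in the question, the two methods give the same result.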
Use the str.isalnum() method to check whether each character is a letter or a digit, and keep only those characters plus whitespace.
Demo
csvarticles = [['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale'],['Hypertension in the pregnant woman].', '', '2010', 'Medical'],['Arterial hypertension.', '', '1920', 'La Nouvelle']]
res = []
for i in csvarticles:
    r = []
    for j in i:
        # keep only letters, digits and whitespace, then lowercase
        r.append("".join([k for k in j if (k.isalnum() or k.isspace())]).lower())
    res.append(r)
print(res)
Output (the hyphen is dropped rather than replaced with a space, hence 'betablockers'):
[['betablockers', 'magic', '1980', 'presse medicale'], ['hypertension in the pregnant woman', '', '2010', 'medical'], ['arterial hypertension', '', '1920', 'la nouvelle']]
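If 'beta blockers' with a space is wanted, as in the question's desired output, the same character-by-character idea can swap unwanted characters for a space instead of dropping them. This is a minimal sketch of that variation, not the answer's original code; the helper name clean is hypothetical:

```python
csvarticles = [['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale'],
               ['Hypertension in the pregnant woman].', '', '2010', 'Medical'],
               ['Arterial hypertension.', '', '1920', 'La Nouvelle']]

def clean(text):
    # Replace every character that is not a letter or digit with a space...
    spaced = "".join(ch if ch.isalnum() else " " for ch in text)
    # ...then split()/join() collapses runs of spaces and trims both ends.
    return " ".join(spaced.split()).lower()

res = [[clean(cell) for cell in row] for row in csvarticles]
print(res)
```

Empty strings stay empty, so each inner list keeps its original length.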
If you want to do this in a one-liner (this assumes import re has already been run):
INPUT:
output = [[k.lower() for k in [' '.join(re.findall(r'[^\]\[.!-][A-z0-9]+[^\]\[.!-]', j)) for j in i]] for i in csvarticles]
OUTPUT:
[['beta blockers', 'magic', '1980', 'presse medicale'], ['hypertension in the pregnant woman', '', '2010', 'medical'], ['arterial hypertension', '', '1920', 'la nouvelle']]
REGEX:
[^\]\[.!-][A-z0-9]+[^\]\[.!-]
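Two details of this pattern are worth knowing before reusing it. The range A-z covers every ASCII code between "A" and "z", which includes several punctuation characters, and each match needs at least three characters because of the leading and trailing character classes. The sketch below demonstrates the first point and shows a simpler alternative with explicit ranges. The alternative is a swapped-in pattern, not the answer's, and it only keeps ASCII letters and digits:

```python
import re

# [A-z] spans ASCII 0x41-0x7A, so it also matches [ \ ] ^ _ and `
extra = re.findall(r"[A-z]", "[\\]^_`")
print(extra)

csvarticles = [['[Beta-blockers]', 'Magic!', '1980', 'Presse medicale'],
               ['Hypertension in the pregnant woman].', '', '2010', 'Medical'],
               ['Arterial hypertension.', '', '1920', 'La Nouvelle']]

# Alternative: pull out runs of ASCII letters/digits and rejoin them with single spaces.
output = [[" ".join(re.findall(r"[A-Za-z0-9]+", cell)).lower() for cell in row]
          for row in csvarticles]
print(output)
```

Because the words are rejoined with a single space, punctuation between words never leaves doubled spaces behind.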