Check if string in pandas database contains substring and remove
I am cleaning a column, 'PERCENTAGE_AFFECTED', of a pandas data frame. It contains integer ranges (e.g. "70-80", "70 and 80", "65 to 70").
I am trying to write a function that cleans all of these up and produces integer averages.
This works:
# reduce is a builtin in Python 2; on Python 3 it must be imported
from functools import reduce

def clean_split_range(row):
    # Initial value contains the current value for the PERCENTAGE_AFFECTED column
    initial_perc = str(row['PERCENTAGE_AFFECTED'])
    chars = '<>!,?":;() '
    # Remove chars in initial value
    if any(c in chars for c in initial_perc):
        split_range = []
        cleanWord = ""
        for char in initial_perc:
            if char in chars:
                char = ""
            cleanWord += char
        split_range.append(cleanWord)
        initial_perc = ''.join(split_range)
    # Split initial_perc into two elements if "-" is found
    split_range = initial_perc.split('-')
    # If a "-" is found, split_range will contain a list with two items
    if len(split_range) > 1:
        try:
            final_perc = int(reduce(lambda x, y: x + y, list(map(int, split_range))) / (len(split_range)))
        except ValueError:
            split_range = split_range[0].split('+')
            final_perc = split_range[0]
        finally:
            if str(final_perc).isalpha():
                final_perc = 0
    elif initial_perc.find('and') != -1:
        split_other = initial_perc.split('and')
        if len(split_other) > 1:
            try:
                final_perc = int(reduce(lambda x, y: x + y, list(map(int, split_other))) / (len(split_other)))
            except ValueError:
                split_other = split_other[0].split('+')
                final_perc = split_other[0]
            finally:
                if str(final_perc).isalpha():
                    final_perc = 0
    elif initial_perc.find('to') != -1:
        split_other = initial_perc.split('to')
        if len(split_other) > 1:
            try:
                final_perc = int(reduce(lambda x, y: x + y, list(map(int, split_other))) / (len(split_other)))
            except ValueError:
                split_other = split_other[0].split('+')
                final_perc = split_other[0]
            finally:
                if str(final_perc).isalpha():
                    final_perc = 0
    elif initial_perc.find('±') != -1:
        split_other = initial_perc.split('±')
        final_perc = split_other[0]
    elif initial_perc.startswith('over'):
        split_other = initial_perc.split('over')
        final_perc = split_other[1]
    elif initial_perc.find('around') != -1:
        split_other = initial_perc.split('around')
        final_perc = split_other[1]
    elif initial_perc.isalpha():
        final_perc = 0
    # If no separator is found, split_range contains just one item, so keep initial_perc
    else:
        final_perc = initial_perc
    return final_perc
However, I am trying to simplify this so that the same splitting logic runs whenever the entry contains any of the substrings "-", "and" or "to". I have created a list of substrings (split_list) that I want to split by and remove:
def new_clean_split_range(row):
    # Initial value contains the current value for the PERCENTAGE AFFECTED column
    initial_perc = str(row['PERCENTAGE_AFFECTED'])
    chars = '<>!,?":;() '
    split_list = ['-','and']
    # Split initial_perc into two elements if "-" is found
    if any(a in initial_perc for a in split_list):
        for a in split_list:
            split_range = initial_perc.split(a)
            # If a "-" is found in split_list, initial_perc will contain a list with two items
            if len(split_range) > 1:
                try:
                    final_perc = int(reduce(lambda x, y: x + y, list(map(int, split_range))) / (len(split_range)))
                except ValueError:
                    split_range = split_range[0].split('+')
                    final_perc = split_range[0]
                finally:
                    if str(final_perc).isalpha():
                        final_perc = 0
            else:
                final_perc = initial_perc
            #Remove chars in initial value
            if any(c in chars for c in initial_perc):
                split_range =[]
                cleanWord = ""
                for char in initial_perc:
                    if char in chars:
                        char = ""
                    cleanWord += char
                split_range.append(cleanWord)
                initial_perc = ''.join(split_range)
                split_range = ''
    elif initial_perc.find('±') != -1:
        split_other = initial_perc.split('±')
        final_perc = split_other[0]
    elif initial_perc.startswith('over'):
        split_other = initial_perc.split('over')
        final_perc = split_other[1]
    elif initial_perc.find('around') != -1:
        split_other = initial_perc.split('around')
        final_perc = split_other[1]
    elif initial_perc.isalpha():
        final_perc = 0
    # If no "-" is found, split_date will just contain 1 item, the initial_date
    else:
        final_perc = initial_perc
    return final_perc
Any help would be great :)
I would suggest using a regex. Check this out:
import re
results = re.findall(r"(\d{2,3}\.?\d*).*?(\d{2,3}\.?\d*)", x).pop()  # x is the input string
print(results)
# results will be a tuple, which is easy to handle.
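As a minimal sketch of what handling the tuple could look like, the matched pair can be converted to numbers and averaged, which is what the question asks for. The helper name `average_range` is hypothetical, and returning `None` when no pair is found is an assumption, not part of the answer above.

```python
import re

# Same pattern as in the answer: two numbers of 2-3 digits, each with optional decimals.
RANGE_PATTERN = re.compile(r"(\d{2,3}\.?\d*).*?(\d{2,3}\.?\d*)")

def average_range(text):
    """Return the integer average of the two numbers in text, or None if no pair matches."""
    match = RANGE_PATTERN.search(str(text))
    if match is None:
        # Assumption: signal "no range found" with None and let the caller decide.
        return None
    low, high = (float(group) for group in match.groups())
    return int((low + high) / 2)

print(average_range("70-80"))      # 75
print(average_range("65 to 70"))   # 67
```

Using `search` instead of `findall(...).pop()` avoids an `IndexError` on inputs that contain no pair of numbers.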
Checked with the following inputs and outputs:
Input
'70.5894-80.9894'
'70 and 85'
'65 to 70'
'72 <>75'
Output
('70.5894', '80.9894')
('70', '85')
('65', '70')
('72', '75')
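Note that the pattern above needs two numbers of at least two digits each, so entries such as "over 70" or "<5" would not match. The following is a rough sketch, assuming Python 3, of one way to cover those cases as well: it collects every number in the entry and averages whatever it finds. The name `clean_percentage`, the `NUMBER` pattern, and the choice to return 0 for entries without digits (mirroring the question's function) are all assumptions rather than part of the original answer.

```python
import re

# Hypothetical pattern: any run of digits with an optional decimal part.
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def clean_percentage(value):
    """Average every number found in value; return 0 when there is none."""
    # Keep only the part before a "±" tolerance, as the question's function does.
    text = str(value).split("±")[0]
    numbers = [float(n) for n in NUMBER.findall(text)]
    if not numbers:
        return 0
    return int(sum(numbers) / len(numbers))

print(clean_percentage("70 and 80"))  # 75
print(clean_percentage("over 70"))    # 70
print(clean_percentage("unknown"))    # 0
```

Assuming the data frame is called `df` (a placeholder name), this could be applied to the column with `df['PERCENTAGE_AFFECTED'].apply(clean_percentage)`.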