Check whether strings in a pandas DataFrame contain a substring and remove it

Question

I am cleaning a column of a pandas data frame, 'PERCENTAGE_AFFECTED'. It contains integer ranges (e.g. "70-80", "70 and 80", "65 to 70").

I am trying to write a function that cleans all of these up and produces integer averages.

This works:

from functools import reduce  # built in on Python 2, must be imported on Python 3


def clean_split_range(row):
    # Initial value contains the current value for the PERCENTAGE_AFFECTED column
    initial_perc = str(row['PERCENTAGE_AFFECTED'])
    chars = '<>!,?":;() '

    # Remove chars in initial value
    if any(c in chars for c in initial_perc):
        split_range = []
        cleanWord = ""
        for char in initial_perc:
            if char in chars:
                char = ""
            cleanWord += char
        split_range.append(cleanWord)
        initial_perc = ''.join(split_range)

    # Split initial_perc into two elements if "-" is found
    split_range = initial_perc.split('-')
    # If a "-" is found, split_range will contain a list with two items
    if len(split_range) > 1:
        try:
            final_perc = int(reduce(lambda x, y: x + y, list(map(int, split_range))) / (len(split_range)))
        except ValueError:
            split_range = split_range[0].split('+')
            final_perc = split_range[0]
        finally:
            if str(final_perc).isalpha():
                final_perc = 0

    elif initial_perc.find('and') != -1:
        split_other = initial_perc.split('and')
        if len(split_other) > 1:
            try:
                final_perc = int(reduce(lambda x, y: x + y, list(map(int, split_other))) / (len(split_other)))
            except ValueError:
                split_other = split_other[0].split('+')
                final_perc = split_other[0]
            finally:
                if str(final_perc).isalpha():
                    final_perc = 0

    elif initial_perc.find('to') != -1:
        split_other = initial_perc.split('to')
        if len(split_other) > 1:
            try:
                final_perc = int(reduce(lambda x, y: x + y, list(map(int, split_other))) / (len(split_other)))
            except ValueError:
                split_other = split_other[0].split('+')
                final_perc = split_other[0]
            finally:
                if str(final_perc).isalpha():
                    final_perc = 0

    elif initial_perc.find('±') != -1:
        split_other = initial_perc.split('±')
        final_perc = split_other[0]

    elif initial_perc.startswith('over'):
        split_other = initial_perc.split('over')
        final_perc = split_other[1]

    elif initial_perc.find('around') != -1:
        split_other = initial_perc.split('around')
        final_perc = split_other[1]

    elif initial_perc.isalpha():
        final_perc = 0

    # If none of the cases above match, keep the cleaned value as it is
    else:
        final_perc = initial_perc

    return final_perc

But I am trying to simplify this so that a single branch handles any entry containing the "-", "and", or "to" substring. I have created a list of substrings (split_list) that I want to split by and remove:

def new_clean_split_range(row):
    # Initial value contains the current value for the PERCENTAGE_AFFECTED column
    initial_perc = str(row['PERCENTAGE_AFFECTED'])
    chars = '<>!,?":;() '
    split_list = ['-', 'and']

    # Split initial_perc into two elements if any item of split_list is found
    if any(a in initial_perc for a in split_list):
        for a in split_list:
            split_range = initial_perc.split(a)
            # If the separator is found, split_range will contain a list with two items
            if len(split_range) > 1:
                try:
                    final_perc = int(reduce(lambda x, y: x + y, list(map(int, split_range))) / (len(split_range)))
                except ValueError:
                    split_range = split_range[0].split('+')
                    final_perc = split_range[0]
                finally:
                    if str(final_perc).isalpha():
                        final_perc = 0
            else:
                final_perc = initial_perc

    # Remove chars in initial value
    if any(c in chars for c in initial_perc):
        split_range = []
        cleanWord = ""
        for char in initial_perc:
            if char in chars:
                char = ""
            cleanWord += char
        split_range.append(cleanWord)
        initial_perc = ''.join(split_range)
        split_range = ''

    elif initial_perc.find('±') != -1:
        split_other = initial_perc.split('±')
        final_perc = split_other[0]

    elif initial_perc.startswith('over'):
        split_other = initial_perc.split('over')
        final_perc = split_other[1]

    elif initial_perc.find('around') != -1:
        split_other = initial_perc.split('around')
        final_perc = split_other[1]

    elif initial_perc.isalpha():
        final_perc = 0

    # If none of the cases above match, keep the value as it is
    else:
        final_perc = initial_perc

    return final_perc

Any help would be great :)
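
For illustration, the split_list idea could be sketched by joining the separators into one regular expression and splitting once, rather than looping over them and overwriting the result on each pass. This is only a sketch under assumptions: the helper name average_range and the fallback of 0 for non-numeric parts are hypothetical, chosen to mirror the working function above.

```python
import re

# Separators from the question; re.escape keeps each one literal in the pattern.
split_list = ['-', 'and', 'to']
splitter = re.compile('|'.join(re.escape(s) for s in split_list))


def average_range(value):
    """Hypothetical helper: split on any separator and average the integer parts."""
    parts = [p.strip() for p in splitter.split(str(value))]
    try:
        numbers = [int(p) for p in parts if p]
    except ValueError:
        return 0  # assumed fallback when a part is not an integer
    if not numbers:
        return 0
    return int(sum(numbers) / len(numbers))


print(average_range('70-80'))      # 75
print(average_range('70 and 80'))  # 75
print(average_range('65 to 70'))   # 67
```

Note that plain substrings such as 'to' also match inside longer words, so free-text entries would still need the extra branches from the original function.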

Answer 1

I would suggest using a regex.

Check this out:

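import re

x = '70 and 85'  # x is the input string
# findall returns a list of (first, second) tuples; pop() takes the last match
results = re.findall(r"(\d{2,3}\.?\d*).*?(\d{2,3}\.?\d*)", x).pop()
print(results)
# results is a tuple of the two numbers as strings, which you can handle easily.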

Checked with the following inputs and outputs:

Input
'70.5894-80.9894'
'70 and 85'
'65 to 70'
'72 <>75'

Output
('70.5894', '80.9894')
('70', '85')
('65', '70')
('72', '75')
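
To get from the matched numbers to the integer average the question asks for, something like the following might work. It is a minimal sketch, not a drop-in replacement: the function name clean_percentage is hypothetical, the pattern is a simplified variant that also accepts single numbers (for entries like "over 70"), and returning 0 for entries without digits is an assumption taken from the question's code.

```python
import re

# One or more digits, optionally followed by a decimal part.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def clean_percentage(value):
    """Hypothetical helper: average every number found in the value."""
    # Keep only the central value of entries like '70±5', as the question does.
    text = str(value).split('±')[0]
    numbers = [float(n) for n in NUMBER.findall(text)]
    if not numbers:
        return 0  # assumed fallback for purely alphabetic entries
    return int(sum(numbers) / len(numbers))


print(clean_percentage('70.5894-80.9894'))  # 75
print(clean_percentage('70 and 85'))        # 77
print(clean_percentage('72 <>75'))          # 73
print(clean_percentage('over 70'))          # 70
```

Assuming the data frame is called df, this could be applied to the column with df['PERCENTAGE_AFFECTED'].apply(clean_percentage).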
