Skip the operation on a row if the value is non-numeric in a pandas dataframe

Question

I have a dataframe:

import pandas as pd
df = pd.DataFrame({'start' : [5, 10, '$%%', 20], 'stop' : [10, 20, 30, 40]})
df['length_of_region'] = pd.Series([0 for i in range(0, len(df['start']))])

I want to calculate the length of the region only for rows whose start value is a non-zero number, and skip the calculation, with an error note, for rows where the value is not valid. Here is what I have so far:

df['Notes'] = pd.Series(["" for i in range(0, len(df['region_name']))])

for i in range(0, len(df['start'])):
    if pd.isnull(df['start'][i]) == True:
        df['Notes'][i] += 'Error: Missing value for chromosome start at region %s, required value;' % (df['region_name'][i])
        df['critical_error'][i] = True
        num_error = num_error+1
    else:
        try:
            #print (df['start'][i]).isnumeric()
            start = int(df['start'][i])
            #print start
            #print df['start'][i]
            if start == 0:
                raise ValueError
        except:
            df['Notes'][i] += 'Error: Chromosome start should be a non zero number at region %s; ' % (df['region_name'][i])
            #print df['start'][i]
            df['critical_error'][i] = True
            num_error = num_error+1
for i in range(0, len(df['start'])):
    if df['critical_error'][i] == True:
        continue
    df['length_of_region'][i] = (df['stop'][i] - df['start'][i]) + 1.0

However, pandas stores df['start'] as str values, and even if I use int to convert a value, I get the following error:

df['length_of_region'][i] = (df['stop'][i] - df['start'][i]) + 1.0

TypeError: unsupported operand type(s) for -: 'numpy.int64' and 'str'

What am I missing here? Thanks for your time!

Answer 1

You can define a custom function to do the calculation, then apply that function to each row.

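def calculate_region_length(x):
    # x is one row of the dataframe (a pandas Series)
    start_val = x['start']
    stop_val = x['stop']
    try:
        # float() accepts numbers and numeric strings such as '1'
        start_val = float(start_val)
        return (stop_val - start_val) + 1.0
    except (ValueError, TypeError):
        # start could not be converted to a number, so skip this row
        return None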

The custom function accepts one row (a pandas Series) as input. It tests whether the start value can be converted to a float; if it cannot, None is returned. This way, if '1' is stored as a string, the value is still converted to a float and is not skipped, whereas '$%%' in your example cannot be converted and returns None.
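To illustrate the conversion test described above, here is a minimal sketch. The helper name can_convert is hypothetical and is not part of the answer; it only shows which values float() accepts and which it rejects.

```python
def can_convert(value):
    """Return True if float(value) succeeds (hypothetical helper, for illustration only)."""
    try:
        float(value)
        return True
    except (ValueError, TypeError):
        # ValueError: a string that is not a number, e.g. '$%%'
        # TypeError: a value float() cannot handle at all, e.g. None
        return False


for candidate in [5, '1', '$%%', None]:
    print(repr(candidate), can_convert(candidate))
```

Catching TypeError alongside ValueError is a design choice here: it keeps a missing value such as None from stopping the whole calculation.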

Next, call the custom function for each row:

df['length_of_region'] = df[['start', 'stop']].apply(calculate_region_length, axis=1)

This creates the new column with (stop - start) + 1.0 for rows where start can be converted to a number, and None (a missing value) for rows where start is a string that cannot be converted.

You can then update the Notes field for the rows where None was returned, to identify the regions where a valid start value is missing:

df.loc[df['length_of_region'].isnull(), 'Notes'] = df['region_name']
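As a possible alternative to the row-wise apply, the same result can be sketched with pd.to_numeric and errors='coerce', which turns non-convertible entries into NaN. This is not what the answer above does; the names start_num and valid and the wording of the note are assumptions made for illustration. The sketch also treats a start of zero as invalid, as the question asks.

```python
import pandas as pd

df = pd.DataFrame({'start': [5, 10, '$%%', 20], 'stop': [10, 20, 30, 40]})

# Entries that cannot be parsed as numbers become NaN instead of raising.
start_num = pd.to_numeric(df['start'], errors='coerce')

# A row is valid only if start parsed successfully and is non-zero.
valid = start_num.notna() & (start_num != 0)

# Compute the length for every row, then blank out the invalid ones.
df['length_of_region'] = (df['stop'] - start_num + 1.0).where(valid)

# Illustrative error note for the skipped rows (wording is an assumption).
df['Notes'] = ''
df.loc[~valid, 'Notes'] = 'Error: Chromosome start should be a non zero number'

print(df)
```

This avoids calling a Python function once per row, which may matter for large dataframes.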

Answer 2

After staring at the code for quite some time, I found a simple fix: write the converted start value from the try-except block back into df['start'] at row i, as follows:

for i in range(len(df)):
    if pd.isnull(df.loc[i, 'start']):
        df.loc[i, 'Notes'] += 'Error: Missing value for chromosome start at region %s, required value;' % (df.loc[i, 'region_name'])
        df.loc[i, 'critical_error'] = True
        num_error = num_error + 1
    else:
        try:
            start = int(df.loc[i, 'start'])
            # Write the converted int back so the later subtraction sees a number
            df.loc[i, 'start'] = start
            if start == 0:
                raise ValueError
        except (ValueError, TypeError):
            df.loc[i, 'Notes'] += 'Error: Chromosome start should be a non zero number at region %s; ' % (df.loc[i, 'region_name'])
            df.loc[i, 'critical_error'] = True
            num_error = num_error + 1
for i in range(len(df)):
    # Skip rows that were flagged as invalid above
    if df.loc[i, 'critical_error']:
        continue
    df.loc[i, 'length_of_region'] = (df.loc[i, 'stop'] - df.loc[i, 'start']) + 1.0

Assigning the converted start value back into the dataframe stores it as an int, so length_of_region is calculated only for rows with numeric values.
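The reason the write-back matters can be sketched as follows. The example assumes that the start values arrive as numeric strings (for instance from a text file), which is a guess based on the error message in the question; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'start': [5, 10, '$%%', 20], 'stop': [10, 20, 30, 40]})

# A column that mixes ints and strings is stored with the generic object dtype.
print(df['start'].dtype)

# Assumed scenario: the value is a numeric string rather than an int.
df.loc[0, 'start'] = '5'

# int() returns a new value; the string inside the dataframe is unchanged.
converted = int(df.loc[0, 'start'])
still_str = isinstance(df.loc[0, 'start'], str)

# Only an explicit assignment replaces the stored string with the number.
df.loc[0, 'start'] = converted
now_not_str = not isinstance(df.loc[0, 'start'], str)

print(still_str, now_not_str)
```

In the original loop, the converted value was kept only in the local variable start, so the subtraction later read the unconverted string from the dataframe.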
