python recode csv with condition

I am a beginner in Python. I need to recode a CSV file:

unique_id,pid,Age
1,1,1
1,2,3
2,1,5
2,2,6
3,1,6
3,2,4
3,3,6
3,4,1
3,5,4
4,1,6
4,2,5

The condition is: for each [unique_id], if any row has [Age]==6, then put a value of 1 in the row of that group where [pid]==1; all other rows should be 0.

The output CSV will look like this:

unique_id,pid,Age,recode
1,1,1,0
1,2,3,0
2,1,5,1
2,2,6,0
3,1,6,1
3,2,4,0
3,3,6,0
3,4,1,0
3,5,4,0
4,1,6,1
4,2,5,0

I was using numpy, like the following:

import numpy
import pandas as pd

input_file1 = "data.csv"
input_folder = 'G:/My Drive/'
Her_HH = pd.read_csv(input_folder + input_file1)
# marks the rows where Age is 6, not the pid == 1 row of the group
Her_HH['recode'] = numpy.select([Her_HH['Age']==6, Her_HH['Age']<6], [1,0], default=0)

Her_HH.to_csv('recode_elderly.csv', index=False)

But it does not put the value 1 where [pid] is 1. Any help will be appreciated.

You can build the new column from two masks. The first uses DataFrame.assign to add a helper column and GroupBy.transform with 'any' to test whether each group has at least one match; the second tests pid == 1. Chain the two masks with & (bitwise AND) and finally cast the output to integers:

#sorting if necessary
df = df.sort_values('unique_id')

m1 = df.assign(test=df['Age'] == 6).groupby('unique_id')['test'].transform('any')

Another way to get the groups containing a 6 is to filter on unique_id with Series.isin:

m1 = df['unique_id'].isin(df.loc[df['Age'] == 6, 'unique_id'])

m2 = df['pid'] == 1

df['recode'] = (m1 & m2).astype(int)
print (df)
    unique_id  pid  Age  recode
0           1    1    1       0
1           1    2    3       0
2           2    1    5       1
3           2    2    6       0
4           3    1    6       1
5           3    2    4       0
6           3    3    6       0
7           3    4    1       0
8           3    5    4       0
9           4    1    6       1
10          4    2    5       0
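
As a self-contained sketch of the approach above, the following inlines the question's sample with io.StringIO (an assumption made only so the example runs without a file on disk) and builds the recode column from the two masks:

```python
import io

import pandas as pd

# Sample data from the question, inlined so the sketch needs no file on disk.
csv_text = """unique_id,pid,Age
1,1,1
1,2,3
2,1,5
2,2,6
3,1,6
3,2,4
3,3,6
3,4,1
3,5,4
4,1,6
4,2,5
"""
df = pd.read_csv(io.StringIO(csv_text))

# m1: True on every row whose unique_id group contains at least one Age == 6.
m1 = df.assign(test=df['Age'] == 6).groupby('unique_id')['test'].transform('any')

# m2: True on the rows where pid is 1.
m2 = df['pid'] == 1

# Both conditions must hold; cast the boolean result to 0/1.
df['recode'] = (m1 & m2).astype(int)
print(df)

# To write the result out as in the question:
# df.to_csv('recode_elderly.csv', index=False)
```

With real data, replace the io.StringIO call with the path to the CSV file.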

EDIT:

To check the groups with no 6 in the Age column, filter by the inverted mask using ~. If you want only one row per unique_id value, add DataFrame.drop_duplicates:

print (df[~m1])
   unique_id  pid  Age
0          1    1    1
1          1    2    3

df1 = df[~m1].drop_duplicates('unique_id')
print (df1)
   unique_id  pid  Age
0          1    1    1
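
If only the ids themselves are needed rather than the rows, a minimal sketch (assuming the same sample data, inlined here; ids_without_six is a name chosen for illustration) could collect them from the inverted mask:

```python
import io

import pandas as pd

# Sample data from the question, inlined for a self-contained example.
csv_text = """unique_id,pid,Age
1,1,1
1,2,3
2,1,5
2,2,6
3,1,6
3,2,4
3,3,6
3,4,1
3,5,4
4,1,6
4,2,5
"""
df = pd.read_csv(io.StringIO(csv_text))

# m1: True on rows whose unique_id appears among the rows with Age == 6.
m1 = df['unique_id'].isin(df.loc[df['Age'] == 6, 'unique_id'])

# Invert the mask and keep each remaining unique_id once.
ids_without_six = df.loc[~m1, 'unique_id'].unique().tolist()
print(ids_without_six)  # [1]
```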

This is a bit clumsy, since I know numpy a lot better than pandas.

Load your CSV sample into a dataframe:

In [205]: df = pd.read_csv('stack59885878.csv')                                                  
In [206]: df                                                                                     
Out[206]: 
    unique_id  pid  Age
0           1    1    1
1           1    2    3
2           2    1    5
3           2    2    6
4           3    1    6
5           3    2    4
6           3    3    6
7           3    4    1
8           3    5    4
9           4    1    6
10          4    2    5

Generate a groupby object based on the unique_id column:

In [207]: gps = df.groupby('unique_id')                                                          

In [209]: gps.groups                                                                             
Out[209]: 
{1: Int64Index([0, 1], dtype='int64'),
 2: Int64Index([2, 3], dtype='int64'),
 3: Int64Index([4, 5, 6, 7, 8], dtype='int64'),
 4: Int64Index([9, 10], dtype='int64')}

I've seen pandas ways of iterating over groups, but here's a list comprehension. The iteration produces a tuple with the id and a dataframe. We want to test each group dataframe for 'Age' and 'pid' values:

In [211]: recode_values = [(gp['Age']==6).any() & (gp['pid']==1) for x, gp in gps]               
In [212]: recode_values                                                                          
Out[212]: 
[0    False
 1    False
 Name: pid, dtype: bool, 2     True
 3    False
 Name: pid, dtype: bool, 4     True
 5    False
 6    False
 7    False
 8    False
 Name: pid, dtype: bool, 9      True
 10    False
 Name: pid, dtype: bool]

The result is a list of Series, with True where pid is 1 and there is an 'Age' of 6 in the group.

Joining these Series with numpy.hstack produces a boolean array, which we can convert to an integer array:

In [214]: np.hstack(recode_values)                                                               
Out[214]: 
array([False, False,  True, False,  True, False, False, False, False,
        True, False])
In [215]: df['recode']=_.astype(int)            # assign that to a new column
In [216]: df                                                                                     
Out[216]: 
    unique_id  pid  Age  recode
0           1    1    1       0
1           1    2    3       0
2           2    1    5       1
3           2    2    6       0
4           3    1    6       1
5           3    2    4       0
6           3    3    6       0
7           3    4    1       0
8           3    5    4       0
9           4    1    6       1
10          4    2    5       0

Again, I think there's an idiomatic pandas way of joining those series. But for now this works.
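
One candidate for that pandas way of joining is pd.concat, which keeps each Series' index so the result can be assigned back by label. This is a sketch under the assumption that the sample data is inlined as below; pieces is an illustrative name, and the per-group test is written with an explicit bool() for clarity:

```python
import io

import pandas as pd

# Sample data from the question, inlined for a self-contained example.
csv_text = """unique_id,pid,Age
1,1,1
1,2,3
2,1,5
2,2,6
3,1,6
3,2,4
3,3,6
3,4,1
3,5,4
4,1,6
4,2,5
"""
df = pd.read_csv(io.StringIO(csv_text))

# One boolean Series per group: True where pid is 1 and the group has an Age of 6.
pieces = [
    (gp['pid'] == 1) & bool((gp['Age'] == 6).any())
    for _, gp in df.groupby('unique_id')
]

# pd.concat stacks the Series and preserves their original row labels,
# so the assignment below aligns on the index rather than on position.
df['recode'] = pd.concat(pieces).astype(int)
print(df)
```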

===

OK, the groupby object has an apply:

In [223]: def foo(gp): 
     ...:     return (gp['Age']==6).any() & (gp['pid']==1).astype(int) 
     ...:                                                                                        
In [224]: gps.apply(foo)                                                                         
Out[224]: 
unique_id    
1          0     0
           1     0
2          2     1
           3     0
3          4     1
           5     0
           6     0
           7     0
           8     0
4          9     1
           10    0
Name: pid, dtype: int64

And remove the multi-indexing with:

In [242]: gps.apply(foo).reset_index(level=0, drop=True)
Out[242]: 
0     0
1     0
2     1
3     0
4     1
5     0
6     0
7     0
8     0
9     1
10    0
Name: pid, dtype: int64
In [243]: df['recode']=_     # and assign to recode

Lots of experimenting and learning here.
