Python: recode CSV with condition
I am a beginner in Python. I need to recode a CSV file:
unique_id,pid,Age
1,1,1
1,2,3
2,1,5
2,2,6
3,1,6
3,2,4
3,3,6
3,4,1
3,5,4
4,1,6
4,2,5
The condition is: for each [unique_id], if any row has [Age]==6, put a value of 1 in the row of that group where [pid]=1; all other rows should be 0.
The output CSV should look like this:
unique_id,pid,Age,recode
1,1,1,0
1,2,3,0
2,1,5,1
2,2,6,0
3,1,6,1
3,2,4,0
3,3,6,0
3,4,1,0
3,5,4,0
4,1,6,1
4,2,5,0
I was using numpy, like the following:
import numpy
input_file1 = "data.csv"
input_folder = 'G:/My Drive/'
Her_HH =pd.read_csv(input_folder + input_file1)
Her_HH['recode'] = numpy.select([Her_PP['Age']==6,Her_PP['Age']<6], [1,0], default=Her_HH['recode'])
Her_HH.to_csv('recode_elderly.csv', index=False)
But it does not put the value 1 where [pid] is 1. Any help will be appreciated.
You can use DataFrame.assign to create a helper column, then GroupBy.transform with GroupBy.any to test whether each group has at least one match. Chain that mask with the mask that tests pid == 1 using & (bitwise AND), and finally cast the output to integers:
#sorting if necessary
df = df.sort_values('unique_id')
m1 = df.assign(test=df['Age'] == 6).groupby('unique_id')['test'].transform('any')
Another way to get the groups that contain a 6 is to select their unique_id values and test membership with Series.isin:
m1 = df['unique_id'].isin(df.loc[df['Age'] == 6, 'unique_id'])
m2 = df['pid'] == 1
df['recode'] = (m1 & m2).astype(int)
print (df)
    unique_id  pid  Age  recode
0           1    1    1       0
1           1    2    3       0
2           2    1    5       1
3           2    2    6       0
4           3    1    6       1
5           3    2    4       0
6           3    3    6       0
7           3    4    1       0
8           3    5    4       0
9           4    1    6       1
10          4    2    5       0
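For an end-to-end check, the two masks above can be put into one small script. This is only a sketch under assumptions: the sample data from the question is read from an in-memory string (csv_text is a stand-in for the real file), and the names has_six and has_six_isin are illustrative, not from the original answer.

```python
import io

import pandas as pd

# Sample data from the question; csv_text stands in for the real CSV file.
csv_text = """unique_id,pid,Age
1,1,1
1,2,3
2,1,5
2,2,6
3,1,6
3,2,4
3,3,6
3,4,1
3,5,4
4,1,6
4,2,5
"""

df = pd.read_csv(io.StringIO(csv_text))

# True on every row whose group (unique_id) has at least one Age == 6.
has_six = (df["Age"] == 6).groupby(df["unique_id"]).transform("any")

# The same mask built with isin: ids of the groups that contain a 6.
has_six_isin = df["unique_id"].isin(df.loc[df["Age"] == 6, "unique_id"])

# Flag only the pid == 1 row of those groups.
df["recode"] = (has_six & (df["pid"] == 1)).astype(int)

# Called without a path, to_csv returns the CSV text; pass a path to write a file.
out_text = df.to_csv(index=False)
print(out_text)
```

Both masks select the same rows here, so either can be used; the isin version avoids the groupby altogether.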
EDIT:
To check the groups that have no 6 in the Age column, filter by the inverted mask with ~. If you want only one row per unique_id value, add DataFrame.drop_duplicates:
print (df[~m1])
   unique_id  pid  Age
0          1    1    1
1          1    2    3
df1 = df[~m1].drop_duplicates('unique_id')
print (df1)
   unique_id  pid  Age
0          1    1    1
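The inverted-mask check can be tried on its own with the sketch below. It assumes the sample data from the question, rebuilt inline so the snippet is self-contained; the names no_six_rows, no_six_first and no_six_ids are illustrative only.

```python
import pandas as pd

# Sample data from the question, rebuilt inline.
df = pd.DataFrame({
    "unique_id": [1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4],
    "pid":       [1, 2, 1, 2, 1, 2, 3, 4, 5, 1, 2],
    "Age":       [1, 3, 5, 6, 6, 4, 6, 1, 4, 6, 5],
})

# True for rows whose group contains at least one Age == 6.
m1 = df["unique_id"].isin(df.loc[df["Age"] == 6, "unique_id"])

# Rows belonging to groups without any Age == 6.
no_six_rows = df[~m1]

# One row per such group (the first row of each unique_id is kept).
no_six_first = no_six_rows.drop_duplicates("unique_id")

# If only the ids are needed, unique() returns them directly.
no_six_ids = no_six_rows["unique_id"].unique().tolist()
print(no_six_ids)
```

If only the group ids matter, the last line is probably enough; drop_duplicates is useful when the other columns of the first row should be kept as well.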
This is a bit clumsy, since I know numpy a lot better than pandas.
Load your CSV sample into a dataframe:
In [205]: df = pd.read_csv('stack59885878.csv')
In [206]: df
Out[206]:
    unique_id  pid  Age
0           1    1    1
1           1    2    3
2           2    1    5
3           2    2    6
4           3    1    6
5           3    2    4
6           3    3    6
7           3    4    1
8           3    5    4
9           4    1    6
10          4    2    5
Generate a groupby object based on the unique_id column:
In [207]: gps = df.groupby('unique_id')
In [209]: gps.groups
Out[209]:
{1: Int64Index([0, 1], dtype='int64'),
2: Int64Index([2, 3], dtype='int64'),
3: Int64Index([4, 5, 6, 7, 8], dtype='int64'),
4: Int64Index([9, 10], dtype='int64')}
I've seen pandas ways of iterating over groups, but here's a list comprehension. Each iteration produces a tuple with the id and a dataframe. We want to test each group dataframe for 'Age' and 'pid' values:
In [211]: recode_values = [(gp['Age']==6).any() & (gp['pid']==1) for x, gp in gps]
In [212]: recode_values
Out[212]:
[0    False
 1    False
 Name: pid, dtype: bool, 2     True
 3    False
 Name: pid, dtype: bool, 4     True
 5    False
 6    False
 7    False
 8    False
 Name: pid, dtype: bool, 9      True
 10    False
 Name: pid, dtype: bool]
The result is a list of Series, with a True where pid is 1 and there's an 'Age' of 6 in the group.
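To make the per-group test easier to follow, the same logic can be written as an explicit loop. This is a sketch assuming the sample data from the question; the summary dict and the variable names are illustrative only.

```python
import pandas as pd

# Sample data from the question, rebuilt inline.
df = pd.DataFrame({
    "unique_id": [1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4],
    "pid":       [1, 2, 1, 2, 1, 2, 3, 4, 5, 1, 2],
    "Age":       [1, 3, 5, 6, 6, 4, 6, 1, 4, 6, 5],
})

summary = {}
# Iterating a groupby yields (group key, sub-dataframe) pairs.
for uid, gp in df.groupby("unique_id"):
    # One boolean for the whole group: does any row have Age == 6?
    has_six = bool((gp["Age"] == 6).any())
    # Broadcast that boolean against the per-row pid == 1 test.
    flags = has_six & (gp["pid"] == 1)
    summary[int(uid)] = flags.tolist()

print(summary)
```

Each sub-dataframe keeps the row labels of the original frame, which is why the results can later be lined up with df again.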
Joining these Series with numpy.hstack produces a boolean array, which we can convert to an integer array:
In [214]: np.hstack(recode_values)
Out[214]:
array([False, False,  True, False,  True, False, False, False, False,
        True, False])
In [215]: df['recode']=_.astype(int) # assign that to a new column
In [216]: df
Out[216]:
    unique_id  pid  Age  recode
0           1    1    1       0
1           1    2    3       0
2           2    1    5       1
3           2    2    6       0
4           3    1    6       1
5           3    2    4       0
6           3    3    6       0
7           3    4    1       0
8           3    5    4       0
9           4    1    6       1
10          4    2    5       0
Again, I think there's an idiomatic pandas way of joining those Series. But for now this works.
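One candidate for that pandas way of joining the Series is pandas.concat, which keeps the original row labels so the result aligns with df on assignment. The sketch below assumes the sample data from the question; it is offered as one option, not necessarily the most idiomatic one.

```python
import pandas as pd

# Sample data from the question, rebuilt inline.
df = pd.DataFrame({
    "unique_id": [1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4],
    "pid":       [1, 2, 1, 2, 1, 2, 3, 4, 5, 1, 2],
    "Age":       [1, 3, 5, 6, 6, 4, 6, 1, 4, 6, 5],
})

# One boolean Series per group, as in the list comprehension above.
recode_values = [
    (gp["Age"] == 6).any() & (gp["pid"] == 1)
    for _, gp in df.groupby("unique_id")
]

# concat stacks the Series and keeps their row labels.
recode = pd.concat(recode_values).astype(int)

# Assignment aligns on the index, so row order in df does not matter.
df["recode"] = recode
print(df)
```

Unlike numpy.hstack, this does not rely on the groups coming back in the same order as the rows of df.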
=== ===
OK, the groupby object has an apply:
In [223]: def foo(gp):
     ...:     return (gp['Age']==6).any() & (gp['pid']==1).astype(int)
     ...:
In [224]: gps.apply(foo)
Out[224]:
unique_id
1          0     0
           1     0
2          2     1
           3     0
3          4     1
           5     0
           6     0
           7     0
           8     0
4          9     1
           10    0
Name: pid, dtype: int64
And remove the multi-indexing with (drop passed by keyword, since newer pandas versions no longer accept it positionally):
In [242]: gps.apply(foo).reset_index(level=0, drop=True)
Out[242]:
0 0
1 0
2 1
3 0
4 1
5 0
6 0
7 0
8 0
9 1
10 0
Name: pid, dtype: int64
In [243]: df['recode']=_ # and assign to recode
Lots of experimenting and learning here.