
Adding columns to a pandas dataframe based on values of another column

This is part of an ongoing series of problems I'm having while trying to condense a CSV file that has multiple rows per client, one row for each medical service the client received. I've included the dataframe at the bottom.

I'm trying to calculate how many times a client (identified by an id_profile number) got each type of service and add that number to a column named for the type of service. So, if a client got 3 Early Intervention Services, I would add the number "3" to the "eisserv" column. Once that is done, I want to combine all of a client's rows into one.

Where I'm getting stuck is populating 3 different columns with data based on one column. I am trying to iterate through the rows, using some strings for the loop to compare against. The code runs, but for reasons I can't understand, all the strings change to "25" as it runs.

import pandas as pd
df = pd.read_csv('fakeRWclient.csv')

df['PrimaryServiceCategory'] = df['PrimaryServiceCategory'].map({'Referral for Health Care/Supportive Services': '33', 'Health Education/Risk reduction': '25', 'Early Intervention Services (Parts A and B)': '11'})

df['ServiceDate'] = pd.to_datetime(df['ServiceDate'], format="%m/%d/%Y")
df['id_profile'] = df['id_profile'].apply(str)
df['served'] = df['id_profile']  + " " + df['PrimaryServiceCategory']

df['count'] = df['served'].map(df['served'].value_counts())
eis = "11"
ref = "33"
her = "25"
print("Here are the string values")
print(eis)
print(ref)
print(her)
df['herrserv']=""
df['refserv']=""
df['eisserv']=""
for index in df.itertuples():
    for eis in df['PrimaryServiceCategory']:
        df['eisserv'] = df['count']
    for her in df['PrimaryServiceCategory']:
        df['herrserv'] = df['count']
    for ref in df['PrimaryServiceCategory']:
        df['refserv'] = df['count']
print("Here are the string values")
print(eis)
print(ref)
print(her)

Here is the output:

Here are the string values
11
33
25
Here are the string values
25
25
25
  id_profile ServiceDate PrimaryServiceCategory     served  count  herrserv  refserv  eisserv
0        439  2017-12-05                     25     439 25      1         1        1        1
1     444654  2017-01-25                     25  444654 25      2         2        2        2
2      56454  2017-12-05                     33   56454 33      1         1        1        1
3      56454  2017-01-25                     25   56454 25      2         2        2        2
4     444654  2017-03-01                     25  444654 25      2         2        2        2
5      56454  2017-01-01                     25   56454 25      2         2        2        2
6      12222  2017-01-05                     11   12222 11      1         1        1        1
7      12222  2017-01-30                     25   12222 25      3         3        3        3
8      12222  2017-03-01                     25   12222 25      3         3        3        3
9      12222  2017-03-20                     25   12222 25      3         3        3        3

Why do the string values switch? And is this even the right approach for what I'm hoping to do?

You can map your integer codes to category names, apply pandas.get_dummies, and then join the result back onto your dataframe.

You can add a 'count' column summing the 3 category counts afterwards.

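import pandas as pd

df = pd.DataFrame({'id_profile': [439, 444654, 56454, 56454, 444654, 56454, 12222, 12222, 12222, 12222],
                   'ServiceDate': ['2017-12-05', '2017-01-25', '2017-12-05', '2017-01-25', '2017-03-01', '2017-01-01', '2017-01-05', '2017-01-30', '2017-03-01', '2017-03-20'],
                   'PrimaryServiceCategory': [25, 25, 33, 25, 25, 25, 11, 25, 25, 25]})

# Map the numeric service codes to short names.
d = {11: 'eis', 33: 'ref', 25: 'her'}
df['Service'] = df['PrimaryServiceCategory'].map(d)

# Build one indicator column per service, sum them per client, and join the
# totals back onto the original rows. Only id_profile and Service are passed
# to get_dummies so the joined columns do not overlap with the originals.
df = df.set_index('id_profile')\
       .join(pd.get_dummies(df[['id_profile', 'Service']], columns=['Service'])\
               .groupby('id_profile').sum())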

#            ServiceDate  PrimaryServiceCategory Service  Service_eis  \
# id_profile                                                            
# 439         2017-12-05                      25     her            0   
# 12222       2017-01-05                      11     eis            1   
# 12222       2017-01-30                      25     her            1   
# 12222       2017-03-01                      25     her            1   
# 12222       2017-03-20                      25     her            1   
# 56454       2017-12-05                      33     ref            0   
# 56454       2017-01-25                      25     her            0   
# 56454       2017-01-01                      25     her            0   
# 444654      2017-01-25                      25     her            0   
# 444654      2017-03-01                      25     her            0   

#             Service_her  Service_ref  
# id_profile                            
# 439                   1            0  
# 12222                 3            0  
# 12222                 3            0  
# 12222                 3            0  
# 12222                 3            0  
# 56454                 2            1  
# 56454                 2            1  
# 56454                 2            1  
# 444654                2            0  
# 444654                2            0  
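The join above keeps one row per service, with the per-client totals repeated on each row. If the goal is a single row per client, as the question describes, the join can be skipped. The following is a minimal sketch under that assumption; the column names eisserv, refserv and herrserv are taken from the question, and `service_names` and `per_client` are names made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'id_profile': [439, 444654, 56454, 56454, 444654, 56454, 12222, 12222, 12222, 12222],
    'PrimaryServiceCategory': [25, 25, 33, 25, 25, 25, 11, 25, 25, 25],
})

# Hypothetical helper mapping, same codes as in the answer above.
service_names = {11: 'eis', 33: 'ref', 25: 'her'}
df['Service'] = df['PrimaryServiceCategory'].map(service_names)

# One indicator column per service, summed per client: one row per client.
per_client = (
    pd.get_dummies(df['Service'])
      .groupby(df['id_profile'])
      .sum()
      .astype(int)
      .rename(columns={'eis': 'eisserv', 'ref': 'refserv', 'her': 'herrserv'})
)

# Total number of services per client across the three categories.
per_client['count'] = per_client.sum(axis=1)
print(per_client)
```

Each client then appears once in the index, with the three service counts and their total as columns.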

I have only made changes to your existing code.

    import pandas as pd
    df = pd.read_csv('fakeRWclient.csv')

    df['PrimaryServiceCategory'] = df['PrimaryServiceCategory'].map({'Referral for Health Care/Supportive Services': '33', 'Health Education/Risk reduction': '25', 'Early Intervention Services (Parts A and B)': '11'})

    df['ServiceDate'] = pd.to_datetime(df['ServiceDate'], format="%m/%d/%Y")
    df['id_profile'] = df['id_profile'].apply(str)

    print(df.groupby('id_profile').PrimaryServiceCategory.count())

The code above gives output like this:

id_profile
439       1
12222     4
56454     3
444654    2
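The count above is the total number of service rows per client. To break it down by service type, one possible variation (a sketch, not part of the original answer) is to group by both columns and unstack the service code into columns. The sample data below is an assumption that mirrors the question's output, with the ids and codes stored as strings as in the question's code.

```python
import pandas as pd

df = pd.DataFrame({
    'id_profile': ['439', '444654', '56454', '56454', '444654',
                   '56454', '12222', '12222', '12222', '12222'],
    'PrimaryServiceCategory': ['25', '25', '33', '25', '25',
                               '25', '11', '25', '25', '25'],
})

# Count rows per (client, service code), then pivot the service codes
# into columns; combinations that never occur are filled with 0.
counts = (
    df.groupby(['id_profile', 'PrimaryServiceCategory'])
      .size()
      .unstack(fill_value=0)
)
print(counts)
```

The result has one row per client and one column per service code ('11', '25', '33').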

The values of eis, ref and her switch to "25" because you are looping over the PrimaryServiceCategory column, and the last value in that series is "25". You are using eis, ref and her as the names of the loop variables, so they are reassigned on every iteration instead of being compared against anything. Looping like this is also inefficient. It's better to use groupby and transform:

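# Select a single column before transform so the result is a Series
# that can be assigned to one column.
df['count'] = df.groupby(['id_profile', 'PrimaryServiceCategory'])['ServiceDate'].transform('count')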
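To see the rebinding described above in isolation, here is a small sketch in plain Python. The list `categories` is made-up sample data standing in for the PrimaryServiceCategory column, and `last_seen` and `matches` are names invented for this example.

```python
eis = "11"
categories = ["25", "33", "11", "25"]  # stand-in for df['PrimaryServiceCategory']

# The name after `for` is assigned on every iteration; nothing is compared.
for eis in categories:
    pass

last_seen = eis
print(last_seen)  # the last element of the list, not the original "11"

# To test values against a string instead of overwriting it, compare explicitly.
eis = "11"
matches = [category == eis for category in categories]
print(matches)
```

With a pandas column, the equivalent comparison would be `df['PrimaryServiceCategory'] == eis`, which yields a boolean Series without touching `eis`.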

