Iterate through df column and return value in dataframe based on row index, column reference

My goal is to compare each value in the "year" column against the year-named columns (i.e. 1999, 2000, ...) and then return the value from the matching column. For example, for Afghanistan (first row), year 2004, I want to find the column named "2004" and return the value from the row that contains Afghanistan.

Here is the table. For reference, this table is the result of a SQL join between educational attainment in a single defined year and a table of GDP per country for the years 1999-2010. My ultimate goal is to return the GDP from the year that the educational data is from.

       country  year  men_ed_yrs  women_ed_yrs  total_ed_yrs         1999         2000         2001         2002         2003         2004          2005          2006          2007          2008          2009          2010
0  Afghanistan  2004          11             5             8          NaN          NaN   2461666315   4128818042   4583648922   5285461999  6.275076e+09  7.057598e+09  9.843842e+09  1.019053e+10  1.248694e+10  1.593680e+10
1      Albania  2004          11            11            11   3414760915   3632043908   4060758804   4435078648   5746945913   7314865176  8.158549e+09  8.992642e+09  1.070101e+10  1.288135e+10  1.204421e+10  1.192695e+10
2      Algeria  2005          13            13            13  48640611686  54790060513  54744714110  56760288396  67863829705  85324998959  1.030000e+11  1.170000e+11  1.350000e+11  1.710000e+11  1.370000e+11  1.610000e+11
3      Andorra  2008          11            12            11   1239840270   1401694156   1484004617   1717563533   2373836214   2916913449  3.248135e+09  3.536452e+09  4.010785e+09  4.001349e+09  3.649863e+09  3.346317e+09
4     Anguilla  2008          11            11            11          NaN          NaN          NaN          NaN          NaN          NaN           NaN           NaN           NaN           NaN           NaN           NaN

gdp_ed_list = []
for value in df_combined_column_named['year']:  # loops through each year in year column
    if value in df_combined_column_named.columns:  # compares year to column names
        idx = df_combined_column_named[df_combined_column_named['year'][value]].index.tolist()  # supposed to get the index associated with value
        gdp_ed = df_combined_column_named.get_value(idx, value)  # get the value of the cell found at idx, value
        gdp_ed_list.append(gdp_ed)  # append to a list

Currently, my code is getting stuck at the .index.tolist() line. It returns this error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-85-361acb97edd4> in <module>()
      2 for value in df_combined_column_named['year']: #loops through each year in year column
      3     if value in df_combined_column_named.columns: #compares year to column names
----> 4         idx = df_combined_column_named[df_combined_column_named['year'][value]].index.tolist()
      5         gdp_ed = df_combined_column_named.get_value(idx, value)
      6         gdp_ed_list.append(gdp_ed)
KeyError: u'2004'

Any thoughts?

It looks like you are trying to match the value in the year column to the column labels and then extract the value in the corresponding cell. You could do that by looping through the rows (see below), but I think that would not be the fastest way. Instead, you could use pd.melt to coalesce the columns with year-like labels into a single column, say, year_col:

In [38]: melted = pd.melt(df, id_vars=['country', 'year', 'men_ed_yrs', 'women_ed_yrs', 'total_ed_yrs'], var_name='year_col')

In [39]: melted
Out[39]: 
        country  year  men_ed_yrs  women_ed_yrs  total_ed_yrs year_col         value  
0   Afghanistan  2004          11             5             8     1999            NaN   
1       Albania  2004          11            11            11     1999   3.414761e+09   
2       Algeria  2005          13            13            13     1999   4.864061e+10   
3       Andorra  2008          11            12            11     1999   1.239840e+09   
4      Anguilla  2008          11            11            11     1999            NaN   
5   Afghanistan  2004          11             5             8     2000            NaN
...

The benefit of "melting" the DataFrame in this way is that you now have both year and year_col columns. The values you are looking for are in the rows where year equals year_col, and those are easy to obtain by using .loc:

In [41]: melted.loc[melted['year'] == melted['year_col']]
Out[41]: 
        country  year  men_ed_yrs  women_ed_yrs  total_ed_yrs year_col  \
25  Afghanistan  2004          11             5             8     2004   
26      Albania  2004          11            11            11     2004   
32      Algeria  2005          13            13            13     2005   
48      Andorra  2008          11            12            11     2008   
49     Anguilla  2008          11            11            11     2008   

           value  
25  5.285462e+09  
26  7.314865e+09  
32  1.030000e+11  
48  4.001349e+09  
49           NaN  

Thus, you could use

import numpy as np
import pandas as pd
nan = np.nan
df = pd.DataFrame({'1999': [nan, 3414760915.0, 48640611686.0, 1239840270.0, nan],
 '2000': [nan, 3632043908.0, 54790060513.0, 1401694156.0, nan],
 '2001': [2461666315.0, 4060758804.0, 54744714110.0, 1484004617.0, nan],
 '2002': [4128818042.0, 4435078648.0, 56760288396.0, 1717563533.0, nan],
 '2003': [4583648922.0, 5746945913.0, 67863829705.0, 2373836214.0, nan],
 '2004': [5285461999.0, 7314865176.0, 85324998959.0, 2916913449.0, nan],
 '2005': [6275076000.0, 8158549000.0, 103000000000.0, 3248135000.0, nan],
 '2006': [7057598000.0, 8992642000.0, 117000000000.0, 3536452000.0, nan],
 '2007': [9843842000.0, 10701010000.0, 135000000000.0, 4010785000.0, nan],
 '2008': [10190530000.0, 12881350000.0, 171000000000.0, 4001349000.0, nan],
 '2009': [12486940000.0, 12044210000.0, 137000000000.0, 3649863000.0, nan],
 '2010': [15936800000.0, 11926950000.0, 161000000000.0, 3346317000.0, nan],
 'country': ['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Anguilla'],
 'men_ed_yrs': [11, 11, 13, 11, 11],
 'total_ed_yrs': [8, 11, 13, 11, 11],
 'women_ed_yrs': [5, 11, 13, 12, 11],
 'year': ['2004', '2004', '2005', '2008', '2008']})

melted = pd.melt(df, id_vars=['country', 'year', 'men_ed_yrs', 'women_ed_yrs', 
                              'total_ed_yrs'], var_name='year_col')
result = melted.loc[melted['year'] == melted['year_col']]
print(result)
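
One thing to watch for: the comparison above only matches when year and year_col have the same type. If the year column holds integers while the GDP column labels are strings (which can happen after a SQL join or a CSV import), the mask would be all False. Here is a minimal sketch of one way to guard against that. The three-row frame `small` and the output column name `gdp` are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Hypothetical miniature of the joined table; note that `year` is an integer here
# while the GDP column labels are strings.
small = pd.DataFrame({
    'country': ['Afghanistan', 'Albania', 'Algeria'],
    'year': [2004, 2004, 2005],
    '2004': [5285461999.0, 7314865176.0, 85324998959.0],
    '2005': [6275076000.0, 8158549000.0, 103000000000.0],
})

# value_name='gdp' names the melted value column (an illustrative choice).
melted_small = pd.melt(small, id_vars=['country', 'year'],
                       var_name='year_col', value_name='gdp')

# Cast both sides to str so that 2004 (int) matches '2004' (column label).
mask = melted_small['year'].astype(str) == melted_small['year_col'].astype(str)

# Keep only the columns of interest and renumber the rows.
gdp_by_country = (melted_small.loc[mask, ['country', 'year', 'gdp']]
                  .reset_index(drop=True))
print(gdp_by_country)
```

Casting to str rather than to int is a judgment call: it avoids errors if some column labels are not numeric.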

Why was a KeyError raised:

The KeyError is being raised by df_combined_column_named['year'][value]. Suppose value is '2004'. Then df_combined_column_named['year'] is a Series containing string representations of years and indexed by integers (like 0, 1, 2, ...). df_combined_column_named['year'][value] fails because it attempts to index this Series with the string '2004', which is not in the integer index.
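
The same failure can be reproduced on a tiny Series. The sketch below uses a made-up three-element Series named `years` to show the label lookup failing and then a boolean mask, which selects by value rather than by index label, as one possible way to get the matching row positions.

```python
import pandas as pd

# A Series of year strings with the default integer index 0, 1, 2.
years = pd.Series(['2004', '2004', '2005'])

# Square brackets look up index labels, and '2004' is not a label here.
try:
    years['2004']
    raised = False
except KeyError:
    raised = True
print(raised)  # True

# A boolean mask compares the values instead of the index labels.
matching_rows = years[years == '2004'].index.tolist()
print(matching_rows)  # [0, 1]
```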


Alternatively, here is another way to achieve the goal by looping through the rows using iterrows. This is perhaps simpler to understand, but in general iterrows is slow compared to column-based, Pandas-centric methods:

data = []
for idx, row in df.iterrows():
    data.append((row['country'], row['year'], row[row['year']]))
result = pd.DataFrame(data, columns=['country', 'year', 'value'])
print(result)

prints

       country  year         value
0  Afghanistan  2004  5.285462e+09
1      Albania  2004  7.314865e+09
2      Algeria  2005  1.030000e+11
3      Andorra  2008  4.001349e+09
4     Anguilla  2008           NaN
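
For comparison, here is a rough sketch of a loop-free lookup that uses NumPy integer indexing instead of iterrows. This is a different technique from the two shown above, offered only as an illustration. The frame `small` and the list `year_cols` are hypothetical, and the sketch assumes the year values and the column labels are both strings.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the table, with `year` stored as strings.
small = pd.DataFrame({
    'country': ['Afghanistan', 'Albania', 'Algeria'],
    'year': ['2004', '2004', '2005'],
    '2004': [5285461999.0, 7314865176.0, 85324998959.0],
    '2005': [6275076000.0, 8158549000.0, 103000000000.0],
})

year_cols = ['2004', '2005']           # the year-labelled columns
values = small[year_cols].to_numpy()   # 2-D array, one row per country

# Position of each row's year within year_cols; -1 means "no such column".
col_pos = pd.Index(year_cols).get_indexer(small['year'].tolist())
row_pos = np.arange(len(small))

# Pick one cell per row; rows whose year has no matching column become NaN.
small['value'] = np.where(col_pos >= 0, values[row_pos, col_pos], np.nan)
print(small[['country', 'year', 'value']])
```

Whether this is worth the extra indirection depends on the table size; for a handful of rows the iterrows loop is just as good and easier to read.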
