[英]iterate through df column and return value in dataframe based on row index, column reference
My goal is to compare each value from the column "year" against the appropriate column year (ie 1999, 2000). 我的目标是将“年份”列中的每个值与相应的列年份(即1999年,2000年)进行比较。 I then want to return the corresponding value from the corresponding column.
然后我想从相应的列返回相应的值。 For example, for Afghanistan (first row), year 2004, I want to find the column named "2004" and return the value from the row that contains afghanistan.
例如,对于2004年的阿富汗(第一行),我想找到名为“2004”的列,并从包含阿富汗的行返回值。
Here is the table. 这是表格。 For reference this table is the result of a sql join between educational attainment in a single defined year and a table for gdp per country for years 1999 - 2010. My ultimate goal is to return the gdp from the year that the educational data is from.
作为参考,该表是1999年至2010年期间单个定义年份的教育程度和每个国家的gdp表之间的sql连接的结果。我的最终目标是从教育数据来自年份返回gdp。
country year men_ed_yrs women_ed_yrs total_ed_yrs 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
0 Afghanistan 2004 11 5 8 NaN NaN 2461666315 4128818042 4583648922 5285461999 6.275076e+09 7.057598e+09 9.843842e+09 1.019053e+10 1.248694e+10 1.593680e+10
1 Albania 2004 11 11 11 3414760915 3632043908 4060758804 4435078648 5746945913 7314865176 8.158549e+09 8.992642e+09 1.070101e+10 1.288135e+10 1.204421e+10 1.192695e+10
2 Algeria 2005 13 13 13 48640611686 54790060513 54744714110 56760288396 67863829705 85324998959 1.030000e+11 1.170000e+11 1.350000e+11 1.710000e+11 1.370000e+11 1.610000e+11
3 Andorra 2008 11 12 11 1239840270 1401694156 1484004617 1717563533 2373836214 2916913449 3.248135e+09 3.536452e+09 4.010785e+09 4.001349e+09 3.649863e+09 3.346317e+09
4 Anguilla 2008 11 11 11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
gdp_ed_list = []
for value in df_combined_column_named['year']: #loops through each year in year column
if value in df_combined_column_named.columns: #compares year to column names
idx = df_combined_column_named[df_combined_column_named['year'][value]].index.tolist() #supposed to get the index associated with value
gdp_ed = df_combined_column_named.get_value(idx, value) #get the value of the cell found at idx, value
gdp_ed_list.append(gdp_ed) #append to a list
Currently, my code is getting stuck at the index.list() section. 目前,我的代码陷入了index.list()部分。 It is returning the error:
它返回错误:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-85-361acb97edd4> in <module>()
2 for value in df_combined_column_named['year']: #loops through each year in year column
3 if value in df_combined_column_named.columns: #compares year to column names
----> 4 idx = df_combined_column_named[df_combined_column_named['year'][value]].index.tolist()
5 gdp_ed = df_combined_column_named.get_value(idx, value)
6 gdp_ed_list.append(gdp_ed)
KeyError: u'2004'
Any thoughts? 有什么想法吗?
It looks like you are trying to match the value in the year
column to column labels and then extract the value in the corresponding cells. 您似乎尝试将
year
列中的值与列标签匹配,然后在相应的单元格中提取值。 You could do that by looping through the rows (see below) but I think it would be not be the fastest way. 你可以通过遍历行来实现这一点(见下文),但我认为这不是最快的方法。 Instead, you could use
pd.melt
to coalesce the columns with year-like labels into a single column, say, year_col
: 相反,您可以使用
pd.melt
将具有类似年份的标签的列合并到一个列中,例如year_col
:
In [38]: melted = pd.melt(df, id_vars=['country', 'year', 'men_ed_yrs', 'women_ed_yrs', 'total_ed_yrs'], var_name='year_col')
In [39]: melted
Out[39]:
country year men_ed_yrs women_ed_yrs total_ed_yrs year_col value
0 Afghanistan 2004 11 5 8 1999 NaN
1 Albania 2004 11 11 11 1999 3.414761e+09
2 Algeria 2005 13 13 13 1999 4.864061e+10
3 Andorra 2008 11 12 11 1999 1.239840e+09
4 Anguilla 2008 11 11 11 1999 NaN
5 Afghanistan 2004 11 5 8 2000 NaN
...
The benefit of "melting" the DataFrame in this way is that now you would have both year
and year_col
columns. 以这种方式“熔化”DataFrame的好处是,现在您将拥有
year
和year_col
列。 The values you are looking for are in the rows where year
equals year_col
. 您要查找的值位于
year
等于year_col
的行中 。 And that is easy to obtain by using .loc
: 使用
.loc
很容易获得:
In [41]: melted.loc[melted['year'] == melted['year_col']]
Out[41]:
country year men_ed_yrs women_ed_yrs total_ed_yrs year_col \
25 Afghanistan 2004 11 5 8 2004
26 Albania 2004 11 11 11 2004
32 Algeria 2005 13 13 13 2005
48 Andorra 2008 11 12 11 2008
49 Anguilla 2008 11 11 11 2008
value
25 5.285462e+09
26 7.314865e+09
32 1.030000e+11
48 4.001349e+09
49 NaN
Thus, you could use 因此,你可以使用
import numpy as np
import pandas as pd
nan = np.nan
df = pd.DataFrame({'1999': [nan, 3414760915.0, 48640611686.0, 1239840270.0, nan],
'2000': [nan, 3632043908.0, 54790060513.0, 1401694156.0, nan],
'2001': [2461666315.0, 4060758804.0, 54744714110.0, 1484004617.0, nan],
'2002': [4128818042.0, 4435078648.0, 56760288396.0, 1717563533.0, nan],
'2003': [4583648922.0, 5746945913.0, 67863829705.0, 2373836214.0, nan],
'2004': [5285461999.0, 7314865176.0, 85324998959.0, 2916913449.0, nan],
'2005': [6275076000.0, 8158549000.0, 103000000000.0, 3248135000.0, nan],
'2006': [7057598000.0, 8992642000.0, 117000000000.0, 3536452000.0, nan],
'2007': [9843842000.0, 10701010000.0, 135000000000.0, 4010785000.0, nan],
'2008': [10190530000.0, 12881350000.0, 171000000000.0, 4001349000.0, nan],
'2009': [12486940000.0, 12044210000.0, 137000000000.0, 3649863000.0, nan],
'2010': [15936800000.0, 11926950000.0, 161000000000.0, 3346317000.0, nan],
'country': ['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Anguilla'],
'men_ed_yrs': [11, 11, 13, 11, 11],
'total_ed_yrs': [8, 11, 13, 11, 11],
'women_ed_yrs': [5, 11, 13, 12, 11],
'year': ['2004', '2004', '2005', '2008', '2008']})
melted = pd.melt(df, id_vars=['country', 'year', 'men_ed_yrs', 'women_ed_yrs',
'total_ed_yrs'], var_name='year_col')
result = melted.loc[melted['year'] == melted['year_col']]
print(result)
Why was a KeyError
raised : 为什么会引发
KeyError
:
The KeyError
is being raised by df_combined_column_named['year'][value]
. 该
KeyError
正在被提出df_combined_column_named['year'][value]
。 Suppose value
is '2004'
. 假设
value
'2004'
。 Then df_combined_column_named['year']
is a Series containing string representations of years and indexed by integers (like 0, 1, 2, ...). 然后
df_combined_column_named['year']
是一个包含df_combined_column_named['year']
字符串表示的系列,并用整数索引(如0,1,2,...)。 df_combined_column_named['year'][value]
fails because it attempts to index this Series with the string '2004'
which is not in the integer index. df_combined_column_named['year'][value]
失败,因为它尝试使用不在整数索引中的字符串'2004'
索引此Series。
Alternatively, here is another way to achieve the goal by looping through the rows using iterrows
. 或者,这是通过使用
iterrows
循环遍历来实现目标的另一种方法。 This is perhaps simpler to understand, but in general using iterrows
is slow compared to other column-based Pandas-centric methods : 这可能更容易理解,但与其他基于列的Pandas中心方法相比 ,通常使用
iterrows
很慢 :
data = []
for idx, row in df.iterrows():
data.append((row['country'], row['year'], row[row['year']]))
result = pd.DataFrame(data, columns=['country', 'year', 'value'])
print(result)
prints 版画
country year value
0 Afghanistan 2004 5.285462e+09
1 Albania 2004 7.314865e+09
2 Algeria 2005 1.030000e+11
3 Andorra 2008 4.001349e+09
4 Anguilla 2008 NaN
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.