Defining lists based on indices of pandas dataframe
I have a pandas dataframe, and one of the columns has date values as strings (like "2014-01-01"). I would like to define a different list for each year that is present in the column, where the elements of the list are the indices of the rows in which that year is found in the dataframe.
Here's what I've tried:
import pandas as pd
df = pd.DataFrame(["2014-01-01","2013-01-01","2014-02-02", "2012-08-09"])
df = df.values.flatten().tolist()
for i in range(len(df)):
    df[i] = df[i][0:4]
y2012 = []; y2013 = []; y2014 = []
for i in range(len(df)):
    if df[i] == "2012":
        y2012.append(i)
    elif df[i] == "2013":
        y2013.append(i)
    else:
        y2014.append(i)
print(y2014) # [0, 2]
print(y2013) # [1]
print(y2012) # [3]
Does anyone know a better way of doing this? This way works fine, but I have a lot of years, so I have to manually define each variable and then run it through the for loop, and so the code gets really long. I was trying to use groupby in pandas, but I couldn't seem to get it to work.
Thank you so much for any help!
Scan through the original DataFrame values and parse out the year. Given that, add the index into a defaultdict. That is, the following code creates a dict with one item per year. The value for a specific year is a list of the rows in which the year is found in the dataframe.
A defaultdict sounds scary, but it's just a dictionary. In this case, each value is a list. If we append to the value of a key that doesn't exist yet, an empty list is created for it first. Convenient!
from collections import defaultdict
import pandas as pd
df = pd.DataFrame(["2014-01-01","2013-01-01","2014-02-02", "2012-08-09"])
# df = df.values.flatten().tolist()
dindex = defaultdict(list)
for index, dateval in enumerate(df.values):
    # each row of df.values is a one-element array holding the date string
    year = dateval[0].split('-')[0]
    dindex[year].append(index)
assert dindex == {'2014': [0, 2], '2013': [1], '2012': [3]}
print(dindex)
defaultdict(<class 'list'>, {'2014': [0, 2], '2013': [1], '2012': [3]})
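To make the "spontaneously created" behaviour concrete, here is a minimal sketch that contrasts a plain dict with a defaultdict. The names plain, auto and by_year are made up for illustration and are not part of the answer above.

```python
from collections import defaultdict

# With a plain dict, appending to a missing key raises KeyError,
# so the list has to be created by hand first.
plain = {}
try:
    plain["2014"].append(0)
except KeyError:
    plain["2014"] = [0]

# With defaultdict(list), a missing key is filled with list() on first access,
# so append works right away.
auto = defaultdict(list)
auto["2014"].append(0)
auto["2014"].append(2)

# A defaultdict converts to an ordinary dict whenever that is more convenient.
by_year = dict(auto)
```

If you would rather not import anything, plain.setdefault(year, []).append(index) does the same job on a regular dict.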
Pandas is awesome for this kind of thing, so don't be so hasty to turn your dataframe back into lists right away.
The trick here lies in the .apply() method and the .groupby() method.
Here's some code for you to play with and grok:
import pandas as pd
import dateutil.parser
df = pd.DataFrame({'strings': ["2014-01-01","2013-01-01","2014-02-02", "2012-08-09"]})
df['datetimes'] = df['strings'].apply(dateutil.parser.parse)
df['year'] = df['datetimes'].apply(lambda x: x.year)
grouped_data = df.groupby('year')
lists_by_year = {}
for year, data in grouped_data:
    lists_by_year[year] = list(data['strings'])
Which gives us a dictionary of lists, where the key is the year and the value is a list of the date strings from that year.
print(lists_by_year)
{2012: ['2012-08-09'],
2013: ['2013-01-01'],
2014: ['2014-01-01', '2014-02-02']}
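The question asked for row indices rather than the date strings themselves. A possible variation on the same groupby idea is sketched below; it swaps in pd.to_datetime and the .dt.year accessor for the dateutil-based apply calls, which is my substitution and not what the answer used. The names years and index_lists_by_year are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"strings": ["2014-01-01", "2013-01-01", "2014-02-02", "2012-08-09"]})

# Parse the whole column at once and pull out the year of each date.
years = pd.to_datetime(df["strings"]).dt.year

# Group the rows by that year Series and keep each group's row labels.
index_lists_by_year = {
    int(year): rows.index.tolist()
    for year, rows in df.groupby(years)
}
```

With the default integer index, the row labels collected here are the same as the row positions.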
As it turns out,
df.groupby('A') # is just syntactic sugar for df.groupby(df['A'])
This means that all you have to do to group by year is leverage the apply function and rework the syntax:
Solution
# assumes the date strings are in a column named "dates"
getYear = lambda x: x.split("-")[0]
yearGroups = df.groupby(df["dates"].apply(getYear))
Output
for key, group in yearGroups:
    print(key)
2012
2013
2014
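The loop above only prints the group keys. As a rough sketch of how the index lists could be read off the grouped object, the .groups attribute of a pandas GroupBy maps each key to the row labels in that group. The column name "dates" follows the answer; year_groups and index_lists are illustrative names, and .str[:4] is used here in place of the apply call.

```python
import pandas as pd

df = pd.DataFrame({"dates": ["2014-01-01", "2013-01-01", "2014-02-02", "2012-08-09"]})

# Group by the first four characters of each date string (the year).
year_groups = df.groupby(df["dates"].str[:4])

# .groups maps each key to an Index of row labels; convert each to a plain list.
index_lists = {year: labels.tolist() for year, labels in year_groups.groups.items()}
```

The keys are strings here because the year is sliced out of the text and never converted to a number.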