Defining lists based on indices of pandas dataframe

I have a pandas dataframe, and one of the columns has date values as strings (like "2014-01-01"). I would like to define a separate list for each year that is present in the column, where the elements of each list are the indices of the rows in which that year appears in the dataframe.

Here's what I've tried:

import pandas as pd    

df = pd.DataFrame(["2014-01-01","2013-01-01","2014-02-02", "2012-08-09"])
df = df.values.flatten().tolist()

for i in range(len(df)):
    df[i] = df[i][0:4]

y2012 = []; y2013 = []; y2014 = []

for i in range(len(df)):
    if df[i] == "2012":
        y2012.append(i)
    elif df[i] == "2013":
        y2013.append(i)
    else:
        y2014.append(i)

print(y2014)  # [0, 2]
print(y2013)  # [1]
print(y2012)  # [3]

Does anyone know a better way of doing this? This way works fine, but I have a lot of years, so I have to manually define a variable for each one and add a branch for it in the for loop, and the code gets really long. I tried to use groupby in pandas, but I couldn't get it to work.

Thank you so much for any help!

Scan through the original DataFrame values and parse out the year. Given that, add the index to a defaultdict. That is, the following code creates a dict with one item per year. The value for a specific year is a list of the rows in which that year is found in the dataframe.

A defaultdict sounds scary, but it's just a dictionary. In this case, each value is a list. If we append to a nonexistent value, then it gets spontaneously created. Convenient!
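To make the "spontaneously created" behaviour concrete, here is a minimal sketch comparing a plain dict with a defaultdict. The variable names `plain` and `auto` are only illustrative and do not come from the answer.

```python
from collections import defaultdict

# Plain dict: appending to a missing key raises KeyError,
# so the list has to be created by hand first.
plain = {}
try:
    plain["2014"].append(0)
except KeyError:
    plain["2014"] = [0]

# defaultdict(list): a missing key is filled in by calling list(),
# so the append works on the first access.
auto = defaultdict(list)
auto["2014"].append(0)
auto["2014"].append(2)

print(dict(auto))  # {'2014': [0, 2]}
```

A defaultdict compares equal to a regular dict with the same contents, which is why it can be checked against a dict literal.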

source

from collections import defaultdict
import pandas as pd    

df = pd.DataFrame(["2014-01-01","2013-01-01","2014-02-02", "2012-08-09"])
# df = df.values.flatten().tolist()

dindex = defaultdict(list)
for index,dateval in enumerate(df.values):
    year = dateval[0].split('-')[0]
    dindex[year].append(index)

assert dindex == {'2014': [0, 2], '2013': [1], '2012': [3]}
print(dindex)

output

defaultdict(<class 'list'>, {'2014': [0, 2], '2013': [1], '2012': [3]})

Pandas is awesome for this kind of thing, so don't be too quick to turn your dataframe back into lists.

The trick here lies in the .apply() method and the .groupby() method.

  1. Take a dataframe that has strings with ISO formatted dates in it
  2. Parse the column containing the date strings into datetime objects
  3. Create another column of years using the datetime.year attribute of the items in the datetime column
  4. Group the dataframe by the new year column
  5. Iterate over the groupby object and extract your column

Here's some code for you to play with and grok:

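import pandas as pd
import dateutil.parser  # the parser submodule must be imported explicitly

df = pd.DataFrame({'strings': ["2014-01-01","2013-01-01","2014-02-02", "2012-08-09"]})
# Parse each ISO date string into a datetime object
df['datetimes'] = df['strings'].apply(dateutil.parser.parse)
# Extract the year from each datetime
df['year'] = df['datetimes'].apply(lambda x: x.year)
grouped_data = df.groupby('year')

# Collect the date strings belonging to each year
lists_by_year = {}
for year, data in grouped_data:
    lists_by_year[year] = list(data['strings'])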

This gives us a dictionary of lists, where the key is the year and the value is a list of the date strings from that year.

print(lists_by_year)

{2012: ['2012-08-09'],
 2013: ['2013-01-01'],
 2014: ['2014-01-01', '2014-02-02']}
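The dictionary above holds the date strings, while the question asked for row indices. A minimal sketch of that variant follows. It swaps in pd.to_datetime and the .dt.year accessor in place of dateutil and .apply(), which is a substitution rather than the answer's own method, and the name `rows_by_year` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"strings": ["2014-01-01", "2013-01-01", "2014-02-02", "2012-08-09"]})

# Vectorized parse of the ISO date strings, then pull out the year.
years = pd.to_datetime(df["strings"]).dt.year

# Group by the year Series; each group's index holds the row labels
# of the rows that fall in that year.
rows_by_year = {
    int(year): group.index.tolist()
    for year, group in df.groupby(years)
}

print(rows_by_year)  # {2012: [3], 2013: [1], 2014: [0, 2]}
```

The lists contain row labels from the dataframe's index. With the default RangeIndex used here, those labels coincide with row positions.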

As it turns out,

df.groupby('A') #is just syntactical sugar for df.groupby(df['A'])

This means that all you have to do to group by year is use the apply function to build the grouping key and pass that to groupby.

Solution

getYear = lambda x:x.split("-")[0]
yearGroups = df.groupby(df["dates"].apply(getYear))

Output

for key, group in yearGroups:
    print(key)

2012
2013
2014
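The loop above prints only the group keys. To get the row indices for each year, one option is the groupby object's .groups attribute, which maps each key to the index labels in that group. This is a sketch under assumptions: the dataframe is rebuilt here with a column named "dates" to match the solution, and `get_year`, `year_groups` and `index_lists` are illustrative names.

```python
import pandas as pd

# Assumed setup: the same four dates as in the question, in a "dates" column.
df = pd.DataFrame({"dates": ["2014-01-01", "2013-01-01", "2014-02-02", "2012-08-09"]})

def get_year(date_string):
    """Return the year part of an ISO date string, e.g. '2014'."""
    return date_string.split("-")[0]

year_groups = df.groupby(df["dates"].apply(get_year))

# .groups maps each key to a pandas Index of row labels;
# tolist() turns each Index into a plain Python list.
index_lists = {year: labels.tolist() for year, labels in year_groups.groups.items()}

print(index_lists)  # {'2012': [3], '2013': [1], '2014': [0, 2]}
```

Because the year is taken by splitting the string, the keys here are strings such as '2014' rather than integers.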
