Rename Pandas Multiindex based on another column's name

I've got a CSV file that is generated in a format I cannot change. The file has a multiindex: the headers span two lines. The first line (the higher level of the index) is left blank wherever the value is the same as in the column to its left.

What my header looks like:

[Image: the header as it appears in the file]

What it actually amounts to, and what I want:

[Image: the header with every top-level label filled in]

I would like to be able to process it correctly in Python 2.7 with Pandas.

I resorted to looping over the first level of the index and, if a value is blank, setting it to the same value as the one on its left.

I start by loading the dataframe in pandas:

import pandas as pd

df = pd.read_csv(myFile, header=[0,1], sep=',')
df

[Image: the dataframe loaded into pandas]

I've tried the following:

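l = []  # collects the repaired column labels
for i, val in enumerate(df.columns.values):
    if val[0][:7] == 'Unnamed':
        # blank top-level header: reuse the label from the column to the left
        l.append([l[i-1][0], val[1]])
    else:
        l.append(val)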

The list "l" I'm getting appears to be what I want:

[('Foo', 'A'),
 ['Foo', 'B'],
 ['Foo', 'C'],
 ('Bar', 'A'),
 ['Bar', 'B'],
 ['Bar', 'C']]

I've tried both of the following:

df.columns = l 

This produces a non-multiindex dataframe:

[Image: flat dataframe]

index = pd.MultiIndex.from_tuples(l)
df.reindex(columns = index)

This one gives me the correct index, but the values disappear:

[Image: dataframe with the values missing]

I have a strong gut feeling that the entire approach I'm trying isn't very pythonic, and that it doesn't make sense to build a list that is then converted to a dict. Any idea how I can build the multiindex properly?

Instead of using reindex, set the columns to your new index directly:

df.columns = pd.MultiIndex.from_tuples(l)

That should produce the desired result.
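As a rough end-to-end sketch of that fix: the CSV text below is made-up sample data standing in for the real file, and the name `fixed` is introduced here only for illustration. It assumes the first column always carries a real top-level label and that pandas names the blank header cells with an "Unnamed" prefix, as in the question.

```python
import io

import pandas as pd

# Hypothetical sample data mimicking the two-line header from the question.
csv_text = (
    "Foo,,,Bar,,\n"
    "A,B,C,A,B,C\n"
    "1,2,3,4,5,6\n"
    "7,8,9,10,11,12\n"
)

df = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

# Carry the last real top-level label forward over the "Unnamed: ..." placeholders.
# Assumes the very first column has a real label; otherwise fixed[-1] would fail.
fixed = []
for top, sub in df.columns:
    if str(top).startswith("Unnamed"):
        top = fixed[-1][0]
    fixed.append((top, sub))

# Assigning a MultiIndex relabels the columns in place and keeps the data.
df.columns = pd.MultiIndex.from_tuples(fixed)
```

Building the list out of tuples only (rather than a mix of tuples and lists) keeps every label the same type before it is handed to from_tuples.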

reindex doesn't just replace the index values (though that sounds like what it should do, and the documentation isn't especially clear). Instead, it goes through your new index labels, picks the rows or columns whose existing labels match them, and puts NaN wherever no old label matches a new one. That's what's happening to you: when reindex reaches ['Foo', 'B'], which doesn't exist in your original dataframe, it fills that column in the new dataframe with NaN.
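The difference can be seen on a small frame. This is only an illustrative sketch: the data and the names `old`, `new`, `reindexed` and `relabeled` are invented here, with column labels modeled on the ones in the question.

```python
import pandas as pd

# Hypothetical frame whose column labels resemble what read_csv produced.
old = pd.MultiIndex.from_tuples(
    [("Foo", "A"), ("Unnamed: 1_level_0", "B"), ("Unnamed: 2_level_0", "C")]
)
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=old)

new = pd.MultiIndex.from_tuples([("Foo", "A"), ("Foo", "B"), ("Foo", "C")])

# reindex selects by label: only ("Foo", "A") exists among the old columns,
# so the other two columns come back filled with NaN.
reindexed = df.reindex(columns=new)

# Assignment relabels by position and leaves the data untouched.
relabeled = df.copy()
relabeled.columns = new
```

Here `reindexed` keeps the values only under ('Foo', 'A'), while `relabeled` keeps all three columns of data under the new labels.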

If your columns are always going to follow a consistent pattern (one top-level column for every three second-level columns, for example), you could also use MultiIndex.from_product to make the column index:

iterables = [["Foo", "Bar"], ["A", "B", "C"]]
index = pd.MultiIndex.from_product(iterables)
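A minimal sketch of putting that index to use, assuming the frame has exactly six columns in Foo/A, Foo/B, Foo/C, Bar/A, Bar/B, Bar/C order (the one-row frame below is invented for illustration):

```python
import pandas as pd

# Hypothetical frame with six columns in Foo/A..C, Bar/A..C order.
df = pd.DataFrame([[1, 2, 3, 4, 5, 6]])

iterables = [["Foo", "Bar"], ["A", "B", "C"]]
index = pd.MultiIndex.from_product(iterables)

# from_product varies the last iterable fastest, which matches the header layout.
# Check the length first: assigning an index of the wrong length raises ValueError.
assert len(index) == df.shape[1]
df.columns = index
```

This only works while the file really keeps that regular layout; if a group ever gains or loses a sub-column, the labels built from the file's own header are the safer choice.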
