pandas reindex with data frame

I have a DataFrame with a three-level MultiIndex, for instance:

                   COL1  COL2  ...
CHROM  POS  LABEL                 
chr1   43   strA   ...   ...   ...
            strB   ...   ...   ...
       66   strC   ...   ...   ...
            strB   ...   ...   ...
chr2   29   strD   ...   ...   ...
...    ...  ...    ...   ...   ...

and a Series whose MultiIndex consists of the first two levels of the DataFrame index:

            VAL
CHROM  POS     
chr1   43   v1
       66   v2
chr2   29   v3
...    ...  ...

I would like to add the Series to the DataFrame as a new column, repeating the values v1, v2, ... for every row whose first two index levels match, like this:

                   COL1  COL2  NEW  ...
CHROM  POS  LABEL                 
chr1   43   strA   ...   ...   v1   ...
            strB   ...   ...   v1   ...
       66   strC   ...   ...   v2   ...
            strB   ...   ...   v2   ...
chr2   29   strD   ...   ...   v3   ...
...    ...  ...    ...   ...   ...  ...

Note that the Series has no missing rows; that is, every (CHROM, POS) pair in the DataFrame is also in the Series. I have a working solution:

pandas.Series(variant_db.index.map(lambda i: cov_per_sample[sample].loc[i[:2]]), index=variant_db.index)

However, because of the lambda, it is quite slow on large data (hundreds of thousands of rows). I tried the much faster:

df['NEW'] = s.reindex(df.index, method='ffill')

But this leaves many NaNs in df['NEW'], which should not happen. With method='bfill' the NaNs appear in different positions, but some rows get NaN in both cases, so even combining the two does not work.

I would like a way to do this using library functions only, for efficiency. Can anyone help?

You can try this very simple solution on your large data and check its performance:

import pandas

df1=pandas.DataFrame([
{'CHROM':'chr1','POS':43,'LABEL':'strA'},
{'CHROM':'chr1','POS':43,'LABEL':'strB'},
{'CHROM':'chr1','POS':66,'LABEL':'strC'},
{'CHROM':'chr1','POS':66,'LABEL':'strB'},
{'CHROM':'chr2','POS':29,'LABEL':'strD'}])

df2=pandas.DataFrame([
{'CHROM':'chr1','POS':43,'VAL':'v1'},
{'CHROM':'chr1','POS':66,'VAL':'v2'},
{'CHROM':'chr2','POS':29,'VAL':'v3'}])

# For each (CHROM, POS) row of df2, write VAL into every matching row of df1.
# .loc replaces .ix, which was removed in pandas 1.0.
for i,r in df2.iterrows():
    df1.loc[(df1['CHROM']==r['CHROM']) & (df1['POS']==r['POS']),'NEW']=r['VAL']

Or using indexes:

import pandas

df1=pandas.DataFrame([
{'CHROM':'chr1','POS':43,'LABEL':'strA','COL':''},
{'CHROM':'chr1','POS':43,'LABEL':'strB','COL':''},
{'CHROM':'chr1','POS':66,'LABEL':'strC','COL':''},
{'CHROM':'chr1','POS':66,'LABEL':'strB','COL':''},
{'CHROM':'chr2','POS':29,'LABEL':'strD','COL':''}]).set_index(['CHROM','POS','LABEL'])

df2=pandas.DataFrame([
{'CHROM':'chr1','POS':43,'VAL':'v1'},
{'CHROM':'chr1','POS':66,'VAL':'v2'},
{'CHROM':'chr2','POS':29,'VAL':'v3'}]).set_index(['CHROM','POS'])

# i is the (CHROM, POS) index tuple of df2; as a partial key it selects
# every LABEL under that pair in df1. .loc replaces the removed .ix.
for i,r in df2.iterrows():
    df1.loc[(i[0],i[1]),'NEW']=r['VAL']
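
Both loops above perform one pandas assignment per row of df2. The same lookup can probably be written as a single left merge on the two key columns, which avoids the Python-level loop. A minimal sketch, assuming the same toy data rebuilt from column lists; the name `merged` is only illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({
    "CHROM": ["chr1", "chr1", "chr1", "chr1", "chr2"],
    "POS": [43, 43, 66, 66, 29],
    "LABEL": ["strA", "strB", "strC", "strB", "strD"],
})
df2 = pd.DataFrame({
    "CHROM": ["chr1", "chr1", "chr2"],
    "POS": [43, 66, 29],
    "VAL": ["v1", "v2", "v3"],
})

# A left merge keeps every row of df1 in its original order and copies
# the value from df2 onto each row with a matching (CHROM, POS) pair.
merged = df1.merge(
    df2.rename(columns={"VAL": "NEW"}),
    on=["CHROM", "POS"],
    how="left",
)
print(merged)
```

If the three-level index is needed afterwards, `merged.set_index(["CHROM", "POS", "LABEL"])` should restore it.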

This is what pandas is all about. Use the indices to your advantage.

df1 = df1.reset_index().set_index(['CHROM', 'POS'])
df1['NEW'] = df2.VAL

Elaborating on the answer provided by @acushner, something like this should work:

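import numpy as np
import pandas as pd

# DataFrame with a three-level index (2 * 3 * 3 = 18 rows)
midx = pd.MultiIndex.from_product(
    [["chr1","chr2"],[43,66,29],["strA","strB","strC"]],
    names=["CHROM", "POS", "LABEL"]
    )

df = pd.DataFrame(np.random.random((18,2)), index=midx)

# Series indexed by the first two levels only (2 * 3 = 6 rows)
midx2 = pd.MultiIndex.from_product([["chr1","chr2"],[43,66,29]],
                                   names=["CHROM", "POS"])
ser = pd.Series(np.random.random(6), index=midx2)

# Align on (CHROM, POS), assign the Series, then restore LABEL as the third level
df = df.reset_index().set_index(['CHROM', 'POS'])
df[2] = ser
df.set_index("LABEL", append=True, inplace=True)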
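
Another option, closer to the `reindex` attempt in the question, is to drop the LABEL level from the DataFrame index and reindex the Series on the resulting two-level index, without any fill method. A sketch on toy data mirroring the question, assuming the Series index is unique; `two_level` is an illustrative name:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [
        ("chr1", 43, "strA"),
        ("chr1", 43, "strB"),
        ("chr1", 66, "strC"),
        ("chr1", 66, "strB"),
        ("chr2", 29, "strD"),
    ],
    names=["CHROM", "POS", "LABEL"],
)
df = pd.DataFrame({"COL1": np.arange(5)}, index=idx)

ser = pd.Series(
    ["v1", "v2", "v3"],
    index=pd.MultiIndex.from_tuples(
        [("chr1", 43), ("chr1", 66), ("chr2", 29)],
        names=["CHROM", "POS"],
    ),
)

# Two-level (CHROM, POS) index with one entry per DataFrame row.
two_level = df.index.droplevel("LABEL")

# Look up each pair in the Series; to_numpy() drops the index so the
# values are assigned positionally, row by row.
df["NEW"] = ser.reindex(two_level).to_numpy()
print(df)
```

Because the values are assigned by position, the DataFrame keeps its original three-level index and no `reset_index` / `set_index` round trip is needed.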
