
Proper indexing to make new dataframes in Pandas

I'm trying to rearrange a horrible CSV file into usable information. I think I'm trying to cheat the slicing process, which produces a lot of indexing-versus-copy warnings and, eventually, the wrong result.

I have data that looks like this:

lipid1 #some of the names of lipids have commas in them which is an added challenge
tissue1,1
tissue2,6
tissue3,3
tissue4,2
tissue5,5


lipid2
tissue1,24
tissue2,15
tissue3,12
tissue4,14
tissue5,10

and I want to turn it into something like this:

        tissue1  tissue2  tissue3  tissue4  tissue5
lipid1  1        6        3        2        5
lipid2  24       15       12       14       10

I'm pretty sure there is an easy solution that I'm overlooking. So far I've been using something like this:

alldata = pd.DataFrame()
for file in glob.glob("All5tissuesPos.csv"):
    filename = file[:-4]
    tissue = file[:-7]

    dirty = pd.read_csv(filename+'.csv', sep='\n', header=None, names=['Arb'])
    #data = dirty['Arb'].str.split(',',expand=True)

    lipid = dirty.iloc[::6]['Arb'].copy()
    #lipid = dirty.iloc[lambda x:x.index%6 == 0]['Arb'].copy()

    data = dirty['Arb'].str.split(',',expand=True)

    t=data[data.index %6 != 0]

    tissue1 = t[t[0]== 'Tissue 1']
    tissue1 ['lipid'] = lipid
    alldata.append(tissue1)
    tissue1.to_csv('test.csv')

tissue1 at the last step does look like what I want, but since it is really just part of another dataframe rather than a separate one (I think, anyway), I get the warnings, and when I go to append it nothing happens. What is this kind of code supposed to look like? Is there a faster way to do this for all 5 tissues at once?

You can simplify this a lot. The trick is to create another column holding the lipid name, forward-fill it, and then drop the original lipid rows, which are no longer needed. A simple pivot then gives the dataset you want. My sample data includes a lipid with a messy name that contains commas.

Here I take every sixth row as a lipid name, matching your condition. If the data are messier and some rows are missing, you could just as easily build the condition from something like .str.contains('lipid').

import pandas as pd

dirty = pd.read_csv('test.csv', sep='\n', header=None, names=['Arb'])

# Broadcast lipid name, drop that "header" row
dirty['lipid_name'] = dirty['Arb'].where(dirty.index%6 == 0).ffill()
dirty = dirty[dirty.index%6 != 0]

# Now we can split data properly
dirty = dirty.set_index('lipid_name')['Arb'].str.split(',', expand=True)

dirty.pivot(columns=0, values=1).rename_axis(None, axis=1)

                           tissue1 tissue2 tissue3 tissue4 tissue5
lipid_name                                                             
lipid11231,12312313,123123       1       6       3       2       5
lipid2                          24      15      12      14      10
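
If the blocks are not exactly six lines long, the lipid rows can be picked out by content instead of position. The following is only a sketch under stated assumptions: the `raw` string is a made-up variant of test.csv with one tissue row missing, data rows are assumed to look like `tissueN,<number>`, and the names `is_data`, `tidy` and `wide` are mine. It also reads the lines with plain Python rather than `sep='\n'`, which newer pandas releases may refuse.

```python
import io

import pandas as pd

# Hypothetical sample mirroring test.csv, with "tissue4" missing from the
# first block so the blocks are no longer exactly six lines long.
raw = """lipid11231,12312313,123123
tissue1,1
tissue2,6
tissue3,3
tissue5,5
lipid2
tissue1,24
tissue2,15
tissue3,12
tissue4,14
tissue5,10
"""

# One DataFrame row per non-empty line of text.
lines = [ln.strip() for ln in io.StringIO(raw) if ln.strip()]
dirty = pd.DataFrame({"Arb": lines})

# Assumption: a data row starts with "tissue<digits>,"; anything else is a lipid name.
is_data = dirty["Arb"].str.match(r"tissue\d+,")

# Keep the lipid name only on header rows, then forward-fill it downwards.
dirty["lipid_name"] = dirty["Arb"].where(~is_data).ffill()
dirty = dirty[is_data]

# Split the data rows only, so commas inside lipid names are never touched.
parts = dirty["Arb"].str.split(",", expand=True)
tidy = pd.DataFrame({
    "lipid_name": dirty["lipid_name"],
    "tissue": parts[0],
    "value": pd.to_numeric(parts[1]),
})

wide = tidy.pivot(index="lipid_name", columns="tissue", values="value")
wide = wide.rename_axis(None, axis=1)
print(wide)
```

The missing tissue shows up as NaN in the wide table instead of shifting every later row into the wrong lipid, which is what a fixed every-sixth-row rule would do.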

Sample data: test.csv

lipid11231,12312313,123123
tissue1,1
tissue2,6
tissue3,3
tissue4,2
tissue5,5
lipid2
tissue1,24
tissue2,15
tissue3,12
tissue4,14
tissue5,10
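
On the "nothing happens when I append" symptom in the question: `DataFrame.append` returned a new frame rather than modifying `alldata` in place, so calling it without keeping the result left `alldata` empty, and the method was removed in pandas 2.0. One possible alternative is to collect the pieces in a Python list and call `pd.concat` once. This is a rough sketch, not the only way to do it: `parse_lipid_text` is a hypothetical helper, the `sources` dict stands in for real files, and the "neg" values are invented for illustration.

```python
import pandas as pd


def parse_lipid_text(text):
    """Hypothetical helper: one raw text dump -> long-form DataFrame."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    arb = pd.Series(lines)
    # Every sixth line is a lipid name, as in the question.
    is_header = arb.index % 6 == 0
    lipid = arb.where(is_header).ffill()
    parts = arb[~is_header].str.split(",", expand=True)
    # Build a fresh frame instead of adding columns to a slice of another
    # one, which is what tends to raise SettingWithCopyWarning.
    return pd.DataFrame({
        "lipid": lipid[~is_header],
        "tissue": parts[0],
        "value": pd.to_numeric(parts[1]),
    }).reset_index(drop=True)


# Hypothetical in-memory stand-ins for several CSV files.
sources = {
    "pos": """lipid1
tissue1,1
tissue2,6
tissue3,3
tissue4,2
tissue5,5
lipid2
tissue1,24
tissue2,15
tissue3,12
tissue4,14
tissue5,10
""",
    "neg": """lipid1
tissue1,7
tissue2,8
tissue3,9
tissue4,4
tissue5,3
""",
}

frames = []
for name, text in sources.items():
    # assign() returns a new frame; list.append mutates the list in place.
    frames.append(parse_lipid_text(text).assign(source=name))

alldata = pd.concat(frames, ignore_index=True)

# Wide layout: one row per (source, lipid), one column per tissue.
wide = alldata.set_index(["source", "lipid", "tissue"])["value"].unstack("tissue")
print(wide)
```

With real files, the loop would iterate over `glob.glob(...)` and pass each file's contents to the helper; the concatenation step stays the same.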
