
Loading columns into multiple DataFrames based on prefix

I want to load columns with specific prefixes into separate DataFrames.

The columns I want have specific prefixes, e.g.:

   A_1 A_2 B_1 B_2 C_1 C_2
   1   0   0   0   0   0
   1   0   0   1   1   1
   0   1   1   1   1   0

I have a list of all the prefixes:

prefixes = ["A", "B", "C"]

I want to do something like this:

for prefix in prefixes:
    f"df_{prefix}" = pd.read_csv("my_file.csv",
                                 usecols=[f"{prefix}_1",
                                          f"{prefix}_2",
                                          f"{prefix}_3"])

So each DataFrame has the prefix in its name, but I'm not sure of the best way to do this or of the syntax required.

You could try a different approach. Load the full CSV once, then create three DataFrames from it by dropping the columns that don't match each prefix.

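import pandas as pd

x = pd.read_csv("my_file.csv")
# Columns that do NOT start with each prefix
notA = [c for c in x.columns if not c.startswith('A')]
notB = [c for c in x.columns if not c.startswith('B')]
notC = [c for c in x.columns if not c.startswith('C')]
# Drop them so that only the matching columns remain
a = x.drop(columns=notA)
b = x.drop(columns=notB)
c = x.drop(columns=notC)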
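
The same drop-based idea extends to any number of prefixes by looping and storing the results in a dictionary. This is a minimal sketch, not part of the original answer: the helper name split_by_prefix, the sep parameter, and the inline CSV_TEXT (standing in for my_file.csv) are assumptions made for illustration.

```python
import io

import pandas as pd

# Inline sample data standing in for "my_file.csv" (assumption for the demo).
CSV_TEXT = """A_1,A_2,B_1,B_2,C_1,C_2
1,0,0,0,0,0
1,0,0,1,1,1
0,1,1,1,1,0
"""


def split_by_prefix(frame, prefixes, sep="_"):
    """Return {prefix: sub-DataFrame} by dropping the non-matching columns."""
    result = {}
    for prefix in prefixes:
        # Match "A_" rather than "A" so that a column such as "AB_1"
        # is not picked up by the prefix "A".
        unwanted = [c for c in frame.columns if not c.startswith(prefix + sep)]
        result[prefix] = frame.drop(columns=unwanted)
    return result


x = pd.read_csv(io.StringIO(CSV_TEXT))
frames = split_by_prefix(x, ["A", "B", "C"])
print(frames["B"])
```

A dictionary keyed by prefix sidesteps the need to create variable names dynamically, which is what the f"df_{prefix}" = ... line in the question was attempting.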

Considering you have a big dataframe like this:

In [1341]: df
Out[1341]: 
   A_1  A_2  B_1  B_2  C_1  C_2
0    1    0    0    0    0    0
1    1    0    0    1    1    1
2    0    1    1    1    1    0

Keep a master list of prefixes:

In [1374]: master_list = ['A','B','C']

Create an empty dictionary to hold the subsets of the dataframe:

In [1377]: dct = {}

Loop through the master list and store the matching column names in the dictionary:

In [1378]: for i in master_list:
      ...:     dct['{}_list'.format(i)] = [e for e in df.columns if e.startswith('{}'.format(i))]

Now dct has the following keys and values:

A_list : ['A_1', 'A_2']
B_list : ['B_1', 'B_2']
C_list : ['C_1', 'C_2']

Then subset the dataframe for each key:

In [1381]: for k in dct:
      ...:     dct[k] = df[dct[k]]

Now the dictionary holds the corresponding subset of the dataframe under each key:

In [1384]: for k in dct:
      ...:     print(dct[k])

In [1347]: df_A
Out[1347]: 
   A_1  A_2
0    1    0
1    1    0
2    0    1

In [1350]: df_B
Out[1350]: 
   B_1  B_2
0    0    0
1    0    1
2    1    1

In [1355]: df_C
Out[1355]: 
   C_1  C_2
0    0    0
1    1    1
2    1    0
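
The two loops above can also be folded into a single dictionary comprehension. The sketch below assumes keys of the form df_A, df_B, df_C (a naming choice made here to mirror the output shown, not something the answer prescribes) and rebuilds the sample dataframe inline so that it runs on its own.

```python
import pandas as pd

# Sample data copied from the dataframe shown in the answer.
df = pd.DataFrame({
    "A_1": [1, 1, 0], "A_2": [0, 0, 1],
    "B_1": [0, 0, 1], "B_2": [0, 1, 1],
    "C_1": [0, 1, 1], "C_2": [0, 1, 0],
})
master_list = ["A", "B", "C"]

# One entry per prefix; the "df_<prefix>" key format is an assumption.
dct = {
    f"df_{p}": df.loc[:, df.columns.str.startswith(f"{p}_")]
    for p in master_list
}

for name, subset in dct.items():
    print(name)
    print(subset)
```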

First, filter out the columns that don't match any prefix, using startswith with boolean indexing and loc:

print (df)
   A_1  A_2  B_1  B_2  C_1  D_2
0    1    0    0    0    0    0
1    1    0    0    1    1    1
2    0    1    1    1    1    0

prefixes = ["A", "B", "C"]
df = df.loc[:, df.columns.str.startswith(tuple(prefixes))]
print (df)
   A_1  A_2  B_1  B_2  C_1
0    1    0    0    0    0
1    1    0    0    1    1
2    0    1    1    1    1
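
As an alternative sketch of the same filtering step, DataFrame.filter with a regular expression built from the prefix list gives a similar result. Note that this is a different technique from the startswith call above, and the pattern here also requires the underscore after the prefix, which is slightly stricter. The variable names pattern and filtered are illustrative.

```python
import re

import pandas as pd

# Sample data copied from the printed dataframe above (note the D_2 column).
df = pd.DataFrame({
    "A_1": [1, 1, 0], "A_2": [0, 0, 1],
    "B_1": [0, 0, 1], "B_2": [0, 1, 1],
    "C_1": [0, 1, 1], "D_2": [0, 1, 0],
})
prefixes = ["A", "B", "C"]

# Builds "^(?:A|B|C)_"; re.escape guards prefixes containing regex characters.
pattern = "^(?:" + "|".join(re.escape(p) for p in prefixes) + ")_"

# filter(regex=...) keeps the columns whose label matches via re.search.
filtered = df.filter(regex=pattern)
print(filtered.columns.tolist())
```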

Then create a MultiIndex with split, and use groupby to build a dictionary of DataFrames:

df.columns = df.columns.str.split('_', expand=True)
print (df)

   A     B     C
   1  2  1  2  1
0  1  0  0  0  0
1  1  0  0  1  1
2  0  1  1  1  1

d = {k: v[k] for k, v in df.groupby(level=0, axis=1)}
print (d['A'])
   1  2
0  1  0
1  1  0
2  0  1
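
Once the columns are a MultiIndex, the same dictionary can also be built without groupby by indexing the dataframe with each top-level label. This is a hedged sketch with the sample data rebuilt inline. Note that the second-level labels are the strings '1' and '2', not integers.

```python
import pandas as pd

# Sample data matching the filtered dataframe above.
df = pd.DataFrame({
    "A_1": [1, 1, 0], "A_2": [0, 0, 1],
    "B_1": [0, 0, 1], "B_2": [0, 1, 1],
    "C_1": [0, 1, 1],
})

# Split "A_1" into ("A", "1") so the columns become a two-level MultiIndex.
df.columns = df.columns.str.split("_", expand=True)

# Selecting a top-level label returns the sub-DataFrame with that level dropped.
d = {k: df[k] for k in df.columns.get_level_values(0).unique()}
print(d["A"])
```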

Or use a lambda function with split on the original column names (before converting them to a MultiIndex):

d = {k: v for k, v in df.groupby(lambda x: x.split('_')[0], axis=1)}
print (d['A'])
   A_1  A_2
0    1    0
1    1    0
2    0    1
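
Recent pandas versions deprecate the axis=1 argument of groupby, so the column-wise grouping above may emit a warning or fail depending on the installed version. A version-independent sketch is to group the column names in plain Python and then select each group. The variable names groups and d are illustrative, and the sample data is rebuilt inline.

```python
import pandas as pd

# Sample data with the original, unsplit column names.
df = pd.DataFrame({
    "A_1": [1, 1, 0], "A_2": [0, 0, 1],
    "B_1": [0, 0, 1], "B_2": [0, 1, 1],
    "C_1": [0, 1, 1], "C_2": [0, 1, 0],
})

# Collect the column names under the part before the first underscore.
groups = {}
for col in df.columns:
    groups.setdefault(col.split("_")[0], []).append(col)

# Select each group of columns; the original column names are kept.
d = {prefix: df[cols] for prefix, cols in groups.items()}
print(d["A"])
```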
