Loading columns into multiple DataFrames based on prefix
I want to load columns with specific prefixes into separate DataFrames.
The columns I want have specific prefixes, e.g.:
A_1 A_2 B_1 B_2 C_1 C_2
1 0 0 0 0 0
1 0 0 1 1 1
0 1 1 1 1 0
I have a list of all the prefixes:
prefixes = ["A", "B", "C"]
I want to do something like this:
for prefix in prefixes:
    f"df_{prefix}" = pd.read_csv("my_file.csv",
                                 usecols=[f"{prefix}_1",
                                          f"{prefix}_2",
                                          f"{prefix}_3"])
So each DataFrame has the prefix in its name, but I'm not quite sure of the best way to do this or the syntax required.
You could try a different approach: load the full CSV once, then create three DataFrames from it by dropping the columns that don't match each prefix.
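import pandas as pd

x = pd.read_csv("my_file.csv")
# For each prefix, list the columns that do NOT belong to it
notA = [c for c in x.columns if not c.startswith('A_')]
notB = [c for c in x.columns if not c.startswith('B_')]
notC = [c for c in x.columns if not c.startswith('C_')]
# Drop those columns so only the matching ones remain
a = x.drop(columns=notA)
b = x.drop(columns=notB)
c = x.drop(columns=notC)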
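The same idea can be written as a loop so the prefixes are not hard-coded. This is only a minimal sketch: it assumes the column names follow the prefix_N pattern from the question, the in-memory CSV stands in for my_file.csv, and the dictionary name frames is made up for illustration.

```python
import io
import pandas as pd

# In-memory stand-in for my_file.csv (assumption: same layout as the question's sample)
csv_text = (
    "A_1,A_2,B_1,B_2,C_1,C_2\n"
    "1,0,0,0,0,0\n"
    "1,0,0,1,1,1\n"
    "0,1,1,1,1,0\n"
)
x = pd.read_csv(io.StringIO(csv_text))

prefixes = ["A", "B", "C"]
frames = {}  # hypothetical name: maps each prefix to its own DataFrame
for prefix in prefixes:
    # Collect the columns that do not start with e.g. "A_" and drop them
    unwanted = [c for c in x.columns if not c.startswith(prefix + "_")]
    frames[prefix] = x.drop(columns=unwanted)

print(frames["B"])
```

Storing the results in a dictionary keyed by prefix avoids having to create variable names dynamically, which is what the f-string assignment in the question was trying to do.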
Suppose you have a big DataFrame like this:
In [1341]: df
Out[1341]:
A_1 A_2 B_1 B_2 C_1 C_2
0 1 0 0 0 0 0
1 1 0 0 1 1 1
2 0 1 1 1 1 0
In [1374]: master_list = ['A','B','C']
Create an empty dictionary to hold the subsets of the DataFrame:
In [1377]: dct = {}
Loop through the master list and store the matching column names in the dictionary above:
In [1378]: for i in master_list:
      ...:     dct['{}_list'.format(i)] = [e for e in df.columns if e.startswith('{}'.format(i))]
Now dct has the following keys and values:
A_list : ['A_1', 'A_2']
B_list : ['B_1', 'B_2']
C_list : ['C_1', 'C_2']
Then subset your DataFrame like this:
In [1381]: for k in dct:
      ...:     dct[k] = df[dct[k]]
Now the dictionary holds the corresponding DataFrame subset for every key:
In [1384]: for k in dct:
      ...:     print(dct[k])
In [1347]: df_A
Out[1347]:
A_1 A_2
0 1 0
1 1 0
2 0 1
In [1350]: df_B
Out[1350]:
B_1 B_2
0 0 0
1 0 1
2 1 1
In [1355]: df_C
Out[1355]:
C_1 C_2
0 0 0
1 1 1
2 1 0
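The two loops above can also be collapsed into a single dictionary comprehension. The following is a rough sketch rather than the answer's own code: the sample DataFrame is rebuilt by hand, and the df_A-style keys are just a naming choice that mirrors the names the question asked for.

```python
import pandas as pd

# Sample data rebuilt from the DataFrame shown above
df = pd.DataFrame({
    "A_1": [1, 1, 0], "A_2": [0, 0, 1],
    "B_1": [0, 0, 1], "B_2": [0, 1, 1],
    "C_1": [0, 1, 1], "C_2": [0, 1, 0],
})
master_list = ["A", "B", "C"]

# One entry per prefix; the "df_<prefix>" key format is a hypothetical choice
dct = {
    f"df_{p}": df[[c for c in df.columns if c.startswith(p + "_")]]
    for p in master_list
}

print(dct["df_A"])
```

Each subset is then available as dct["df_A"], dct["df_B"] and dct["df_C"].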
First filter out the columns that do not match, using startswith with boolean indexing and loc (loc is needed because the filter is applied to columns, not rows):
print (df)
A_1 A_2 B_1 B_2 C_1 D_2
0 1 0 0 0 0 0
1 1 0 0 1 1 1
2 0 1 1 1 1 0
prefixes = ["A", "B", "C"]
df = df.loc[:, df.columns.str.startswith(tuple(prefixes))]
print (df)
A_1 A_2 B_1 B_2 C_1
0 1 0 0 0 0
1 1 0 0 1 1
2 0 1 1 1 1
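To make the filtering step easier to follow, the boolean mask can be built and inspected before it is passed to loc. This sketch assumes a pandas version in which str.startswith accepts a tuple of prefixes, as the line above already relies on; the variable names mask and filtered are introduced here only for illustration.

```python
import pandas as pd

# Sample data rebuilt from the printed DataFrame above (note the D_2 column)
df = pd.DataFrame({
    "A_1": [1, 1, 0], "A_2": [0, 0, 1],
    "B_1": [0, 0, 1], "B_2": [0, 1, 1],
    "C_1": [0, 1, 1], "D_2": [0, 1, 0],
})
prefixes = ["A", "B", "C"]

# One boolean per column: True where the name starts with any of the prefixes
mask = df.columns.str.startswith(tuple(prefixes))
print(mask)

# loc[:, mask] keeps all rows and only the columns where the mask is True
filtered = df.loc[:, mask]
print(filtered.columns.tolist())
```

Only D_2 gets a False entry in the mask, so it is the only column removed.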
Then create a MultiIndex from the column names with split, and build a dictionary of DataFrames with groupby:
df.columns = df.columns.str.split('_', expand=True)
print (df)
A B C
1 2 1 2 1
0 1 0 0 0 0
1 1 0 0 1 1
2 0 1 1 1 1
# Group on the transposed frame, since groupby(..., axis=1) is deprecated in recent pandas versions
d = {k: v.T[k] for k, v in df.T.groupby(level=0)}
print (d['A'])
1 2
0 1 0
1 1 0
2 0 1
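Or use a lambda function with split (applied to the flat column names such as A_1, before they are converted to a MultiIndex):
# Group on the transposed frame, since groupby(..., axis=1) is deprecated in recent pandas versions
d = {k: v.T for k, v in df.T.groupby(lambda x: x.split('_')[0])}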
print (d['A'])
A_1 A_2
0 1 0
1 1 0
2 0 1
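If you would rather stay close to the loop in the question and read only the needed columns each time, read_csv accepts a callable for usecols that is evaluated against every column name. The following is a minimal sketch under a few assumptions: the in-memory CSV stands in for my_file.csv, and the dictionary name frames is hypothetical.

```python
import io
import pandas as pd

# In-memory stand-in for my_file.csv (assumption: same layout as the question's sample)
csv_text = (
    "A_1,A_2,B_1,B_2,C_1,C_2\n"
    "1,0,0,0,0,0\n"
    "1,0,0,1,1,1\n"
    "0,1,1,1,1,0\n"
)
prefixes = ["A", "B", "C"]

frames = {}  # hypothetical name: prefix -> DataFrame
for prefix in prefixes:
    # The callable keeps a column only if its name starts with "<prefix>_"
    frames[prefix] = pd.read_csv(
        io.StringIO(csv_text),
        usecols=lambda name: name.startswith(prefix + "_"),
    )

print(frames["C"])
```

This parses the file once per prefix, so it is likely to pay off only when the file is wide and each prefix covers a small share of the columns; otherwise loading once and splitting in memory, as in the answers above, is usually simpler.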