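Trying to merge into a dataframe but it keeps creating new columns
I am trying to open files and derive 2 columns (1 row each) from multiple spreadsheets, then merge them into a base spreadsheet. The base dataframe (from a spreadsheet of which I only need 3 columns) looks like this: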
Model | Roadmap | Family
a 08/12/17 ROW
b 08/14/17 MACRO
c 08/15/17 CONN
d 08/27/17 MACRO
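The dataframes from the multiple spreadsheets (the model name is the spreadsheet name, and each spreadsheet has several gate dates, which I derive into separate dataframes) have the following format: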
df1 (part1 - the dataframe derived from the spreadsheet with model a for gate 0 ):
Model | Gate 0
a 02/01/18
df1 (Dataframe derived from the spreadsheet with model a for gate1):
Model | Gate 1
a 03/01/18
df2 (part1):
Model | Gate 0
b 04/23/18
df2 (part2):
Model | Gate 1
b 05/23/18
The output it produces is:
Model | Roadmap | Family | Gate 0_x | Gate 1_x | gate 0_y | Gate 1_y
a 08/12/17 ROW 02/01/18 03/01/18
b 08/14/17 MACRO 04/23/18 05/23/18
c 08/15/17 CONN
d 08/27/17 MACRO
The output I want:
Model | Roadmap | Family | Gate 0 | Gate 1
a 08/12/17 ROW 02/01/18 03/01/18
b 08/14/17 MACRO 04/23/18 05/23/18
..
Here is the code I am using:
import glob
import pandas as pd
import re
import ntpath
extension = 'xlsx'
d='Final.xlsx'
c = 'Roadmap.xlsx'
dflist = []
z=[]
result = [i for i in glob.glob('*.{}'.format(extension))]
for b in result:
    if b==c:
        base_file = pd.read_excel(b, sheet_name='Antennas', header=7)
        ind1 = base_file.set_index('Model')
        ind1 = base_file[['Model', 'Roadmap', 'Family']]
        #print(ind1)
        ind1.to_excel('Final.xlsx')
file3 = pd.read_excel('Final.xlsx')
file3= file3.replace(r'[,\"\']','', regex=True).replace(r'\s*([^\s]+)\s*', r'\1', regex=True)
for a in result:
    if a == c:
        base_file = pd.read_excel(a, sheet_name='Antennas', header=7)
        ind1 = base_file.set_index('Model')
        ind1 = base_file[['Model', 'Roadmap', 'Family']]
        ind1.to_excel('Final.xlsx')
    elif a != d:
        gates = ['Gate 0 Complete','Gate 1 Complete']
        file1 = pd.read_excel('Final.xlsx')
        file1= file1.replace(r'[,\"\']','', regex=True).replace(r'\s*([^\s]+)\s*', r'\1', regex=True)
        #print(file1)
        file = pd.read_excel(a, sheet_name='Timeline')
        #print(file)
        models = pd.DataFrame([['','']], columns=['Model', gates])
        for g in gates:
            z = file.loc[file['Task'] == g, 'Complete'].iloc[0]
            v=ntpath.basename(a)
            v = v[5:-5]
            models = pd.DataFrame([[v,z]], columns =['Model',g])
            file1 = pd.merge(file1, models, how='left', on='Model')
        file3 = pd.merge(file3, file1, how='left', on=['Model','Roadmap','Family'])
file3.to_excel('new.xlsx')
file3 is the file I open as the base dataframe before the for loop. Let me know if anything is unclear.
Currently you are merging twice, but you actually need to merge the base with each individual df and then append everything together with pd.concat.
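To make the cause of the extra columns concrete, here is a minimal sketch (the frame names base, gate_a and gate_b are made up for illustration, not taken from the question). It assumes one-row frames per model that share a gate column name. Merging them in one at a time makes pandas rename the clashing column with its default _x/_y suffixes, while stacking the rows first and merging once keeps a single column.

```python
import pandas as pd

base = pd.DataFrame({"Model": ["a", "b"], "Family": ["ROW", "MACRO"]})

# Hypothetical one-row frames, one per model, both carrying a "Gate 0" column.
gate_a = pd.DataFrame({"Model": ["a"], "Gate 0": ["02/01/18"]})
gate_b = pd.DataFrame({"Model": ["b"], "Gate 0": ["04/23/18"]})

# Merging one model at a time: the second merge finds "Gate 0" on both
# sides, so pandas disambiguates with the default suffixes _x and _y.
step = base.merge(gate_a, how="left", on="Model")
twice = step.merge(gate_b, how="left", on="Model")
print(list(twice.columns))

# Stacking the per-model rows first and merging once keeps a single column.
stacked = pd.concat([gate_a, gate_b], ignore_index=True)
once = base.merge(stacked, how="left", on="Model")
print(list(once.columns))
```

The second form is the shape the question asks for: one Gate 0 column, filled per model.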
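Below, the examples posted above are recreated in the same structure as the Excel files to demonstrate the merge and append steps. You will notice that drop_duplicates is used, because the left merges render identical rows. Keep or remove this method on your actual data.
Data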
from io import StringIO
import pandas as pd
txt = '''
Model Roadmap Family
a some_date some
b some_date some
c some_date some
d some_date some
'''
base_df = pd.read_table(StringIO(txt), sep=r"\s+")
txt = '''
Model "Gate 0" "Gate 1"
a some_date some
'''
df1 = pd.read_table(StringIO(txt), sep=r"\s+")
txt = '''
Model "Gate 0" "Gate 1"
b some_date some
'''
df2 = pd.read_table(StringIO(txt), sep=r"\s+")
Merge and append (using a list comprehension)
finaldf = pd.concat([pd.merge(base_df, df, how='left', on='Model')
                     for df in [df1, df2]], ignore_index=True).drop_duplicates()
print(finaldf)
# Model Roadmap Family Gate 0 Gate 1
# 0 a some_date some some_date some
# 1 b some_date some NaN NaN
# 2 c some_date some NaN NaN
# 3 d some_date some NaN NaN
# 4 a some_date some NaN NaN
# 5 b some_date some some_date some
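The printed result still holds two rows each for models a and b, one populated and one all-NaN. If a single row per model is wanted, one possible follow-up (not part of the answer above, just a sketch on a hand-built stand-in frame) is to group on the base columns and take the first non-null value per column.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for a merged-and-appended frame like the one printed above.
finaldf = pd.DataFrame({
    "Model": ["a", "b", "c", "a", "b"],
    "Roadmap": ["some_date"] * 5,
    "Family": ["some"] * 5,
    "Gate 0": ["g0_a", np.nan, np.nan, np.nan, "g0_b"],
    "Gate 1": ["g1_a", np.nan, np.nan, np.nan, "g1_b"],
})

# GroupBy.first() returns the first non-null value per column in each group,
# so the NaN-only duplicate rows fold into the populated ones.
collapsed = finaldf.groupby(["Model", "Roadmap", "Family"], as_index=False).first()
print(collapsed)
```

Models with no gate data at all (c here) keep their row with missing gate values.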
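To integrate this into your current process, consider appending the individual model frames to a list that is concatenated and then merged once at the end. Build base_df as in the example posted above.
...
dfList = []
# inside the loop over the model workbooks:
for g in gates:
    z = file.loc[file['Task'] == g, 'Complete'].iloc[0]
    v = ntpath.basename(a)
    v = v[5:-5]
    mod = pd.DataFrame([[v,z]], columns =['Model',g])
    models = pd.merge(models, mod, how='left', on='Model')
dfList.append(models)
# after the loop over the workbooks has finished:
finaldf = pd.merge(base_df, pd.concat(dfList), how='left', on='Model')
finaldf.to_excel('Final_Dataset.xlsx')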
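The snippet above depends on Excel files, so here is a self-contained sketch of the same flow on in-memory data. The helper gate_row and the timelines dictionary are hypothetical stand-ins for reading each workbook's 'Timeline' sheet. The helper builds the one-row frame from a dict instead of repeated merges, which is a simplification rather than the answer's exact method.

```python
import pandas as pd

gates = ["Gate 0 Complete", "Gate 1 Complete"]

def gate_row(model, timeline, gates):
    """Return a one-row DataFrame: the model name plus one column per gate."""
    row = {"Model": model}
    for g in gates:
        # Same lookup as in the loop above: first 'Complete' value for task g.
        row[g] = timeline.loc[timeline["Task"] == g, "Complete"].iloc[0]
    return pd.DataFrame([row])

# Hypothetical stand-ins for the 'Timeline' sheets of two model workbooks.
timelines = {
    "a": pd.DataFrame({"Task": gates, "Complete": ["02/01/18", "03/01/18"]}),
    "b": pd.DataFrame({"Task": gates, "Complete": ["04/23/18", "05/23/18"]}),
}

base_df = pd.DataFrame({
    "Model": ["a", "b", "c", "d"],
    "Roadmap": ["08/12/17", "08/14/17", "08/15/17", "08/27/17"],
    "Family": ["ROW", "MACRO", "CONN", "MACRO"],
})

# Collect one row per model, stack them, then merge into the base once.
dfList = [gate_row(model, tl, gates) for model, tl in timelines.items()]
finaldf = pd.merge(base_df, pd.concat(dfList, ignore_index=True),
                   how="left", on="Model")
print(finaldf)
```

Because the base is merged only once, no suffixed columns appear and models without a workbook keep missing gate values.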
Figured out how to do it. Let me know if you spot any problems.
import glob
import pandas as pd
import re
import ntpath
extension = 'xlsx'
d='Final.xlsx'
c = 'Roadmap.xlsx'
dflist = []
z=[]
result = [i for i in glob.glob('*.{}'.format(extension))]
for a in result:
    if a == c:
        base_file = pd.read_excel(a, sheet_name='Antennas', header=7)
        ind1 = base_file.set_index('Model')
        ind1 = base_file[['Model', 'Roadmap', 'Family']]
        #print(ind1)
        ind1.to_excel('Final.xlsx')
    elif a != d:
        v=ntpath.basename(a)
        v = v[5:-5]
        gates = ['Gate 0 Complete','Gate 1 Complete', 'Gate 2 Complete']
        file1 = pd.read_excel('Final.xlsx')
        file1= file1.replace(r'[,\"\']','', regex=True).replace(r'\s*([^\s]+)\s*', r'\1', regex=True)
        #print(file1)
        file = pd.read_excel(a, sheet_name='Timeline')
        #print(file)
        models = pd.DataFrame([[v]], columns=['Model'])
        #print(models)
        for g in gates:
            z = file.loc[file['Task'] == g, 'Complete'].iloc[0]
            #print(z)
            #v = re.findall(r'Scrum(\w+)', a)
            #print(v)
            #df1=pd.DataFrame([[v,z]], columns = ['Model',g])
            mod = pd.DataFrame([[v,z]], columns =['Model',g])
            models=pd.merge(models, mod, how='left', on='Model')
            #print(models)
        dflist.append(models)
        #print(dflist)
file1 = pd.merge(file1,pd.concat(dflist), how='left',on='Model')
file1.to_excel('new.xlsx')
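The slice v[5:-5] silently assumes a five-character prefix and a five-character '.xlsx' suffix. As a hedged alternative, the sketch below pulls the model name out explicitly. The function name model_from_filename is hypothetical, and the 'Scrum' prefix is only inferred from the commented-out regex in the code above, so adjust it to the real file names.

```python
import ntpath
import re

def model_from_filename(path, prefix="Scrum"):
    """Return the part of the file stem after `prefix`, or None if absent."""
    # ntpath handles both forward and backward slashes, as in the code above.
    stem = ntpath.splitext(ntpath.basename(path))[0]
    match = re.fullmatch(re.escape(prefix) + r"(.+)", stem)
    return match.group(1) if match else None

print(model_from_filename("Scruma.xlsx"))
print(model_from_filename(r"C:\data\ScrumXYZ.xlsx"))
print(model_from_filename("Roadmap.xlsx"))
```

Returning None for non-matching names also gives a natural way to skip Roadmap.xlsx and Final.xlsx inside the loop.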
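I assume your raw data looks like this: df_base, plus df1, df2, etc. (one df loaded per sheet).
Then my approach is to perform the following steps, in order:
1. Vertically concatenate the per-sheet dfs into a single DataFrame named df_sheets.
2. Merge df_base with df_sheets to get the desired output.
Based on this, my approach is: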
import pandas as pd
# STEP 0.
cv = ['a','b','c','d']
nr = 4
# STEP 0 - Part 1. Load Base DF
cv = cv[:nr]
df_base = pd.DataFrame(zip(*[cv,['some_date']*nr,['some']*nr]),
                       columns=['Model','Roadmap','Family'])
# STEP 0 - Part 2. Load Sheets DataFrames
df_sheets = []
for alph in cv:
    df_sheet = pd.DataFrame(zip(*[[alph]*nr,['some_date_'+alph]*nr,['some_'+alph]*nr]),
                            columns=['Model','Gate0','Gate1'])
    df_sheets.append(df_sheet)
print('Base DF:\n{}' .format(df_base))
# STEP 1. Vertically concatenate all sheets DataFrames together
df_sheets = pd.concat(df_sheets, axis=0).reset_index(drop=True)
print('\nDataFrames for all sheets (vertically concatenated into single DataFrame):\n{}'
      .format(df_sheets))
# STEP 2. base INNER JOIN sheets USING ('Model')
ndf = df_base.merge(df_sheets, on='Model', how='inner')
print('\nOutput DataFrame:\n{}' .format(ndf))
The output is:
Base DF:
Model Roadmap Family
0 a some_date some
1 b some_date some
2 c some_date some
3 d some_date some
DataFrames for all sheets (vertically concatenated into single DataFrame):
Model Gate0 Gate1
0 a some_date_a some_a
1 a some_date_a some_a
2 a some_date_a some_a
3 a some_date_a some_a
4 b some_date_b some_b
5 b some_date_b some_b
6 b some_date_b some_b
7 b some_date_b some_b
8 c some_date_c some_c
9 c some_date_c some_c
10 c some_date_c some_c
11 c some_date_c some_c
12 d some_date_d some_d
13 d some_date_d some_d
14 d some_date_d some_d
15 d some_date_d some_d
Output DataFrame:
Model Roadmap Family Gate0 Gate1
0 a some_date some some_date_a some_a
1 a some_date some some_date_a some_a
2 a some_date some some_date_a some_a
3 a some_date some some_date_a some_a
4 b some_date some some_date_b some_b
5 b some_date some some_date_b some_b
6 b some_date some some_date_b some_b
7 b some_date some some_date_b some_b
8 c some_date some some_date_c some_c
9 c some_date some some_date_c some_c
10 c some_date some some_date_c some_c
11 c some_date some some_date_c some_c
12 d some_date some some_date_d some_d
13 d some_date some some_date_d some_d
14 d some_date some some_date_d some_d
15 d some_date some some_date_d some_d
Is this what you are after?
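Note that the output above repeats every model four times, because each sample sheet frame holds four identical rows, and an inner join would drop any model that has no sheet. A possible variation, sketched here on smaller made-up data, de-duplicates the sheet rows first and uses a left join so every base row survives. The validate argument is an optional guard that raises if the key is not unique on either side.

```python
import pandas as pd

df_base = pd.DataFrame({
    "Model": ["a", "b", "c", "d"],
    "Roadmap": ["some_date"] * 4,
    "Family": ["some"] * 4,
})

# Made-up sheets frame: repeated rows for a and b, nothing for c and d.
df_sheets = pd.DataFrame({
    "Model": ["a", "a", "b", "b"],
    "Gate0": ["some_date_a"] * 2 + ["some_date_b"] * 2,
    "Gate1": ["some_a"] * 2 + ["some_b"] * 2,
})

# Keep one row per model, then left-join so c and d are retained with NaN.
unique_sheets = df_sheets.drop_duplicates(subset="Model")
ndf = df_base.merge(unique_sheets, on="Model", how="left",
                    validate="one_to_one")
print(ndf)

# For comparison: the inner join on the raw sheets frame.
inner = df_base.merge(df_sheets, on="Model", how="inner")
print(len(inner))
```

With real data, check which duplicate should win before calling drop_duplicates, since it keeps the first row it sees for each model.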