Import multiple Excel files into Python pandas and concatenate them into one dataframe
I would like to read several Excel files from a directory into pandas and concatenate them into one big dataframe. I have not been able to figure it out, though. I need some help with the for loop and building a concatenated dataframe. Here is what I have so far:
import sys
import csv
import glob
import pandas as pd
# get data file names
path =r'C:\DRO\DCL_rawdata_files\excelfiles'
filenames = glob.glob(path + "/*.xlsx")
dfs = []
for df in dfs:
    xl_file = pd.ExcelFile(filenames)
    df=xl_file.parse('Sheet1')
    dfs.concat(df, ignore_index=True)
As mentioned in the comments, one error you are making is that you are looping over an empty list.
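For reference, a minimal corrected version of the loop from the question might look like the sketch below. The function name `read_and_concat` and its `pattern` and `reader` parameters are made up for illustration; only `glob.glob`, `pd.read_excel` and `pd.concat` come from the libraries. It loops over the file names rather than over the empty result list, and it calls `pd.concat` once at the end, since a Python list has no `concat` method.

```python
import glob
import os

import pandas as pd


def read_and_concat(directory, pattern="*.xlsx", reader=pd.read_excel, **reader_kwargs):
    """Read every file in `directory` matching `pattern` and stack the rows."""
    # Loop over the file names, not over the (still empty) list of results.
    filenames = sorted(glob.glob(os.path.join(directory, pattern)))
    frames = [reader(name, **reader_kwargs) for name in filenames]
    if not frames:
        # pd.concat raises ValueError on an empty list, so fail with a clearer message.
        raise FileNotFoundError(f"no files matching {pattern!r} in {directory!r}")
    # concat is a top-level pandas function that takes the whole list at once.
    return pd.concat(frames, ignore_index=True)
```

With the path from the question this would be called as `read_and_concat(r'C:\DRO\DCL_rawdata_files\excelfiles', sheet_name='Sheet1')`; the extra keyword is passed straight through to `pd.read_excel`.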
Here is how I would do it, using an example of 5 identical Excel files that are appended one after another.
(1) Imports:
import os
import pandas as pd
(2) List files:
path = os.getcwd()
files = os.listdir(path)
files
Output:
['.DS_Store',
'.ipynb_checkpoints',
'.localized',
'Screen Shot 2013-12-28 at 7.15.45 PM.png',
'test1 2.xls',
'test1 3.xls',
'test1 4.xls',
'test1 5.xls',
'test1.xls',
'Untitled0.ipynb',
'Werewolf Modelling',
'~$Random Numbers.xlsx']
(3) Pick out 'xls' files:
files_xls = [f for f in files if f[-3:] == 'xls']
files_xls
Output:
['test1 2.xls', 'test1 3.xls', 'test1 4.xls', 'test1 5.xls', 'test1.xls']
(4) Initialize an empty dataframe:
df = pd.DataFrame()
(5) Loop over the list of files and append each one to the dataframe:
for f in files_xls:
    data = pd.read_excel(f, sheet_name='Sheet1')
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, data])
(6) Enjoy your new dataframe. :-)
df
Output:
Result Sample
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9
9 j 10
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9
9 j 10
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9
9 j 10
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9
9 j 10
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9
9 j 10
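Step (3) above picks files by their last three characters. A slightly more defensive filter could look like the sketch below; the helper name `select_excel_files` and the decision to accept both `.xls` and `.xlsx` are my own assumptions, not part of the answer. It compares real file extensions and skips names starting with `~$`, which in the listing above looks like the temporary lock file Excel leaves next to an open workbook.

```python
import os

# Extensions treated as Excel workbooks in this sketch (an assumption).
EXCEL_SUFFIXES = (".xls", ".xlsx")


def select_excel_files(names, suffixes=EXCEL_SUFFIXES):
    """Return the names that look like Excel workbooks, skipping '~$' lock files."""
    selected = []
    for name in names:
        if name.startswith("~$"):
            # Temporary file created while a workbook is open; not real data.
            continue
        extension = os.path.splitext(name)[1].lower()
        if extension in suffixes:
            selected.append(name)
    return selected


# The directory listing from step (2).
listing = ['.DS_Store', '.ipynb_checkpoints', '.localized',
           'Screen Shot 2013-12-28 at 7.15.45 PM.png',
           'test1 2.xls', 'test1 3.xls', 'test1 4.xls', 'test1 5.xls',
           'test1.xls', 'Untitled0.ipynb', 'Werewolf Modelling',
           '~$Random Numbers.xlsx']
print(select_excel_files(listing))
```

For this listing the result is the same five `test1*.xls` files as in step (3).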
This works with Python 2.x.
Run it from the directory where the Excel files are.
See http://pbpython.com/excel-file-combine.html
import numpy as np
import pandas as pd
import glob
all_data = pd.DataFrame()
for f in glob.glob("*.xlsx"):
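    df = pd.read_excel(f)
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    all_data = pd.concat([all_data, df], ignore_index=True)
# now save the data frame
# the context manager saves the file on exit (ExcelWriter.save was removed in pandas 2.0)
with pd.ExcelWriter('output.xlsx') as writer:
    all_data.to_excel(writer, sheet_name='sheet1')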
This can be done in this way:
import pandas as pd
import glob
all_data = pd.DataFrame()
for f in glob.glob("/path/to/directory/*.xlsx"):
    df = pd.read_excel(f)
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    all_data = pd.concat([all_data, df], ignore_index=True)
all_data.to_csv("new_combined_file.csv")
#shortcut
import pandas as pd
from glob import glob
dfs=[]
for f in glob("data/*.xlsx"):
    dfs.append(pd.read_excel(f))
df=pd.concat(dfs, ignore_index=True)
You can use a list comprehension inside concat:
import os
import pandas as pd
path = '/path/to/directory/'
filenames = [file for file in os.listdir(path) if file.endswith('.xlsx')]
df = pd.concat([pd.read_excel(path + file) for file in filenames], ignore_index=True)
With ignore_index = True, the index of df will be labeled 0, …, n - 1.
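To see the effect of `ignore_index` without any files on disk, here is a small sketch with two in-memory frames standing in for the Excel sheets (the frame names and contents are made up):

```python
import pandas as pd

# Two tiny frames standing in for sheets read with pd.read_excel.
first = pd.DataFrame({"Result": ["a", "b"], "Sample": [1, 2]})
second = pd.DataFrame({"Result": ["a", "b"], "Sample": [1, 2]})

# Default: each frame keeps its own row labels, so labels repeat.
kept = pd.concat([first, second])

# ignore_index=True: rows are relabeled 0, ..., n - 1.
renumbered = pd.concat([first, second], ignore_index=True)

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Repeated labels are not an error, but they make label-based lookups such as `df.loc[0]` return several rows, which is usually not what you want after combining files.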
I have multiple Excel files, and every file has a common id (every Excel sheet has an id column). I tried the following approaches, but I am not getting the correct data frame based on the id.
import pandas as pd
import os
path=os.getcwd()
path
files=os.listdir(path)
fil_xlsx=[f for f in files if f[-4:]=='xlsx']
df=pd.DataFrame()
for f in fil_xlsx:
    data=pd.read_excel(f,sheet_name='Sheet1')
    # DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
    df=pd.concat([df,data])
I am getting an empty data frame this way.
df=pd.DataFrame()
for f in fil_xlsx:
    data=pd.read_excel(f,'Sheet1')
    all1=pd.concat([data,df],ignore_index=True,join="inner")
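A likely reason for the empty result is that `join="inner"` keeps only the columns shared by every input, and the empty `df` has no columns at all; `all1` is also overwritten on every pass instead of accumulated. If the goal is to line the files up side by side on the shared id column rather than stack their rows, a merge is probably closer to what is wanted. The sketch below assumes the column is literally named `id` and that the other column names differ between files; `merge_on_id` is a made-up helper name and the two small frames stand in for the sheets.

```python
from functools import reduce

import pandas as pd


def merge_on_id(frames, key="id", how="outer"):
    """Merge a list of DataFrames pairwise on a shared key column."""
    return reduce(lambda left, right: pd.merge(left, right, on=key, how=how), frames)


# Small in-memory frames standing in for pd.read_excel(f, sheet_name="Sheet1").
sales = pd.DataFrame({"id": [1, 2, 3], "sales": [10, 20, 30]})
costs = pd.DataFrame({"id": [2, 3, 4], "cost": [5, 6, 7]})

# how="inner" keeps only ids present in every file; the default "outer" keeps all ids.
merged = merge_on_id([sales, costs], how="inner")
print(merged)
```

With real files the list would be built as `[pd.read_excel(f, sheet_name='Sheet1') for f in fil_xlsx]`.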
There is an even neater way to do that.
# import libraries
import glob
import pandas as pd
# get the absolute path of all Excel files
allExcelFiles = glob.glob("/path/to/Excel/files/*.xlsx")
# read all Excel files at once
df = pd.concat(pd.read_excel(excelFile) for excelFile in allExcelFiles)
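One thing the one-liner loses is which file each row came from. If that matters, `pd.concat` also accepts a dict, and the keys become an extra index level. The sketch below uses in-memory frames and a made-up helper name `concat_with_source`; the level names `source_file` and `row` are my own choice.

```python
import pandas as pd


def concat_with_source(frames_by_name):
    """Stack frames from a {name: DataFrame} dict, keeping the name as a column."""
    # The dict keys become the outer level of a MultiIndex named "source_file".
    combined = pd.concat(frames_by_name, names=["source_file", "row"])
    # Move that level out of the index into an ordinary column.
    return combined.reset_index(level="source_file")


# In practice: {f: pd.read_excel(f) for f in allExcelFiles}
frames = {
    "january.xlsx": pd.DataFrame({"amount": [1, 2]}),
    "february.xlsx": pd.DataFrame({"amount": [3]}),
}
tagged = concat_with_source(frames)
print(tagged)
```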
import pandas as pd
import os
os.chdir('...')
#read first file for column names
fdf= pd.read_excel("first_file.xlsx", sheet_name="sheet_name")
#create counter to segregate the different file's data
fdf["counter"]=1
nm= list(fdf)
c=2
#read first 1000 files
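for i in os.listdir():
    # the first file is already loaded into fdf, so do not read it again
    if i == "first_file.xlsx":
        continue
    print(c)
    if c<1001:
        if "xlsx" in i:
            df= pd.read_excel(i, sheet_name="sheet_name")
            df["counter"]=c
            # only keep files whose headers match the first file
            if list(df)==nm:
                # DataFrame.append was removed in pandas 2.0; use pd.concat instead
                fdf=pd.concat([fdf,df])
                c+=1
            else:
                print("headers name not match")
        else:
            print("not xlsx")
fdf=fdf.reset_index(drop=True)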
#relax
import pandas as pd
import os
files = [file for file in os.listdir('./Salesfolder')]
all_months_data = pd.DataFrame()
for file in files:
    df = pd.read_csv("./Salesfolder/"+file)
    # accumulate into the same frame on every pass
    all_months_data = pd.concat([all_months_data, df])
all_months_data.to_csv("all_data.csv",index=False)
You can read all the files from a folder (Salesfolder in my case; substitute your local path). The loop reads each file and concatenates it onto an initially empty data frame. I have also exported the combined data for all months into a single CSV file. Note that the code uses pd.read_csv; for .xls or .xlsx files, use pd.read_excel instead.