Assume we have quite a few .xls or .xlsx files stored in a directory, and there are two ways of feeding them into pd.concat to get one big table: yield vs. append.
import pandas as pd
from pathlib import Path

def append_method(path):
    container = []
    for f in path.rglob("*.xls*"):
        df = pd.read_excel(f)
        container.append(df)
    return container

def yield_method(path):
    for f in path.rglob("*.xls*"):
        df = pd.read_excel(f)
        yield df
path = Path("...")
## append
data = append_method(path)
tables = pd.concat(data)
## yield
data = yield_method(path)
tables = pd.concat(data)
Judging by the %%timeit magic, both are pretty much the same (tested on 100 xls/xlsx files):
append: 11.8 s ± 667 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
yield: 12.7 s ± 1.03 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
If there's a difference between these two, which one should be used?
The pandas docs note that:

It is worth noting that concat() (and therefore append()) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.

https://pandas.pydata.org/docs/user_guide/merging.html
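The docs' warning can be sketched as follows. The tiny DataFrames here are hypothetical stand-ins for read_excel() output; the point is that concatenating inside the loop copies all accumulated data on every iteration, while a single concat over a list copies it once:

```python
import pandas as pd

# Stand-ins for frames read from disk (hypothetical data).
frames = [pd.DataFrame({"a": [i]}) for i in range(5)]

# Anti-pattern: concat inside the loop re-copies everything each time (quadratic work).
result = pd.DataFrame()
for df in frames:
    result = pd.concat([result, df])

# Preferred: collect the frames first, then concatenate once.
result = pd.concat(frames, ignore_index=True)
```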
Based on your timing results, it looks like read_excel()
is the slowest part.
UPDATE: I would use the yield method.

yield_method() returns a generator, which is exhausted once pd.concat() consumes it. It takes no further space or resources, and its emptiness communicates that it has served its purpose. append_method() returns a list of DataFrames, which continues to consume memory after the call to pd.concat() until the garbage collector reclaims it.
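The difference can be sketched with a toy generator (hypothetical data in place of read_excel()): after pd.concat() consumes it, the generator yields nothing more, whereas a list would still hold every frame.

```python
import pandas as pd

def yield_frames():
    # Stand-in for reading Excel files one at a time.
    for i in range(3):
        yield pd.DataFrame({"x": [i]})

gen = yield_frames()
combined = pd.concat(gen, ignore_index=True)  # consumes the generator
leftover = list(gen)                          # the generator is now exhausted
```

Here `combined` holds the three rows, while `leftover` is an empty list, since the generator kept no reference to the frames it produced.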