
Merging multiple tables - pd.concat: append vs yield

Assume we have quite a few .xls or .xlsx files stored in a directory. There are two ways of feeding them into pd.concat to get one big table: yield vs. append.

import pandas as pd
from pathlib import Path


def append_method(path):
    container = []
    for f in path.rglob("*.xls*"):
        df = pd.read_excel(f)
        container.append(df)
    return container


def yield_method(path):
    for f in path.rglob("*.xls*"):
        df = pd.read_excel(f)
        yield df

        
path = Path("...")
        
## append
data = append_method(path)
tables = pd.concat(data)

## yield 
data = yield_method(path)
tables = pd.concat(data)
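pd.concat accepts any iterable of frames, so both the list and the generator work. A minimal check with synthetic in-memory frames (the names frames, from_list, and from_gen are illustrative, not from the original post):

```python
import pandas as pd

# Stand-ins for the per-file reads: three tiny frames.
frames = [pd.DataFrame({"x": [i]}) for i in range(3)]

# Feed pd.concat a list ...
from_list = pd.concat(frames, ignore_index=True)

# ... or a generator expression; either iterable is accepted.
from_gen = pd.concat((df for df in frames), ignore_index=True)

# Both paths produce the same table.
assert from_list.equals(from_gen)
```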

Judging by the %%timeit magic, both are pretty much the same?

Tested on 100 xls/xlsx files:

append: 11.8 s ± 667 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
 yield: 12.7 s ± 1.03 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

If there's a difference between these two, which one should be used?

The pandas docs note that:

"It is worth noting that concat() (and therefore append()) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension."

https://pandas.pydata.org/docs/user_guide/merging.html
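The anti-pattern the docs warn against is calling pd.concat inside the loop, which copies the accumulated data on every iteration; building the full list first and concatenating once avoids that. A minimal sketch with synthetic frames (no Excel files involved):

```python
import pandas as pd

# Stand-ins for the per-file reads.
frames = [pd.DataFrame({"x": [i, i + 1]}) for i in range(3)]

# Anti-pattern: repeated concat copies everything accumulated so far
# on each iteration, so total work grows quadratically with the data.
result = pd.DataFrame()
for df in frames:
    result = pd.concat([result, df])

# Recommended: collect first (here via a list comprehension, as the
# docs suggest), then concatenate exactly once.
result2 = pd.concat([df for df in frames], ignore_index=True)
```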

Based on your timing results, it looks like read_excel() is the slowest part.

UPDATE: I would use the yield method.

  • yield_method() returns a generator, which is exhausted once pd.concat() consumes it. After that it holds no references to the data frames, so it takes no space or resources, and it communicates the intention that it has served its purpose.
  • append_method() returns a list of data frames, which keeps every frame alive after the call to pd.concat() until the list itself is released and the garbage collector runs.
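The difference in the two bullets can be demonstrated directly: after pd.concat() consumes a generator it is empty, while a list still holds references to all the frames. A small sketch with synthetic frames (the names gen, lst, and leftover are illustrative):

```python
import pandas as pd

def gen():
    # Stand-in for yield_method(): yields three tiny frames.
    for i in range(3):
        yield pd.DataFrame({"x": [i]})

g = gen()
big = pd.concat(g)    # consumes the generator entirely
leftover = list(g)    # the generator is now exhausted: []

# Stand-in for append_method(): the list keeps all frames alive
# even after the concatenation.
lst = [pd.DataFrame({"x": [i]}) for i in range(3)]
big2 = pd.concat(lst)
# len(lst) is still 3; the frames stay referenced until lst is dropped.
```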
