A more efficient way to append to a DataFrame

Question

I was running some tests and found that this piece of code is inefficient. It loops over a range of dates and, if self.query matches a row in df, appends that row; pretty straightforward. But I have heard many opinions that appending like this is inefficient and even resource-hungry.
My parquet files have 4 columns (query, phone_count, desktop_count, total) and millions of rows. I drop 2 columns, which leaves index, query and total, and then the magic happens.

This code is working "fine", but now I'm looking for opinions from experienced users and hopefully some hints.

Is there a way of doing the same thing more efficiently? Tuples, maybe?

Thank you, guys!

    for filename in os.listdir(directory):
        if filename.endswith(".parquet"):
            path = os.path.join(directory, filename)
            df = pd.read_parquet(path).drop(["phone_count", "desktop_count"], axis=1)
            df.set_index("query", inplace=True)

            if self.lowercase == "on":
                df.index = df.index.str.lower()
            else:
                pass
            if self.sensitive == "on":
                self.datafr = self.datafr.append(df.filter(regex=re.compile(self.query), axis=0))
            else:
                self.datafr = self.datafr.append(df.filter(regex=re.compile(self.query, re.IGNORECASE), axis=0))

    self.datafr = self.datafr.groupby(['query']).sum().sort_values(by='total', ascending=False)

Answer 1

You are repeating a few things with each loop:

- The regex pattern does not need recompiling every time
- Repeated DataFrame.append is slower than pd.concat([frame1, frame2, ...]), because each DataFrame.append call copies the whole frame
- list.append is a lot faster than DataFrame.append

Try this:

import os
import re

import pandas as pd

# Compile the pattern once, outside the loop. As in the question,
# matching is case-sensitive only when self.sensitive == "on".
flags = 0 if self.sensitive == "on" else re.IGNORECASE
pattern = re.compile(self.query, flags)
subframes = []

for filename in os.listdir(directory):
    if filename.endswith(".parquet"):
        # Read the individual file, not the whole directory
        path = os.path.join(directory, filename)
        df = pd.read_parquet(path).drop(["phone_count", "desktop_count"], axis=1)
        df.set_index("query", inplace=True)

        if self.lowercase == "on":
            df.index = df.index.str.lower()

        # Multiple list.append calls (regex must be passed by keyword)
        subframes.append(df.filter(regex=pattern, axis=0))

# But a single pd.concat
self.datafr = pd.concat([self.datafr] + subframes)
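
To try the collect-then-concat pattern without any parquet files, here is a minimal, self-contained sketch. The function name filter_and_total, its case_sensitive parameter and the two small in-memory frames are made up for illustration and are not part of the original code. It may also be worth knowing that DataFrame.append was removed in pandas 2.0, so on current versions pd.concat is the only option of the two.

```python
import re

import pandas as pd


def filter_and_total(frames, query, case_sensitive=False):
    """Filter each frame's index by regex, concatenate once, then sum per query."""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = re.compile(query, flags)  # compiled once, reused for every frame

    # Building a plain list is cheap; no DataFrame is copied at this stage
    matches = [df.filter(regex=pattern, axis=0) for df in frames]

    combined = pd.concat(matches)  # one concatenation for all frames
    return (
        combined.groupby(level="query")
        .sum()
        .sort_values(by="total", ascending=False)
    )


# Hypothetical stand-ins for the contents of two parquet files
day1 = pd.DataFrame(
    {"query": ["red shoes", "Blue Shoes", "hat"], "total": [5, 3, 9]}
).set_index("query")
day2 = pd.DataFrame(
    {"query": ["red shoes", "socks"], "total": [2, 7]}
).set_index("query")

result = filter_and_total([day1, day2], "shoes")
print(result)
```

Note that pd.concat raises a ValueError when given an empty list, so real code would probably want a guard for the case where no files are found.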
