More efficient way of appending dataframe

I was running some tests and found that this piece of code is inefficient. It loops over a range of dates and, if self.query matches in df, appends the matching rows; pretty straightforward. But I have heard many opinions that appending like this is inefficient and even resource-hungry.
My parquet files have 4 columns and millions of rows: query, phone_count, desktop_count and total. I drop 2 columns, which leaves index, query and total, and then the magic happens.

This code is working "fine", but I'm looking for opinions from experienced users and possibly some hints.

Is there a way of doing the same thing more efficiently? Tuples, maybe?

Thank you, guys!

    for filename in os.listdir(directory):
        if filename.endswith(".parquet"):
            df = pd.read_parquet(directory).drop(["phone_count","desktop_count"], axis=1)
            df.set_index("query", inplace=True)

            if self.lowercase == "on":
                df.index = df.index.str.lower()
            else:
                pass
            if self.sensitive == "on":                            
                self.datafr = self.datafr.append(df.filter(regex=re.compile(self.query), axis=0))
            else:            
                self.datafr = self.datafr.append(df.filter(regex=re.compile(self.query, re.IGNORECASE), axis=0))            


    self.datafr = self.datafr.groupby(['query']).sum().sort_values(by='total', ascending=False)

You are repeating a few things with each loop:

  • The regex pattern does not need recompiling every time
  • Repeated DataFrame.append is slower than pd.concat([frame1, frame2, ...])
  • list.append is a lot faster than DataFrame.append

Try this:

    import os
    import re

    import pandas as pd

    # Compile the pattern once; case sensitivity follows self.sensitive,
    # as in the question's code
    flags = 0 if self.sensitive == "on" else re.IGNORECASE
    pattern = re.compile(self.query, flags)
    subframes = []

    for filename in os.listdir(directory):
        if filename.endswith(".parquet"):
            # Read this file, not the whole directory
            path = os.path.join(directory, filename)
            df = pd.read_parquet(path).drop(["phone_count", "desktop_count"], axis=1)
            df.set_index("query", inplace=True)

            # Lowercasing is independent of the regex flag: it changes
            # the index values that the later groupby aggregates on
            if self.lowercase == "on":
                df.index = df.index.str.lower()

            # Multiple list.append
            subframes.append(df.filter(regex=pattern, axis=0))

    # But a single pd.concat
    self.datafr = pd.concat([self.datafr] + subframes)
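
To try the collect-then-concat pattern without any parquet files, here is a minimal self-contained sketch. The sample frames, the `collect_matches` helper and its parameter names are made up for illustration and are not part of the original code. It also avoids DataFrame.append entirely, which matters because that method was removed in pandas 2.0.

```python
import re

import pandas as pd

# Hypothetical stand-ins for the per-file frames read from parquet.
frames = [
    pd.DataFrame({"query": ["Apple pie", "banana"], "total": [3, 5]}),
    pd.DataFrame({"query": ["apple pie", "cherry"], "total": [4, 1]}),
    pd.DataFrame({"query": ["APPLE juice", "banana"], "total": [2, 7]}),
]


def collect_matches(frames, query, case_sensitive=False, lowercase=True):
    """Filter each frame by regex on its index, then concatenate once."""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = re.compile(query, flags)  # compiled once, outside the loop

    subframes = []
    for df in frames:
        df = df.set_index("query")
        if lowercase:
            df.index = df.index.str.lower()
        # Cheap list.append inside the loop
        subframes.append(df.filter(regex=pattern, axis=0))

    # One pd.concat at the end instead of one DataFrame copy per iteration
    combined = pd.concat(subframes)
    # Group on the index (the query strings) and sum the totals
    return combined.groupby(level=0).sum().sort_values(by="total", ascending=False)


result = collect_matches(frames, "apple")
print(result)
```

Each DataFrame.append (or a pd.concat inside the loop) copies all rows accumulated so far, so the total work grows roughly quadratically with the number of files. Collecting the filtered pieces in a list and concatenating once copies each row a single time.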
