More efficient way of appending to a DataFrame
I was running some tests and found that the code below is inefficient. It loops over a range of dates and, if self.query matches in df, appends the matching rows. Pretty straightforward. But I have heard many opinions that appending like this is inefficient and even resource hungry.
My parquet files have 4 columns and millions of rows: query, phone_count, desktop_count and total. I drop 2 columns, which leaves me with index, query and total, and then the magic happens.
This code is working "fine", but now I'm looking for opinions from experienced users and possibly some hints.
Is there a more efficient way of doing the same thing? Tuples maybe?
Thank you, guys!
for filename in os.listdir(directory):
    if filename.endswith(".parquet"):
        df = pd.read_parquet(directory).drop(["phone_count","desktop_count"], axis=1)
        df.set_index("query", inplace=True)
        if self.lowercase == "on":
            df.index = df.index.str.lower()
        else:
            pass
        if self.sensitive == "on":
            self.datafr = self.datafr.append(df.filter(regex=re.compile(self.query), axis=0))
        else:
            self.datafr = self.datafr.append(df.filter(regex=re.compile(self.query, re.IGNORECASE), axis=0))
self.datafr = self.datafr.groupby(['query']).sum().sort_values(by='total', ascending=False)
You are repeating a few things on every pass through the loop:
- Repeated DataFrame.append is slower than a single pd.concat([frame1, frame2, ...]), because each DataFrame.append call copies all the rows accumulated so far into a new frame.
- list.append is a lot faster than DataFrame.append.
Try this:
# Compile the pattern once, outside the loop. The flag follows the
# question's self.sensitive switch: case-sensitive only when it is "on".
option = 0 if self.sensitive == "on" else re.IGNORECASE
pattern = re.compile(self.query, option)
subframes = []
for filename in os.listdir(directory):
    if filename.endswith(".parquet"):
        df = pd.read_parquet(directory).drop(["phone_count", "desktop_count"], axis=1)
        df.set_index("query", inplace=True)
        # Not needed for matching when the pattern is IGNORECASE, but it
        # still merges case variants in a later groupby on the index.
        if self.lowercase == "on":
            df.index = df.index.str.lower()
        # Multiple list.append calls. regex= must be passed by keyword:
        # the first positional argument of DataFrame.filter is items.
        subframes.append(df.filter(regex=pattern, axis=0))
# But a single pd.concat
self.datafr = pd.concat([self.datafr] + subframes)
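For anyone who wants to try the pattern without parquet files on disk, here is a minimal, self-contained sketch of the same idea: collect the filtered pieces in a list, concatenate once, then aggregate as in the question. The function name `filter_and_total`, its `case_sensitive` parameter and the two in-memory sample frames are made up for illustration and are not from the original code. In a real run each frame would presumably come from `pd.read_parquet` on one file. Note also that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the list plus `pd.concat` approach is the only one of the two that still runs.

```python
import re

import pandas as pd


def filter_and_total(frames, query, case_sensitive=False):
    """Keep rows whose 'query' label matches `query`, then sum 'total' per label.

    Hypothetical helper for illustration only.
    """
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = re.compile(query, flags)  # compiled once, outside the loop

    subframes = []
    for df in frames:
        indexed = df.drop(columns=["phone_count", "desktop_count"]).set_index("query")
        # filter(regex=...) keeps index labels where the pattern is found
        subframes.append(indexed.filter(regex=pattern, axis=0))

    if not subframes:
        return pd.DataFrame(columns=["total"])

    combined = pd.concat(subframes)  # one concatenation for all pieces
    totals = combined.groupby(level="query").sum()
    return totals.sort_values(by="total", ascending=False)


# Two small in-memory frames standing in for the parquet files (sample data).
frame_a = pd.DataFrame({
    "query": ["Red Shoes", "blue hat", "red socks"],
    "phone_count": [1, 2, 3],
    "desktop_count": [4, 5, 6],
    "total": [5, 7, 9],
})
frame_b = pd.DataFrame({
    "query": ["red socks", "green coat"],
    "phone_count": [10, 1],
    "desktop_count": [1, 1],
    "total": [11, 2],
})

result = filter_and_total([frame_a, frame_b], "red")
print(result)
```

The list only holds references to the small filtered frames, so growing it is cheap. The expensive copy happens once, inside `pd.concat`, instead of once per file.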