Slow data analysis using pandas
I am using a mixture of lists and pandas DataFrames to clean and merge CSV data. The following snippet from my code runs painfully slowly. It generates a CSV with about 3MM lines of data.
UniqueAPI = Uniquify(API)
dummydata = []
#bridge the gaps in the data with zeros
for i in range(0,len(UniqueAPI)):
    DateList = []
    DaysList = []
    PDaysList = []
    OperatorList = []
    OGOnumList = []
    CountyList = []
    MunicipalityList = []
    LatitudeList = []
    LongitudeList = []
    UnconventionalList = []
    ConfigurationList = []
    HomeUseList = []
    ReportingPeriodList = []
    RecordSourceList = []
    for j in range(0,len(API)):
        if UniqueAPI[i] == API[j]:
            #print(str(ProdDate[j]))
            DateList.append(ProdDate[j])
            DaysList = Days[j]
            OperatorList = Operator[j]
            OGOnumList = OGOnum[j]
            CountyList = County[j]
            MunicipalityList = Municipality[j]
            LatitudeList = Latitude[j]
            LongitudeList = Longitude[j]
            UnconventionalList = Unconventional[j]
            ConfigurationList = Configuration[j]
            HomeUseList = HomeUse[j]
            ReportingPeriodList = ReportingPeriod[j]
            RecordSourceList = RecordSource[j]
    df = pd.DataFrame(DateList, columns = ['Date'])
    df['Date'] = pd.to_datetime(df['Date'])
    minDate = df.min()
    maxDate = df.max()
    Years = int((maxDate - minDate)/np.timedelta64(1,'Y'))
    Months = int(round((maxDate - minDate)/np.timedelta64(1,'M')))
    finalMonths = Months - Years*12 + 1
    Y,x = str(minDate).split("-",1)
    x,Y = str(Y).split(" ",1)
    for k in range(0,Years + 1):
        if k == Years:
            ender = int(finalMonths + 1)
        else:
            ender = int(13)
        full_df = pd.DataFrame()
        if k > 0:
            del full_df
            full_df = pd.DataFrame()
        full_df['API'] = UniqueAPI[i]
        full_df['Production Month'] = [pd.to_datetime(str(x)+'/1/'+str(int(Y)+k)) for x in range(1,ender)]
        full_df['Days'] = DaysList
        full_df['Operator'] = OperatorList
        full_df['OGO_NUM'] = OGOnumList
        full_df['County'] = CountyList
        full_df['Municipality'] = MunicipalityList
        full_df['Latitude'] = LatitudeList
        full_df['Longitude'] = LongitudeList
        full_df['Unconventional'] = UnconventionalList
        full_df['Well_Configuration'] = ConfigurationList
        full_df['Home_Use'] = HomeUseList
        full_df['Reporting_Period'] = ReportingPeriodList
        full_df['Record_Source'] = RecordSourceList
        dummydata.append(full_df)
full_df = pd.concat(dummydata)
result = full_df.merge(dataClean,how='left').fillna(0)
print(result[:100])
result.to_csv(ResultPath, index_label=False, index=False)
This snippet has been running for hours, and the output should have ~3MM lines. There has to be a faster way in pandas to accomplish the goal, which I describe here:
This all takes far longer than I would expect. I would have thought that finding the min and max date for each unique item, then interpolating monthly between them and filling the months that have no data with 0, would be about three lines in pandas. Any options you think I should explore, or any snippets of code that could help, would be much appreciated!
You could start by cleaning up the code a bit. These lines don't seem to have any effect or functional purpose, since full_df was just created and is already an empty DataFrame:
if k > 0:
    del full_df
    full_df = pd.DataFrame()
Then, when you actually build full_df, it's better to do it all at once rather than one column at a time. Try something like this:
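months = [pd.to_datetime(str(x)+'/1/'+str(int(Y)+k)) for x in range(1,ender)]
# pd.concat only accepts pandas objects, so wrap each column in a Series;
# scalar values are repeated over an index as long as the month list
full_df = pd.concat([pd.Series(UniqueAPI[i], index=range(len(months))),
                     pd.Series(months),
                     pd.Series(DaysList, index=range(len(months))),
                     etc...
                    ],
                    axis=1)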
Then you would need to add the column labels, which you can also do all at once (in the same order as the columns in the concat() call).
full_df.columns = ['API', 'Production Month', 'Days', etc.]
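Beyond that cleanup, the nested loop rescans the whole API list once for every unique well, so it may be worth dropping the loops altogether. Here is a minimal sketch of the "min date to max date, fill the missing months with 0" idea from the question. The function name fill_monthly_gaps, the Gas column, and the sample data are made up for illustration, and it assumes your cleaned data has an API column and a Production Month column, as the merge in the question suggests.

```python
import pandas as pd


def fill_monthly_gaps(data, key="API", date_col="Production Month"):
    """Return one row per key per month between that key's first and last
    month. Months with no record get 0 in every other column.

    Hypothetical helper: the name and defaults are assumptions, not part
    of the original code.
    """
    data = data.copy()
    # Snap every date to the first day of its month so the merge keys line up.
    data[date_col] = (
        pd.to_datetime(data[date_col]).dt.to_period("M").dt.to_timestamp()
    )

    # A single groupby pass gives each key's first and last month.
    span = data.groupby(key)[date_col].agg(["min", "max"])

    # Build the complete month grid for each key ("MS" = month start).
    grid = pd.concat(
        [
            pd.DataFrame({key: api, date_col: pd.date_range(lo, hi, freq="MS")})
            for api, lo, hi in span.itertuples(name=None)
        ],
        ignore_index=True,
    )

    # Left-merge the real records onto the grid; missing months become 0.
    return grid.merge(data, on=[key, date_col], how="left").fillna(0)


# Made-up sample: well A has a gap in February and March 2015.
sample = pd.DataFrame(
    {
        "API": ["A", "A", "B"],
        "Production Month": ["2015-01-15", "2015-04-01", "2016-12-01"],
        "Gas": [10.0, 40.0, 7.0],
    }
)
filled = fill_monthly_gaps(sample)
print(filled)
```

Note that this fills every non-key column with 0 in the gap months, including text columns such as Operator or County. If you want those per-well values carried onto the filled rows, as the loop in the question does, one option is to merge them in separately on API after the fill.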