
Slow data analysis using pandas

I am using a mixture of lists and pandas DataFrames to clean and merge CSV data. The following snippet from my code runs disgustingly slowly; it generates a CSV with about 3MM lines of data.

UniqueAPI = Uniquify(API)
dummydata = []
#bridge the gaps in the data with zeros
for i in range(0,len(UniqueAPI)):
    DateList = []
    DaysList = []
    PDaysList = []
    OperatorList = []
    OGOnumList = []
    CountyList = []
    MunicipalityList = []
    LatitudeList = []
    LongitudeList = []
    UnconventionalList = []
    ConfigurationList = []
    HomeUseList = []
    ReportingPeriodList = []
    RecordSourceList = []

    for j in range(0,len(API)):
        if UniqueAPI[i] == API[j]:
            #print(str(ProdDate[j]))
            DateList.append(ProdDate[j])
            DaysList = Days[j]
            OperatorList = Operator[j]
            OGOnumList = OGOnum[j]
            CountyList = County[j]
            MunicipalityList = Municipality[j]
            LatitudeList = Latitude[j]
            LongitudeList = Longitude[j]
            UnconventionalList = Unconventional[j]
            ConfigurationList = Configuration[j]
            HomeUseList = HomeUse[j]
            ReportingPeriodList = ReportingPeriod[j]
            RecordSourceList = RecordSource[j]

    df = pd.DataFrame(DateList, columns = ['Date'])
    df['Date'] = pd.to_datetime(df['Date'])
    minDate = df.min()
    maxDate = df.max()

    Years = int((maxDate - minDate)/np.timedelta64(1,'Y'))
    Months = int(round((maxDate - minDate)/np.timedelta64(1,'M')))
    finalMonths = Months - Years*12 + 1
    Y,x = str(minDate).split("-",1)
    x,Y = str(Y).split("   ",1)
    for k in range(0,Years + 1):

        if k == Years:
            ender = int(finalMonths + 1)
        else:
            ender = int(13)

        full_df = pd.DataFrame()
        if k > 0:
            del full_df
            full_df = pd.DataFrame()

        full_df['API'] = UniqueAPI[i]
        full_df['Production Month'] = [pd.to_datetime(str(x)+'/1/'+str(int(Y)+k)) for x in range(1,ender)]
        full_df['Days'] = DaysList
        full_df['Operator'] = OperatorList
        full_df['OGO_NUM'] = OGOnumList
        full_df['County'] = CountyList
        full_df['Municipality'] = MunicipalityList
        full_df['Latitude'] = LatitudeList
        full_df['Longitude'] = LongitudeList
        full_df['Unconventional'] = UnconventionalList
        full_df['Well_Configuration'] = ConfigurationList
        full_df['Home_Use'] = HomeUseList
        full_df['Reporting_Period'] = ReportingPeriodList
        full_df['Record_Source'] = RecordSourceList
        dummydata.append(full_df)

full_df = pd.concat(dummydata)
result = full_df.merge(dataClean,how='left').fillna(0)

print(result[:100])

result.to_csv(ResultPath, index_label=False, index=False)

This snippet of code has been running for hours. The output should have ~3MM lines, so there has to be a faster way using pandas to accomplish the goal, which I will describe:

  • for each unique API, I find all occurrences in the main list of APIs
  • using that information, I build a list of dates
  • I find a min and max date for each list corresponding to an API
  • I then build an empty pandas DataFrame that has every month between the two dates for the associated API
  • I then append this DataFrame to a list "dummydata" and loop to the next API
  • taking this dummy data list, I then concatenate it into a DataFrame
  • this DataFrame is then merged with another DataFrame of cleaned data
  • the end result is a CSV that has a 0 value for the dates that did not exist, but should, between the max and min dates for each corresponding API in the original unclean list
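The gap-filling described above can be done without the nested loops. A minimal sketch, assuming the raw rows live in a single DataFrame (here called `raw`, with illustrative column names and placeholder data): group by API to get each API's date span, build one monthly grid per API, then left-merge the real data onto the grid and fill the gaps with 0.

```python
import pandas as pd

# Hypothetical raw data: one row per (API, production month) actually reported.
raw = pd.DataFrame({
    'API': ['A', 'A', 'B'],
    'Date': pd.to_datetime(['2020-01-01', '2020-04-01', '2020-02-01']),
    'Volume': [10, 20, 30],
})

# One monthly date range per API, spanning that API's own min..max dates.
spans = raw.groupby('API')['Date'].agg(['min', 'max'])
grid = pd.concat(
    pd.DataFrame({'API': api,
                  'Date': pd.date_range(row['min'], row['max'], freq='MS')})
    for api, row in spans.iterrows()
)

# Left-merge the real data onto the full grid; months with no data become 0.
result = grid.merge(raw, on=['API', 'Date'], how='left').fillna(0)
```

This replaces the O(len(UniqueAPI) × len(API)) inner loop with a single groupby, and the per-year DataFrame construction with one `date_range` per API.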

This all takes way longer than I would expect. I would have thought that finding the min/max date for each unique item and interpolating monthly between them, filling months that don't have data with 0, would be like a three-line thing in pandas. Any options you think I should explore, or any snippets of code that could help me out, would be much appreciated!

You could start by cleaning up the code a bit. These lines don't have any effect or functional purpose, since full_df was just created and is already an empty DataFrame:

if k > 0:
    del full_df
    full_df = pd.DataFrame()

Then, when you actually build up your full_df, it's better to do it all at once rather than one column at a time. So try something like this:

full_df = pd.concat([UniqueAPI[i],
                     [pd.to_datetime(str(x)+'/1/'+str(int(Y)+k)) for x in range(1,ender)],
                     DaysList,
                     etc...
                     ],
                    axis=1)
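Note that `pd.concat` along `axis=1` needs index-bearing objects such as Series; plain Python lists won't concatenate directly. A runnable sketch of the idea above, with placeholder data standing in for the per-API lists:

```python
import pandas as pd

# Placeholder values standing in for the question's per-API lists.
months = pd.date_range('2020-01-01', periods=3, freq='MS')

# Wrap each column in a Series; concat aligns them side by side on the index.
full_df = pd.concat([pd.Series(['A'] * len(months)),
                     pd.Series(months),
                     pd.Series([31, 29, 31])],
                    axis=1)
full_df.columns = ['API', 'Production Month', 'Days']
```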

Then you would need to add the column labels, which you could also do all at once (in the same order as your lists in the concat() call).

full_df.columns = ['API', 'Production Month', 'Days', etc.]
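Alternatively, building the frame from a dict sets the labels in the same step, and scalars (like a single API value) broadcast across all rows automatically. A sketch with placeholder values:

```python
import pandas as pd

# Placeholder values standing in for the per-API data in the question.
months = pd.date_range('2020-01-01', periods=3, freq='MS')

# Dict keys become column labels, so no separate .columns assignment is needed.
full_df = pd.DataFrame({
    'API': 'A',                  # scalar broadcasts to every row
    'Production Month': months,
    'Days': 30,
})
```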
