Unable to reuse a pandas generator object

Question

I have the following setup (running a performance benchmark):

import timeit

import pandas as pd

def read_sql_query(query, chunk_size, cnxn):
    try:
        # Read the query result in chunks of chunk_size rows.
        df = pd.read_sql_query(query, cnxn, index_col=['product_key'], chunksize=chunk_size)
        return df
    except Exception as e:
        print(e)

def return_chunks_in_df(df, start_date, end_date):
    try:
        sub_df = pd.DataFrame()
        for chunks in df:
            # Keep only the rows strictly between start_date and end_date.
            in_range = (chunks['trans_date'] > start_date) & (chunks['trans_date'] < end_date)
            sub_df = pd.concat([sub_df, chunks.loc[in_range]], ignore_index=True)
        print(sub_df.info())
        return sub_df
    except Exception as e:
        print(e)

query = r"select * from  sales_rollup where  product_key in (select product_key from temp limit 10000)"

start_time = timeit.default_timer()
df = read_sql_query(query, 100000, cnxn)
print(df)
print('time to chunk:' + str(timeit.default_timer() - start_time))

#scenario 1
start_time = timeit.default_timer()
sub_df1 = return_chunks_in_df(df, '2015-01-01', '2016-01-01')
print('scenario1:' + str(timeit.default_timer() - start_time))

#scenario 2    
start_time = timeit.default_timer()
sub_df2 = return_chunks_in_df(df, '2016-01-01', '2016-12-31')
print('scenario2:' + str(timeit.default_timer() - start_time))

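The problem I am facing is that in scenario 2 the DataFrame always comes back with 0 rows, even though there is data in the filtered date range. I tried looping over df, but the following loop never runs: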

for chunks in df:
    print(chunks.info())

I only get a result set for scenario 2 if I re-create df, as shown below, right before running it:

df = read_sql_query(query, 100000, cnxn)

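The core problem is that whichever scenario runs first always returns values and the second one does not. Does the df object expire after the first execution? Any help or pointers are highly appreciated.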

Answer 1

A generator is "used up" (exhausted) after its first run:

def gen(n):
   for i in range(n):
       yield i

In [11]: g = gen(3)

In [12]: list(g)
Out[12]: [0, 1, 2]

In [13]: list(g)
Out[13]: []
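
One way to see why the second pass is empty: a generator is its own iterator, so calling iter() on it returns the same, already finished object rather than a fresh one. A minimal sketch reusing the gen function above (the other variable names are illustrative):

```python
def gen(n):
    for i in range(n):
        yield i

g = gen(3)

# A generator is its own iterator: iter() hands back the same object.
same_object = iter(g) is g

first_pass = list(g)    # consumes every value
second_pass = list(g)   # nothing left to yield

# Once exhausted, next() with a default returns the default instead of raising.
leftover = next(g, None)

print(same_object, first_pass, second_pass, leftover)
```

A for loop calls iter() on its argument, which is why the second loop over df in the question starts from an iterator that has already finished.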

To reuse the chunks, you can refactor so that each chunk is passed to both scenarios:

def concat_chunk(acc, chunk, start_date, end_date):
    return pd.concat([acc, chunk.loc[(chunk['trans_date'] > start_date) & (chunk['trans_date'] < end_date)]], ignore_index=True)

sub_df1 = pd.DataFrame()
sub_df2 = pd.DataFrame()
for chunk in df:
    sub_df1 = concat_chunk(sub_df1, chunk, '2015-01-01', '2016-01-01')
    sub_df2 = concat_chunk(sub_df2, chunk, '2016-01-01', '2016-12-31')
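
A variant of the single-pass idea, as a sketch: collect the filtered pieces in lists and call pd.concat once per scenario, which avoids re-copying the accumulated frame on every chunk. The helper name filter_chunk and the sample data are assumptions for illustration, and an in-memory iterator stands in for the chunked query result.

```python
import pandas as pd

def filter_chunk(chunk, start_date, end_date):
    # Keep rows strictly inside the open interval (start_date, end_date).
    mask = (chunk['trans_date'] > start_date) & (chunk['trans_date'] < end_date)
    return chunk.loc[mask]

# Stand-in for the chunk iterator returned by read_sql_query(..., chunksize=...).
sample = pd.DataFrame({
    'product_key': [1, 2, 3, 4],
    'trans_date': ['2015-03-01', '2015-11-15', '2016-02-01', '2016-07-04'],
})
chunks = iter([sample.iloc[:2], sample.iloc[2:]])

parts1, parts2 = [], []
for chunk in chunks:  # single pass over the one-shot iterator
    parts1.append(filter_chunk(chunk, '2015-01-01', '2016-01-01'))
    parts2.append(filter_chunk(chunk, '2016-01-01', '2016-12-31'))

# One concat per scenario instead of one per chunk.
sub_df1 = pd.concat(parts1, ignore_index=True)
sub_df2 = pd.concat(parts2, ignore_index=True)
```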

Note: splitting the work up this way will throw off your timings...

You may also want to move the where logic into the SQL:

query = r"""select * from sales_rollup
            where product_key in (select product_key from temp limit 10000) 
            and '2015-01-01' < trans_date
            and trans_date < '2016-01-01'"""

That way, you may not need chunks at all!
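
If the date bounds vary between runs, they can be passed as query parameters instead of being pasted into the SQL string. This is a sketch under assumptions: an in-memory SQLite database stands in for the real connection, the table contents are made up, the helper name read_range is hypothetical, and the ? placeholder is SQLite's style (other drivers use different placeholders).

```python
import sqlite3

import pandas as pd

# In-memory stand-in for the real database (contents are made up).
cnxn = sqlite3.connect(':memory:')
cnxn.execute('create table sales_rollup (product_key integer, trans_date text)')
cnxn.executemany(
    'insert into sales_rollup values (?, ?)',
    [(1, '2015-03-01'), (2, '2015-11-15'), (3, '2016-02-01'), (4, '2016-07-04')],
)

query = """select * from sales_rollup
           where ? < trans_date
           and trans_date < ?"""

def read_range(start_date, end_date):
    # Each call runs its own query, so nothing is shared between scenarios.
    return pd.read_sql_query(query, cnxn, index_col=['product_key'],
                             params=(start_date, end_date))

sub_df1 = read_range('2015-01-01', '2016-01-01')
sub_df2 = read_range('2016-01-01', '2016-12-31')
cnxn.close()
```

Because each scenario gets its own DataFrame, the timings for the two scenarios also stay independent of each other.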

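Generally, the way to "reuse a generator" is simply to turn it into a list... but that often defeats the purpose (building the result up piecemeal):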

chunks = list(df)  # Note chunks is probably a more descriptive name...
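
When the full list would not fit in memory, another option is an object that builds a fresh generator each time it is iterated, at the cost of redoing the work on every pass. A sketch: the class name ReIterable is hypothetical, and gen stands in for the chunked query.

```python
def gen(n):
    for i in range(n):
        yield i

class ReIterable:
    """Wraps a zero-argument factory; each iteration starts a new generator."""

    def __init__(self, factory):
        self._factory = factory

    def __iter__(self):
        # Called at the start of every for loop, so each loop gets fresh values.
        return self._factory()

chunks = ReIterable(lambda: gen(3))
first_pass = list(chunks)
second_pass = list(chunks)
```

With the question's code, the factory would be something like lambda: read_sql_query(query, 100000, cnxn), which re-runs the query on each pass.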

Answer 2

sub_df = pd.DataFrame()
for chunks in df:            
    sub_df = pd.concat([sub_df, ...
print(sub_df.info())
return sub_df

I'm not sure why sub_df is assigned twice; that makes the first assignment pointless.

To solve this kind of problem, you need to work backwards. First, run just this one command on its own:

sub_df = pd.concat([sub_df, ...

passing static values rather than variables as the arguments.

If that works, then you need to find out why your original program fails to give pd.concat the correct arguments.
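
As an illustration of that kind of isolated check, the filter-and-concat step can be run on a small hand-written DataFrame with literal dates. The sample rows are made up. If this works but the real run returns nothing, the input iterable is the suspect rather than pd.concat.

```python
import pandas as pd

# Hand-written chunk with literal values (made-up data).
chunk = pd.DataFrame({
    'product_key': [1, 2, 3],
    'trans_date': ['2015-12-31', '2016-03-10', '2017-01-01'],
})

sub_df = pd.DataFrame()
sub_df = pd.concat(
    [sub_df, chunk.loc[(chunk['trans_date'] > '2016-01-01')
                       & (chunk['trans_date'] < '2016-12-31')]],
    ignore_index=True,
)
print(len(sub_df))
```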

Answer 1: score 1, accepted, 2017-10-17 04:31:06

Answer 2: score 0, 2017-10-17 03:54:56