How can I efficiently build a pandas DataFrame (or dict) by selecting lists of data from another, bigger DataFrame?

Question

I need to create a DataFrame or a dict. If N = 3 (the number of lists inside the other lists), the expected output is:

d = {
    'xs0': [[7.0, 986.0], [17.0, 6.0], [7.0, 67.0]],
    'ys0': [[79.0, 69.0], [179.0, 169.0], [729.0, 69.0]],
    'xs1': [[17.0, 166.0], [17.0, 116.0], [17.0, 126.0]],
    'ys1': [[179.0, 169.0], [179.0, 1169.0], [1729.0, 169.0]],
    'xs2': [[27.0, 276.0], [27.0, 216.0], [27.0, 226.0]],
    'ys2': [[279.0, 269.0], [279.0, 2619.0], [2579.0, 2569.0]]
}

To do that, I wrote the code below, but I need it to run faster:

import numpy as np
import pandas as pd

df_dict = {
    'X1': [1, 2, 3, 4, 5, 6, 7, 8, np.nan],
    'Y1': [9, 29, 39, 49, np.nan, 69, 79, 89, 99],
    'X2': [11, 12, 13, 14, 15, 16, 17, 18, np.nan],
    'Y2': [119, 129, 139, 149, np.nan, 169, 179, 189, 199],
    'X3': [21, 22, 23, 24, 25, 26, 27, 28, np.nan],
    'Y3': [219, 229, 239, 249, np.nan, 269, 279, 289, 299],
    'S': [123, 11, 123, 11, 123, 123, 123, 35, 123],
    'C': [9, 8, 7, 6, 5, 4, 3, 2, 1],
    'F': [1, 1, 1, 1, 2, 3, 3, 3, 3],
    'OTHER': [10, 20, 30, 40, 50, 60, 70, 80, 90],
}
bigger_df = pd.DataFrame(df_dict)

plots = [
    { 'x': 'X1', 'y': 'Y1', },
    { 'x': 'X2', 'y': 'Y2', },
    { 'x': 'X3', 'y': 'Y3', }
]

N = 3
d = {}
s_list = [123, 145, 35]
n = 0
for p in plots:
    # INITIALIZES THE DICTIONARY ELEMENTS
    d['xs{}'.format(n)] = [[] for x in range(N)]
    d['ys{}'.format(n)] = [[] for x in range(N)]        

    # BUILDS THE LISTS FOR THOSE ELEMENTS
    for index in range(3):
        df = bigger_df.filter([p['x'], p['y'], 'S', 'F', 'C'])        # selects the minimum of columns needed
        df = df[df['F'].isin([2, 3, 4, 9]) & df[p['x']].notnull() & df[p['y']].notnull() & (df.S == s_list[index])]
        df.sort_values(['C'], ascending=[True], inplace=True)

        d['xs{}'.format(n)][index] = list(df[p['x']])
        d['ys{}'.format(n)][index] = list(df[p['y']])
    n += 1
print(d)

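I would like to know whether some pandas or numpy trick could replace building the dictionary in a loop. A pandas DataFrame instead of a dict as the result would also suit me, or even better, but not if it is less efficient.

Any ideas?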

Answer 1

Judging from your input and expected output (is it the same value repeated in the list for each key?), you can at least rewrite the for p in plots loop like this:

for p in plots:
    # Select the data you want
    df = bigger_df.filter([p['x'], p['y'], 'S', 'F', 'C'])        # selects the minimum of columns needed
    df = df[df['F'].isin([2, 3, 4, 9]) & df[p['x']].notnull() & df[p['y']].notnull() & (df.S == 123)]   # I have used 123 to simplify, actually the value is an integer variable
    df.sort_values(['C'], ascending=[True], inplace=True)
    # fill the dictionary
    d['xs{}'.format(n)] = [list(df[p['x']]) for x in range(N)]
    d['ys{}'.format(n)] = [list(df[p['y']]) for x in range(N)]
    n += 1

At the very least you save the for index in range(3) loop, which performs the same operation on bigger_df three times. Measured with timeit, my code went from 210 ms down to 70.5 ms (about a third).
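
A comparison of this kind can be reproduced roughly as follows. This is only a sketch: select, loop_per_index and filter_once are hypothetical names wrapping the two variants, both use the fixed value S == 123 as in the simplified code above, and the absolute timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

# Same sample data as in the question
bigger_df = pd.DataFrame({
    'X1': [1, 2, 3, 4, 5, 6, 7, 8, np.nan],
    'Y1': [9, 29, 39, 49, np.nan, 69, 79, 89, 99],
    'X2': [11, 12, 13, 14, 15, 16, 17, 18, np.nan],
    'Y2': [119, 129, 139, 149, np.nan, 169, 179, 189, 199],
    'X3': [21, 22, 23, 24, 25, 26, 27, 28, np.nan],
    'Y3': [219, 229, 239, 249, np.nan, 269, 279, 289, 299],
    'S': [123, 11, 123, 11, 123, 123, 123, 35, 123],
    'C': [9, 8, 7, 6, 5, 4, 3, 2, 1],
    'F': [1, 1, 1, 1, 2, 3, 3, 3, 3],
})
plots = [{'x': 'X1', 'y': 'Y1'}, {'x': 'X2', 'y': 'Y2'}, {'x': 'X3', 'y': 'Y3'}]
N = 3


def select(p, s):
    """Hypothetical helper: filter and sort the rows for one plot and one S."""
    df = bigger_df.filter([p['x'], p['y'], 'S', 'F', 'C'])
    df = df[df['F'].isin([2, 3, 4, 9]) & df[p['x']].notnull()
            & df[p['y']].notnull() & (df.S == s)]
    return df.sort_values('C')


def loop_per_index():
    # Original shape: the selection is redone N times for every plot
    d = {}
    for n, p in enumerate(plots):
        frames = [select(p, 123) for _ in range(N)]
        d['xs{}'.format(n)] = [list(df[p['x']]) for df in frames]
        d['ys{}'.format(n)] = [list(df[p['y']]) for df in frames]
    return d


def filter_once():
    # Rewritten shape: one selection per plot, reused N times
    d = {}
    for n, p in enumerate(plots):
        df = select(p, 123)
        d['xs{}'.format(n)] = [list(df[p['x']]) for _ in range(N)]
        d['ys{}'.format(n)] = [list(df[p['y']]) for _ in range(N)]
    return d


# Both variants must agree before their timings are worth comparing
assert loop_per_index() == filter_once()

for func in (loop_per_index, filter_once):
    seconds = timeit.timeit(func, number=20) / 20
    print('{}: {:.2f} ms per call'.format(func.__name__, seconds * 1e3))
```

Checking that both functions return the same dict first guards against comparing the speed of two things that do not compute the same result.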

EDIT: given the way you have redefined your problem, I think this does what you want:

# put this code after the definition of plots
s_list = [123, 145, 35]
# create an empty DF to add your results in the loop
df_output = pd.DataFrame(index=s_list, columns=['xs0','ys0', 'xs1', 'ys1', 'xs2', 'ys2']) 
n = 0
for p in plots:
    # Select the data you want and sort them on the same line
    df_p = bigger_df[bigger_df['F'].isin([2, 3, 4, 9]) & bigger_df[p['x']].notnull() & bigger_df[p['y']].notnull() & bigger_df['S'].isin(s_list)].sort_values(['C'], ascending=[True])
    # on bigger df I would do a bit differently if the isin on F and S are the same for the three plots, 
    # I would create a df_select_FS outside of the loop before (might be faster)

    #  Now, you can do groupby on S and then you create a list of element in column p['x'] (and same for p['y'])
    # and you add them in you empty df_output in the right column
    df_output['xs{}'.format(n)] = df_p.groupby('S').apply(lambda x: list(x[p['x']]))
    df_output['ys{}'.format(n)] = df_p.groupby('S').apply(lambda x: list(x[p['y']]))
    n += 1

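Two notes: first, if s_list contains the same value twice, this may not work the way you want; second, where the conditions are not met (such as 145 for S in your example), df_output will contain NaN.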
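
If the dict from the question is what is ultimately needed, both notes can be worked around. The following is a minimal sketch rather than a drop-in replacement: it applies the F and S filters and the sort on C once outside the loop (as suggested in the code comments above), groups each column by S, and then looks up every entry of s_list, falling back to an empty list when a value such as 145 has no matching rows. The helper name build_dict and its f_values parameter are hypothetical.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question
bigger_df = pd.DataFrame({
    'X1': [1, 2, 3, 4, 5, 6, 7, 8, np.nan],
    'Y1': [9, 29, 39, 49, np.nan, 69, 79, 89, 99],
    'X2': [11, 12, 13, 14, 15, 16, 17, 18, np.nan],
    'Y2': [119, 129, 139, 149, np.nan, 169, 179, 189, 199],
    'X3': [21, 22, 23, 24, 25, 26, 27, 28, np.nan],
    'Y3': [219, 229, 239, 249, np.nan, 269, 279, 289, 299],
    'S': [123, 11, 123, 11, 123, 123, 123, 35, 123],
    'C': [9, 8, 7, 6, 5, 4, 3, 2, 1],
    'F': [1, 1, 1, 1, 2, 3, 3, 3, 3],
})
plots = [{'x': 'X1', 'y': 'Y1'}, {'x': 'X2', 'y': 'Y2'}, {'x': 'X3', 'y': 'Y3'}]
s_list = [123, 145, 35]


def build_dict(df, plots, s_list, f_values=(2, 3, 4, 9)):
    """Hypothetical helper: build the xs{n}/ys{n} dict from the question."""
    # Filter on F and S and sort on C once, outside the loop over plots
    selected = df[df['F'].isin(f_values) & df['S'].isin(s_list)].sort_values('C')
    d = {}
    for n, p in enumerate(plots):
        # Drop rows where either coordinate of this plot is missing
        valid = selected.dropna(subset=[p['x'], p['y']])
        for prefix, col in (('xs', p['x']), ('ys', p['y'])):
            # One list per value of S, keeping the sort order of C
            groups = valid.groupby('S')[col].apply(list).to_dict()
            # One entry per element of s_list; no matching rows gives []
            d['{}{}'.format(prefix, n)] = [groups.get(s, []) for s in s_list]
    return d


d = build_dict(bigger_df, plots, s_list)
print(d)
```

With the sample data, d['xs0'] comes out as [[7.0, 6.0], [], [8.0]], the same as the loop in the question produces. Because the result is built by iterating over s_list, a value that appears twice in s_list simply yields the same list twice.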

Answer 1: accepted, 2018-04-25 15:21:49
