How to build a pandas dataframe (or dict) in an efficient way by selecting some lists of data from another bigger dataframe?

I need to create a DataFrame or dictionary. If N = 3 (the number of lists inside each outer list), the expected output is this:

d = {
    'xs0': [[7.0, 986.0], [17.0, 6.0], [7.0, 67.0]],
    'ys0': [[79.0, 69.0], [179.0, 169.0], [729.0, 69.0]],
    'xs1': [[17.0, 166.0], [17.0, 116.0], [17.0, 126.0]],
    'ys1': [[179.0, 169.0], [179.0, 1169.0], [1729.0, 169.0]],
    'xs2': [[27.0, 276.0], [27.0, 216.0], [27.0, 226.0]],
    'ys2': [[279.0, 269.0], [279.0, 2619.0], [2579.0, 2569.0]]
}

For this I have written the following code, but I need it to run faster:

import numpy as np
import pandas as pd

df_dict = {
    'X1': [1, 2, 3, 4, 5, 6, 7, 8, np.nan],
    'Y1': [9, 29, 39, 49, np.nan, 69, 79, 89, 99],
    'X2': [11, 12, 13, 14, 15, 16, 17, 18, np.nan],
    'Y2': [119, 129, 139, 149, np.nan, 169, 179, 189, 199],
    'X3': [21, 22, 23, 24, 25, 26, 27, 28, np.nan],
    'Y3': [219, 229, 239, 249, np.nan, 269, 279, 289, 299],
    'S': [123, 11, 123, 11, 123, 123, 123, 35, 123],
    'C': [9, 8, 7, 6, 5, 4, 3, 2, 1],
    'F': [1, 1, 1, 1, 2, 3, 3, 3, 3],
    'OTHER': [10, 20, 30, 40, 50, 60, 70, 80, 90],
}
bigger_df = pd.DataFrame(df_dict)

plots = [
    { 'x': 'X1', 'y': 'Y1', },
    { 'x': 'X2', 'y': 'Y2', },
    { 'x': 'X3', 'y': 'Y3', }
]

N = 3
d = {}
s_list = [123, 145, 35]
n = 0
for p in plots:
    d['xs{}'.format(n)] = [[] for x in range(N)]
    d['ys{}'.format(n)] = [[] for x in range(N)]        

    for index in range(3):
        df = bigger_df.filter([p['x'], p['y'], 'S', 'F', 'C'])        # selects the minimum of columns needed
        df = df[df['F'].isin([2, 3, 4, 9]) & df[p['x']].notnull() & df[p['y']].notnull() & (df.S == s_list[index])]
        df.sort_values(['C'], ascending=[True], inplace=True)

        d['xs{}'.format(n)][index] = list(df[p['x']])
        d['ys{}'.format(n)][index] = list(df[p['y']])
    n += 1

I am wondering whether, instead of building the dictionary in a loop, I could do some trick with pandas or numpy. A pandas dataframe instead of a dictionary would also be fine for me, or even better, but I do not know if it would be more efficient.

Any ideas?

Depending on your input and your expected output (is it three times the same pair of values in the list for each key?), you can at least replace your for p in plots loop with:

for p in plots:
    # Select the data you want
    df = bigger_df.filter([p['x'], p['y'], 'S', 'F', 'C'])        # selects the minimum of columns needed
    df = df[df['F'].isin([2, 3, 4, 9]) & df[p['x']].notnull() & df[p['y']].notnull() & (df.S == 123)]   # I have used 123 to simplify, actually the value is an integer variable
    df.sort_values(['C'], ascending=[True], inplace=True)
    # fill the dictionary
    d['xs{}'.format(n)] = [list(df[p['x']]) for x in range(N)]
    d['ys{}'.format(n)] = [list(df[p['y']]) for x in range(N)]
    n += 1

This way you at least drop the for index in range(3) loop and avoid repeating the same operation on bigger_df three times. With timeit, I went from 210 ms with your code to 70.5 ms (around a third) with this version.
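A timing comparison along these lines can be sketched with the standard timeit module. This is only a minimal sketch on a reduced copy of the example data: the helper names select, filter_inside_loop and filter_once are hypothetical, S is fixed to 123 as in the snippet above, and the numbers you get will depend on your machine and on the real size of bigger_df.

```python
import timeit

import numpy as np
import pandas as pd

# Reduced copy of the example data (only the columns used for one plot).
bigger_df = pd.DataFrame({
    'X1': [1, 2, 3, 4, 5, 6, 7, 8, np.nan],
    'Y1': [9, 29, 39, 49, np.nan, 69, 79, 89, 99],
    'S': [123, 11, 123, 11, 123, 123, 123, 35, 123],
    'C': [9, 8, 7, 6, 5, 4, 3, 2, 1],
    'F': [1, 1, 1, 1, 2, 3, 3, 3, 3],
})
N = 3


def select(df, x, y, s):
    """Hypothetical helper: filter the rows for one plot and sort them by C."""
    mask = (df['F'].isin([2, 3, 4, 9]) & df[x].notnull()
            & df[y].notnull() & (df['S'] == s))
    return df.loc[mask, [x, y, 'C']].sort_values('C')


def filter_inside_loop():
    # Repeats the same selection N times, as in the original inner loop.
    return [list(select(bigger_df, 'X1', 'Y1', 123)['X1']) for _ in range(N)]


def filter_once():
    # Runs the selection once and copies the resulting list N times.
    xs = list(select(bigger_df, 'X1', 'Y1', 123)['X1'])
    return [list(xs) for _ in range(N)]


# Both variants must build the same lists before timing them is meaningful.
assert filter_inside_loop() == filter_once()

t_loop = timeit.timeit(filter_inside_loop, number=200)
t_once = timeit.timeit(filter_once, number=200)
print('filter inside loop: {:.4f} s, filter once: {:.4f} s'.format(t_loop, t_once))
```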

EDIT: with the way you redefined your question, I think this might do the job you want:

# put this code after the definition of plots
s_list = [123, 145, 35]
# create an empty DF to add your results in the loop
df_output = pd.DataFrame(index=s_list, columns=['xs0','ys0', 'xs1', 'ys1', 'xs2', 'ys2']) 
n = 0
for p in plots:
    # Select the data you want and sort them on the same line
    df_p = bigger_df[bigger_df['F'].isin([2, 3, 4, 9]) & bigger_df[p['x']].notnull() & bigger_df[p['y']].notnull() & bigger_df['S'].isin(s_list)].sort_values(['C'], ascending=[True])
    # on a bigger df, if the isin filters on F and S are the same for the three plots,
    # I would create a df_select_FS outside of the loop beforehand (might be faster)

    # Now you can group by S and build a list of the elements in column p['x'] (and the same for p['y'])
    # and add them to your empty df_output in the right column
    df_output['xs{}'.format(n)] = df_p.groupby('S').apply(lambda x: list(x[p['x']]))
    df_output['ys{}'.format(n)] = df_p.groupby('S').apply(lambda x: list(x[p['y']]))
    n += 1
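The idea from the comments above (filter on F and S once, outside the loop) could look roughly like the following. It is a sketch under assumptions, not a measured improvement: df_select_FS is the name suggested in the comment, the columns dict is introduced here for illustration, and groupby(...)[col].agg(list) is swapped in for the apply with a lambda.

```python
import numpy as np
import pandas as pd

bigger_df = pd.DataFrame({
    'X1': [1, 2, 3, 4, 5, 6, 7, 8, np.nan],
    'Y1': [9, 29, 39, 49, np.nan, 69, 79, 89, 99],
    'X2': [11, 12, 13, 14, 15, 16, 17, 18, np.nan],
    'Y2': [119, 129, 139, 149, np.nan, 169, 179, 189, 199],
    'X3': [21, 22, 23, 24, 25, 26, 27, 28, np.nan],
    'Y3': [219, 229, 239, 249, np.nan, 269, 279, 289, 299],
    'S': [123, 11, 123, 11, 123, 123, 123, 35, 123],
    'C': [9, 8, 7, 6, 5, 4, 3, 2, 1],
    'F': [1, 1, 1, 1, 2, 3, 3, 3, 3],
})
plots = [
    {'x': 'X1', 'y': 'Y1'},
    {'x': 'X2', 'y': 'Y2'},
    {'x': 'X3', 'y': 'Y3'},
]
s_list = [123, 145, 35]

# Filter on F and S and sort on C only once, since this part
# does not depend on which plot is being processed.
df_select_FS = bigger_df[bigger_df['F'].isin([2, 3, 4, 9])
                         & bigger_df['S'].isin(s_list)].sort_values('C')

columns = {}
for n, p in enumerate(plots):
    # Only the null check depends on the plot's own columns.
    df_p = df_select_FS.dropna(subset=[p['x'], p['y']])
    grouped = df_p.groupby('S')
    # One list of values per value of S; row order (sorted by C) is kept.
    columns['xs{}'.format(n)] = grouped[p['x']].agg(list)
    columns['ys{}'.format(n)] = grouped[p['y']].agg(list)

# Rows follow the order of s_list; values of S with no matching rows become NaN.
df_output = pd.DataFrame(columns).reindex(s_list)
print(df_output)
```

Whether this is actually faster on your real data is something to check with timeit; the filter is simply applied once instead of once per plot.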

Two notes: first, if the same value appears twice in your s_list, it might not work the way you want; second, where the conditions are not met (like 145 in S in your example), you get NaN in df_output.
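If you need the dictionary form from the question rather than a dataframe, one possible way to deal with both notes is sketched below. The helper to_plot_dict is hypothetical, and the small df_output here is built by hand only to keep the example self-contained; it assumes df_output has one row per distinct value of S, so that reindex can repeat rows for duplicated values in s_list and the NaN cells can be turned into empty lists.

```python
import pandas as pd


def to_plot_dict(df_output, s_list):
    """Hypothetical helper: one list per entry of s_list for every column.

    Assumes the index of df_output holds unique values of S.
    """
    # reindex repeats rows for values that occur several times in s_list
    # and yields NaN for values of S that had no matching rows.
    aligned = df_output.reindex(s_list)
    return {
        col: [list(v) if isinstance(v, list) else [] for v in aligned[col]]
        for col in aligned.columns
    }


# Hand-built stand-in for df_output, indexed by the S values that had matches.
df_output = pd.DataFrame(
    {'xs0': [[7.0, 6.0], [8.0]], 'ys0': [[79.0, 69.0], [89.0]]},
    index=[123, 35],
)

d = to_plot_dict(df_output, [123, 145, 35, 123])
print(d)
```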

