
How to efficiently build a pandas DataFrame (or dict) by selecting lists of data from another, bigger DataFrame?

I need to create a DataFrame or dictionary. If N = 3 (the number of lists inside each outer list), the expected output is this:

d = {
    'xs0': [[7.0, 986.0], [17.0, 6.0], [7.0, 67.0]],
    'ys0': [[79.0, 69.0], [179.0, 169.0], [729.0, 69.0]],
    'xs1': [[17.0, 166.0], [17.0, 116.0], [17.0, 126.0]],
    'ys1': [[179.0, 169.0], [179.0, 1169.0], [1729.0, 169.0]],
    'xs2': [[27.0, 276.0], [27.0, 216.0], [27.0, 226.0]],
    'ys2': [[279.0, 269.0], [279.0, 2619.0], [2579.0, 2569.0]]
}

For this I have written the following code, but I need it to run faster:

import numpy as np
import pandas as pd

df_dict = {
    'X1': [1, 2, 3, 4, 5, 6, 7, 8, np.nan],
    'Y1': [9, 29, 39, 49, np.nan, 69, 79, 89, 99],
    'X2': [11, 12, 13, 14, 15, 16, 17, 18, np.nan],
    'Y2': [119, 129, 139, 149, np.nan, 169, 179, 189, 199],
    'X3': [21, 22, 23, 24, 25, 26, 27, 28, np.nan],
    'Y3': [219, 229, 239, 249, np.nan, 269, 279, 289, 299],
    'S': [123, 11, 123, 11, 123, 123, 123, 35, 123],
    'C': [9, 8, 7, 6, 5, 4, 3, 2, 1],
    'F': [1, 1, 1, 1, 2, 3, 3, 3, 3],
    'OTHER': [10, 20, 30, 40, 50, 60, 70, 80, 90],
}
bigger_df = pd.DataFrame(df_dict)

plots = [
    { 'x': 'X1', 'y': 'Y1', },
    { 'x': 'X2', 'y': 'Y2', },
    { 'x': 'X3', 'y': 'Y3', }
]

N = 3
d = {}
s_list = [123, 145, 35]
n = 0
for p in plots:
    # INITIALIZES THE DICTIONARY ELEMENTS
    d['xs{}'.format(n)] = [[] for x in range(N)]
    d['ys{}'.format(n)] = [[] for x in range(N)]        

    # BUILDS THE LISTS FOR THOSE ELEMENTS
    for index in range(3):
        df = bigger_df.filter([p['x'], p['y'], 'S', 'F', 'C'])        # selects the minimum of columns needed
        df = df[df['F'].isin([2, 3, 4, 9]) & df[p['x']].notnull() & df[p['y']].notnull() & (df.S == s_list[index])]
        df.sort_values(['C'], ascending=[True], inplace=True)

        d['xs{}'.format(n)][index] = list(df[p['x']])
        d['ys{}'.format(n)][index] = list(df[p['y']])
    n += 1
print(d)

I am wondering if, instead of building the dictionary in a loop, I could use some trick with pandas or numpy. A pandas DataFrame instead of a dictionary would also be fine for me, or even better, but I do not know whether it would be more efficient.
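For what it is worth, the target dictionary d shown above converts directly to a DataFrame (one column per key, one cell per inner list), although I do not know whether that representation is any more efficient; expected_df below is just an illustrative name:

# Each key of d becomes a column; each of the three inner lists becomes one cell
expected_df = pd.DataFrame(d)
print(expected_df)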

Some ideas?

Depending on your input and your expected output (three times the same pair of values in your list for each key?), you can at least replace your for p in plots loop with:

for p in plots:
    # Select the data you want
    df = bigger_df.filter([p['x'], p['y'], 'S', 'F', 'C'])        # selects the minimum of columns needed
    df = df[df['F'].isin([2, 3, 4, 9]) & df[p['x']].notnull() & df[p['y']].notnull() & (df.S == 123)]   # I have used 123 to simplify, actually the value is an integer variable
    df.sort_values(['C'], ascending=[True], inplace=True)
    # fill the dictionary
    d['xs{}'.format(n)] = [list(df[p['x']]) for x in range(N)]
    d['ys{}'.format(n)] = [list(df[p['y']]) for x in range(N)]
    n += 1

At the very least you save the for index in range(3) loop and avoid doing the same operations on your bigger_df three times. With timeit, I dropped from 210 ms with your code to 70.5 ms (around a third) with this one.
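For reference, a minimal sketch of how such a comparison can be timed with timeit, wrapping the revised loop in a helper function; it assumes the setup from the question (bigger_df, plots, N) is already defined, build_dict_revised is just an illustrative name, and the exact figures will vary by machine:

import timeit

def build_dict_revised():
    # The single-loop version from the snippet above, wrapped so timeit can call it
    d = {}
    n = 0
    for p in plots:
        df = bigger_df.filter([p['x'], p['y'], 'S', 'F', 'C'])
        df = df[df['F'].isin([2, 3, 4, 9]) & df[p['x']].notnull() & df[p['y']].notnull() & (df.S == 123)]
        df = df.sort_values(['C'], ascending=[True])   # non-inplace sort, same result
        d['xs{}'.format(n)] = [list(df[p['x']]) for x in range(N)]
        d['ys{}'.format(n)] = [list(df[p['y']]) for x in range(N)]
        n += 1
    return d

# Average time per call in milliseconds over 100 runs
print(timeit.timeit(build_dict_revised, number=100) / 100 * 1000, 'ms')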

EDIT: with the way you have redefined your question, I think this might do the job you want:

# put this code after the definition of plots
s_list = [123, 145, 35]
# create an empty DF to add your results in the loop
df_output = pd.DataFrame(index=s_list, columns=['xs0','ys0', 'xs1', 'ys1', 'xs2', 'ys2']) 
n = 0
for p in plots:
    # Select the data you want and sort them on the same line
    df_p = bigger_df[bigger_df['F'].isin([2, 3, 4, 9]) & bigger_df[p['x']].notnull() & bigger_df[p['y']].notnull() & bigger_df['S'].isin(s_list)].sort_values(['C'], ascending=[True])
    # On a bigger df I would do this a bit differently: if the isin filters on F and S
    # are the same for the three plots, I would create a df_select_FS outside of the
    # loop beforehand (it might be faster); see the sketch after this code block.

    # Now you can group by S and build a list of the elements in column p['x'] (and the
    # same for p['y']), and add them to your empty df_output in the right column
    df_output['xs{}'.format(n)] = df_p.groupby('S').apply(lambda x: list(x[p['x']]))
    df_output['ys{}'.format(n)] = df_p.groupby('S').apply(lambda x: list(x[p['y']]))
    n += 1
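As the comment in the loop suggests, if the isin filters on F and S really are the same for the three plots, that common filtering can be hoisted out of the loop. A minimal sketch of that idea, under that assumption (df_select_FS is the name from the comment, the rest reuses the setup above):

# Filter on F and S once, before the loop; these masks are identical for every plot
df_select_FS = bigger_df[bigger_df['F'].isin([2, 3, 4, 9]) & bigger_df['S'].isin(s_list)]

df_output = pd.DataFrame(index=s_list, columns=['xs0', 'ys0', 'xs1', 'ys1', 'xs2', 'ys2'])
n = 0
for p in plots:
    # Only the per-plot null checks and the sort stay inside the loop
    df_p = df_select_FS[df_select_FS[p['x']].notnull() & df_select_FS[p['y']].notnull()].sort_values('C')
    df_output['xs{}'.format(n)] = df_p.groupby('S').apply(lambda x: list(x[p['x']]))
    df_output['ys{}'.format(n)] = df_p.groupby('S').apply(lambda x: list(x[p['y']]))
    n += 1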

Two notes: first, if your s_list contains the same value twice, it might not work the way you want; second, where the conditions are not met (like 145 in S in your example) you get NaN in your df_output.
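If those NaN placeholders get in the way, one option is to replace them with empty lists after the loop. A minimal sketch, assuming an empty list is an acceptable stand-in for an S value with no matching rows:

# Replace every cell that is not a list (i.e. the NaN placeholders) with an empty list
df_output = df_output.applymap(lambda v: v if isinstance(v, list) else [])
print(df_output)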
