
Filter pandas dataframe based on different dtypes

I'm creating an initial df from a csv file like the following:

knobs_df = pd.read_csv(knobs_container)
        name     type                                             values
0  algorithm   string                                      one;two;three
1    threads  int32_t                1;2;3;4;5;6;7;8;9;10;11;12;13;14;15

For every row, I extract the type column and the values column into two dictionaries, k_types and k_values, keyed by the name column.

    k_values = {}
    k_types = {}
    for row in knobs_df.itertuples(index=False):
        k_values[row[0]] = row[2].split(';')
        k_types[row[0]] = row[1]

{'algorithm': ['one', 'two', 'three'], 'threads': ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15']}
{'algorithm': 'string', 'threads': 'int32_t'}

From the k_values dictionary I generate a full grid containing all the possible combinations.

   algorithm threads
0        one       1
1        two       1
2      three       1
3        one       2
4        two       2
..       ...     ...
88       two      14
89     three      14
90       one      15
91       two      15
92     three      15
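
The question does not show how the grid is built. One possible way, sketched here with itertools.product (an assumption, not necessarily what the asker used), is below. Row order may differ from the listing above, and every column still holds strings at this point.

```python
import itertools
import pandas as pd

# Same dictionary as printed in the question.
k_values = {
    'algorithm': ['one', 'two', 'three'],
    'threads': [str(i) for i in range(1, 16)],
}

# Cartesian product of all value lists: one row per combination.
full_grid = pd.DataFrame(
    list(itertools.product(*k_values.values())),
    columns=list(k_values.keys()),
)

print(full_grid.shape)  # (45, 2)
```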

I have a list of constraints (Python expressions) like the following:

['threads < 20', 'algorithm != "two"']

I'd like to filter the full-grid dataframe using the query method of pandas.DataFrame. Is there a way to assign each column its corresponding dtype based on the k_types dictionary? I need this because each column can have its own type and, for instance, query fails to filter the 'threads' column, since every column is inferred as 'str' by default when the grid is created. The problem is that the types are originally C++ datatypes, so I don't know whether there is a way to achieve this.

Possible k_types are:

[string, short int, int8_t, int16_t, int32_t, int64_t, uint8_t, uint16_t, uint32_t, uint64_t, char, int, long int, long long int, int_fast8_t, int_fast16_t, int_fast32_t, int_fast64_t, int_least8_t, int_least16_t, int_least32_t, int_least64_t, unsigned short int, unsigned char, unsigned int, unsigned long int, unsigned long long int, uint_fast8_t, uint_fast16_t, uint_fast32_t, uint_fast64_t, uint_least8_t, uint_least16_t, uint_least32_t, uint_least64_t, intmax_t, intptr_t, uintmax_t, uintptr_t, float, double, long double]

I managed to find only an incomplete solution, due to some misunderstanding on my part. Please let me know how to make it fit your needs:

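import pandas as pd

# df is the frame read from the CSV (knobs_df in the question).
t_df = df.T
names = t_df.loc['name']
dtypes = t_df.loc['type']
t_df.columns = names
t_df = t_df.iloc[2:].copy()  # keep only the 'values' row
# Only the two types in the example are mapped; extend as needed.
dtype_conv = {'string': str, 'int32_t': int}
for dtype, name in zip(dtypes, names):
    t_df[name] = t_df[name].str.split(';')  # 'a;b;c' -> ['a', 'b', 'c']
    t_df = t_df.explode(name)               # one row per value
    t_df[name] = t_df[name].astype(dtype_conv[dtype])
# A stable sort keeps the algorithm order within each thread count.
t_df = t_df.sort_values('threads', kind='stable').reset_index(drop=True)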

output:

  algorithm  threads
0       one        1
1       two        1
2     three        1
3       one        2
4       two        2
5     three        2
6       one        3
7       two        3
...
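
To cover more of the k_types list and then apply the constraints, something like the following might work. It is only a sketch: CPP_TO_PANDAS is a hypothetical, partial mapping that I made up for illustration. It covers the fixed-width types plus float and double, and leaves out types whose width depends on the platform (int, long int, long double and so on), which you would have to map according to your target.

```python
import itertools
import pandas as pd

k_values = {
    'algorithm': ['one', 'two', 'three'],
    'threads': [str(i) for i in range(1, 16)],
}
k_types = {'algorithm': 'string', 'threads': 'int32_t'}
constraints = ['threads < 20', 'algorithm != "two"']

# Hypothetical, partial mapping from C++ type names to pandas dtypes.
# Platform-dependent types are deliberately omitted (assumption).
CPP_TO_PANDAS = {
    'string': str,
    'int8_t': 'int8', 'int16_t': 'int16',
    'int32_t': 'int32', 'int64_t': 'int64',
    'uint8_t': 'uint8', 'uint16_t': 'uint16',
    'uint32_t': 'uint32', 'uint64_t': 'uint64',
    'float': 'float32', 'double': 'float64',
}

# Full grid of string values, one row per combination.
grid = pd.DataFrame(
    list(itertools.product(*k_values.values())),
    columns=list(k_values.keys()),
)

# Cast each column according to its C++ type name.
typed = grid.astype({col: CPP_TO_PANDAS[k_types[col]] for col in grid.columns})

# Join the constraints into a single boolean expression for query.
expr = ' and '.join(f'({c})' for c in constraints)
filtered = typed.query(expr).reset_index(drop=True)

print(typed.dtypes)
print(len(filtered))  # 30
```

An unknown type name raises a KeyError in the dict lookup, which is probably what you want rather than silently leaving the column as strings.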
