
How to insert a multidimensional numpy array to pandas column?

I have some numpy array, whose number of rows (axis=0) is the same as a pandas dataframe's number of rows.

I want to create a new column in the dataframe, for which each entry would be a numpy array of a lesser dimension.

Code:

    import numpy as np
    import pandas as pd

    some_df = pd.DataFrame(columns=['A'])
    for i in range(10):
        some_df.loc[i] = [np.random.rand(4, 6, 8)]

    data = np.stack(some_df['A'].values)  #shape (10, 4, 6, 8)
    processed = np.max(data, axis=1)  # shape (10, 6, 8)

    some_df['B'] = processed  # This fails

I want the new column 'B' to contain numpy arrays of shape (6, 8)

How can this be done?

Storing arrays in a column is not recommended: it is awkward and slow, and it makes later processing harder.

One possible solution is to use a list comprehension:

some_df['B'] = [x for x in processed]

Or convert to list and assign:

some_df['B'] = processed.tolist()
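To make the difference between the two variants concrete, here is a minimal sketch. The frame and the random data are stand-ins made up for illustration, not the asker's actual data, and `list(processed)` is used as shorthand for the list comprehension above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
processed = rng.random((10, 6, 8))         # stand-in for the reduced array
some_df = pd.DataFrame({"A": range(10)})   # any frame with 10 rows

# Iterating over a 3-D array yields its 2-D slices along axis 0,
# so each row receives one (6, 8) array.
some_df["B"] = list(processed)

# tolist() converts all the way down, so each row receives nested Python lists.
some_df["C"] = processed.tolist()

print(some_df["B"].dtype)            # object
print(type(some_df["B"].iloc[0]))    # numpy.ndarray
print(type(some_df["C"].iloc[0]))    # list

# Stacking the cells of B rebuilds the original 3-D array.
restored = np.stack(some_df["B"].tolist())
```

Either way the column has dtype object, which is why vectorized pandas operations on it are not available and why the approach is discouraged.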

I know this question already has an answer, but I would like to add a much more scalable way of doing this. As mentioned in the comments above, it is generally not recommended to store arrays as "field" values in a pandas DataFrame column (I actually do not know why). Nevertheless, in my day-to-day work this is an extremely important piece of functionality when working with time-series data and a bunch of related metadata. In general I organize my experimental time series as pandas DataFrames, with one column holding same-length numpy arrays and the other columns holding metadata about the measurement conditions etc.

The solution proposed by jezrael works very well, and I have used it regularly for the last 4 years. But this method can run into huge memory problems. In my case I came across them when working with DataFrames beyond 5 million rows and time series with approx. 100 data points.

The solution to these problems is extremely simple. Since I did not find it anywhere, I just wanted to share it here: simply transform your 2D array into a pandas Series object and assign it to a column of your DataFrame:

df["new_list_column"] = pd.Series(list(numpy_array_2D))
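One caveat with this approach: assigning a Series aligns on the index, so if the DataFrame does not use the default 0..n-1 index the rows will not line up. A minimal sketch, with made-up data and a made-up `meta` column, that passes the frame's own index explicitly:

```python
import numpy as np
import pandas as pd

numpy_array_2D = np.arange(12, dtype=float).reshape(4, 3)
df = pd.DataFrame({"meta": list("abcd")}, index=[10, 20, 30, 40])

# Without index=..., the new Series would be labelled 0..3. None of those
# labels match 10..40, so every row of the new column would come out as NaN.
df["new_list_column"] = pd.Series(list(numpy_array_2D), index=df.index)

print(df["new_list_column"].loc[20])   # the second row of numpy_array_2D
```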

Coming back to this after 2 years, here is a much better practice:

from itertools import product, chain
import pandas as pd
import numpy as np
from typing import Dict


def calc_col_names(named_shape):
    *prefix, shape = named_shape
    names = [map(str, range(i)) for i in shape]
    return map('_'.join, product(prefix, *names))


def create_flat_columns_df_from_dict_of_numpy(
        named_np: Dict[str, np.ndarray],
        n_samples_per_np: int,
):
    # keep only the arrays whose first dimension matches the number of samples
    named_np_correct_length = {k: v for k, v in named_np.items() if len(v) == n_samples_per_np}
    # flatten every array to 2D: one row per sample, one column per remaining element
    flat_nps = [a.reshape(n_samples_per_np, -1) for a in named_np_correct_length.values()]
    stacked_nps = np.column_stack(flat_nps)
    named_shapes = [(name, arr.shape[1:]) for name, arr in named_np_correct_length.items()]
    col_names = [*chain.from_iterable(calc_col_names(named_shape) for named_shape in named_shapes)]
    df = pd.DataFrame(stacked_nps, columns=col_names)
    df = df.convert_dtypes()
    return df


def parse_series_into_np(df, col_name, shp):
    # shp could also be parsed from the column names
    n_samples = len(df)
    # keep the frame's column order: sorting the names as strings would put "x_10" before "x_2"
    col_names = [c for c in df.columns if col_name in c]
    col_names = [c for c in col_names if c.startswith(col_name + "_") or len(col_names) == 1]
    # np.float was removed in NumPy 1.24; the builtin float is equivalent
    col_as_np = df[col_names].astype(float).values.reshape((n_samples, *shp))
    return col_as_np

Usage, to put ndarrays into a DataFrame:

full_rate_df = create_flat_columns_df_from_dict_of_numpy(
    named_np={name: np.array(d[name]) for name in ["name1", "name2"]},
    n_samples_per_np=d["name1"].shape[0]
)

where d is a dict of nd arrays with the same shape[0], keyed by ["name1", "name2"].

The reverse operation can be obtained with parse_series_into_np.
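The idea behind both helpers is just a reshape in each direction. Here is a stripped-down sketch of the round trip for a single array; the name "B" and the data are made up for illustration, and this is not a replacement for the functions above:

```python
import numpy as np
import pandas as pd
from itertools import product

arr = np.arange(10 * 6 * 8, dtype=float).reshape(10, 6, 8)
n_samples, *shape = arr.shape

# One scalar column per trailing index, named "B_<i>_<j>".
# product() varies the last index fastest, matching the row-major reshape below.
col_names = ["_".join(("B", *map(str, idx))) for idx in product(*map(range, shape))]
flat_df = pd.DataFrame(arr.reshape(n_samples, -1), columns=col_names)

# Reverse: select the columns in frame order and restore the trailing shape.
b_cols = [c for c in flat_df.columns if c.startswith("B_")]
restored = flat_df[b_cols].to_numpy().reshape(n_samples, *shape)

print(flat_df.shape)    # (10, 48)
print(restored.shape)   # (10, 6, 8)
```

Every column of the flat frame has a plain numeric dtype, so the usual vectorized pandas operations keep working, which is the main gain over an object column of arrays.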


The accepted answer remains, as it answers the original question, but this one is a much better practice.
