
Fastest way to convert python iterator output to pandas dataframe

I have a generator that returns an unknown number of rows of data that I want to convert to an indexed pandas dataframe. The fastest way I know of is to write a CSV to disk then parse back in via 'read_csv'. I'm aware that it is not efficient to create an empty dataframe then constantly append new rows. I can't create a pre-sized dataframe because I do not know how many rows will be returned. Is there a way to convert the iterator output to a pandas dataframe without writing to disk?

Iteratively appending to a pandas DataFrame is not the best solution. It is better to build your data as a list, then pass it to pd.DataFrame.

import random
import pandas as pd

alpha = list('abcdefghijklmnopqrstuvwxyz')

Here we create a generator, use it to construct a list, then pass it to the dataframe constructor:

%%timeit
gen = ((random.choice(alpha), random.randint(0,100)) for x in range(10000))
my_data = [x for x in gen]
df = pd.DataFrame(my_data, columns=['letter','value'])

# result: 1 loop, best of 3: 373 ms per loop

This is quite a bit faster than creating a generator, constructing an empty dataframe, and appending rows one at a time, as seen here:

%%timeit
gen = ((random.choice(alpha), random.randint(0,100)) for x in range(10000))
df = pd.DataFrame(columns=['letter','value'])
for tup in gen:
    df.loc[df.shape[0],:] = tup

# result: 1 loop, best of 3: 13.6 s per loop

This is incredibly slow at 13 seconds to construct 10000 rows.
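Note that, at least in recent pandas versions, the intermediate list can be skipped entirely: the DataFrame constructor will consume the generator itself. A minimal sketch of that variant:

```python
import random

import pandas as pd

alpha = list('abcdefghijklmnopqrstuvwxyz')

# Pass the generator straight to the constructor; pandas materializes
# it internally, so no explicit list-building step is needed.
gen = ((random.choice(alpha), random.randint(0, 100)) for x in range(10000))
df = pd.DataFrame(gen, columns=['letter', 'value'])
print(df.shape)  # (10000, 2)
```

The timing is essentially the same as the explicit-list version, since pandas still has to materialize the rows; the win is purely in readability.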

Would something general like this do the trick?

import numpy as np

def make_equal_length_cols(df, new_iter, col_name):
    # materialize the generator so we can measure and extend it
    new_iter = list(new_iter)
    # if the passed generator (as a list) has fewer elements than the
    # dataframe, pad it with NaN until the lengths are equal
    if len(new_iter) < df.shape[0]:
        new_iter += [np.nan]*(df.shape[0]-len(new_iter))
    else:
        # otherwise, append n new all-NaN rows to the dataframe, where n is the
        # difference between the number of elements in new_iter and the length
        # of the dataframe
        new_rows = [{c: np.nan for c in df.columns} for _ in range(len(new_iter)-df.shape[0])]
        new_rows_df = pd.DataFrame(new_rows)
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        df = pd.concat([df, new_rows_df]).reset_index(drop=True)
    df[col_name] = new_iter
    return df

Test it out:

make_equal_length_cols(df, (x for x in range(20)), 'new')
Out[22]: 
      A    B  new
0   0.0  0.0    0
1   1.0  1.0    1
2   2.0  2.0    2
3   3.0  3.0    3
4   4.0  4.0    4
5   5.0  5.0    5
6   6.0  6.0    6
7   7.0  7.0    7
8   8.0  8.0    8
9   9.0  9.0    9
10  NaN  NaN   10
11  NaN  NaN   11
12  NaN  NaN   12
13  NaN  NaN   13
14  NaN  NaN   14
15  NaN  NaN   15
16  NaN  NaN   16
17  NaN  NaN   17
18  NaN  NaN   18
19  NaN  NaN   19

And it also works when the passed generator is shorter than the dataframe:

make_equal_length_cols(df, (x for x in range(5)), 'new')
Out[26]: 
   A  B  new
0  0  0  0.0
1  1  1  1.0
2  2  2  2.0
3  3  3  3.0
4  4  4  4.0
5  5  5  NaN
6  6  6  NaN
7  7  7  NaN
8  8  8  NaN
9  9  9  NaN
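For the shorter-generator case alone, index alignment can do the padding for you: assigning a pd.Series to a column aligns on the index and fills the missing rows with NaN. A small sketch (this only covers the case where the generator is shorter than the dataframe, not the longer one):

```python
import pandas as pd

df = pd.DataFrame({'A': range(10), 'B': range(10)})

# A Series assigned as a column aligns on the index, so rows beyond the
# series' length are filled with NaN automatically -- no manual padding.
df['new'] = pd.Series(list(x for x in range(5)))
print(df['new'].isna().sum())  # 5
```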

Edit: removed the row-by-row pandas.DataFrame.append call, and instead constructed a separate dataframe to append in one shot. Timings:

New append:

%timeit make_equal_length_cols(df, (x for x in range(10000)), 'new')
10 loops, best of 3: 40.1 ms per loop

Old append:

very slow...

A pandas DataFrame accepts an iterator as the data source in its constructor. You can generate rows dynamically and feed them to the data frame as you read and transform the source data.

This is most easily done by writing a generator function that uses yield to produce the results.

After the data frame has been generated you can use set_index to choose any column as an index.

Here is an example:

    def create_timeline(self) -> pd.DataFrame:
        """Create a timeline feed how we traded over a course of time.

        Note: We assume each position has only one enter and exit event, not position increases over the lifetime.

        :return: DataFrame with timestamp and timeline_event columns
        """

        # https://stackoverflow.com/questions/42999332/fastest-way-to-convert-python-iterator-output-to-pandas-dataframe
        def gen_events():
            """Generate data for the dataframe.

            Use Python generators to dynamically fill Pandas dataframe.
            Each dataframe gets timestamp, timeline_event columns.
            """
            for pair_id, history in self.asset_histories.items():
                for position in history.positions:
                    open_event = TimelineEvent(
                        pair_id=pair_id,
                        position=position,
                        type=TimelineEventType.open,
                    )
                    yield (position.opened_at, open_event)

                    # If the position is closed, also generate a close event
                    if position.is_closed():
                        close_event = TimelineEvent(
                            pair_id=pair_id,
                            position=position,
                            type=TimelineEventType.close,
                        )
                        yield (position.closed_at, close_event)

        df = pd.DataFrame(gen_events(), columns=["timestamp", "timeline_event"])
        df = df.set_index(["timestamp"])
        return df

The full open source example can be found here.
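Stripped of the domain objects, the same pattern fits in a few self-contained lines. The TimelineEvent below is a stand-in dataclass for illustration, not the original class:

```python
from dataclasses import dataclass

import pandas as pd

@dataclass
class TimelineEvent:
    kind: str

def gen_events():
    # Lazily yield (timestamp, event) tuples, as in the method above
    for ts in range(3):
        yield (ts, TimelineEvent(kind="open"))

# The generator feeds the constructor directly; set_index then
# promotes the timestamp column to the index.
df = pd.DataFrame(gen_events(), columns=["timestamp", "timeline_event"])
df = df.set_index("timestamp")
print(len(df))  # 3
```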
