
In a pandas DataFrame, how to split a list into multiple lists based on string prefix while preserving order? [Python]

I have a DataFrame in the following format:

   Name  |   Events  
   ID1      [Event C, Loop 1 - A, Loop 1 - B, Loop 2 - A , Loop 2 - B]
   ID2      [Loop 1 - A, Event C, Loop 1 - D, Loop 2 - A , Loop 2 - E, Loop 2 - C, Loop 3 - A, Loop 3 - B]
    ...       ....

I want to split each row into several rows, one per Loop prefix. Events with no Loop prefix should be kept in every new row, and the original order must be preserved within each new row:

   Name  |   Events  
   ID1      [Event C, Loop 1 - A, Loop 1 - B]
   ID1      [Event C, Loop 2 - A , Loop 2 - B]
   ID2      [Loop 1 - A, Event C, Loop 1 - D]
   ID2      [Event C, Loop 2 - A , Loop 2 - E, Loop 2 - C]
   ID2      [Event C, Loop 3 - A, Loop 3 - B]
    ...       ....

Is there any smart way of doing this?

  1. Start by calling explode() on the list column.
  2. Then pull the event and the loop prefix out into temporary columns.
  3. Reconstruct the lists using groupby() and agg().
import numpy as np
import pandas as pd

df = pd.DataFrame([{'Name': 'ID1',
  'Events': ['Event C',
    'Loop 1 - A',
    'Loop 1 - B',
    'Loop 2 - A',
    'Loop 2 - B']},
 {'Name': 'ID2',
  'Events': ['Loop 1 - A',
    'Event C',
    'Loop 1 - D',
    'Loop 2 - A',
    'Loop 2 - E',
    'Loop 2 - C',
    'Loop 3 - A',
    'Loop 3 - B']}])

# start by exploding the list so that each event gets its own row ...
df2 = (df.explode("Events").assign(
    # derive a column that holds the event on "Event" rows and NaN elsewhere
    e=lambda dfa: np.where(dfa["Events"].str.contains("Event"), dfa["Events"], np.nan),
    # use a regex to get the "Loop n" part of the string (NaN if there is none)
    l=lambda dfa: dfa["Events"].str.extract(r"^(\w+ \d+)", expand=False)
).assign(
    # forward-fill the event onto the loop rows that follow it
    # (note: the fill also crosses Name boundaries, so a loop row that comes
    # before the first event of its Name inherits the previous Name's event)
    e=lambda dfa: dfa["e"].ffill(),
)
    # get rid of rows where "l" has no value (the plain event rows)
    .dropna()
    # now recreate the lists - groupby sorts the group keys, but the row
    # order within each group is preserved
    .groupby(["Name", "e", "l"]).agg({"Events": list})
    .reset_index()
    # put the event back at the front of each list
    .assign(Events=lambda dfa: dfa.apply(lambda r: [r["e"]] + r["Events"], axis=1))
    # cleanup temp columns
    .drop(columns=["e", "l"])
)

output

Name                                         Events
 ID1              [Event C, Loop 1 - A, Loop 1 - B]
 ID1              [Event C, Loop 2 - A, Loop 2 - B]
 ID2              [Event C, Loop 1 - A, Loop 1 - D]
 ID2  [Event C, Loop 2 - A, Loop 2 - E, Loop 2 - C]
 ID2              [Event C, Loop 3 - A, Loop 3 - B]
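
Note that in the output above the event is always placed at the front of the list, so the first ID2 row reads [Event C, Loop 1 - A, Loop 1 - D], whereas the question asks for [Loop 1 - A, Event C, Loop 1 - D]. If the exact original positions matter, one possible alternative is to split each list in plain Python and only then explode. The following is a minimal sketch, not part of the original answer; the names `LOOP_RE` and `split_by_loop` are hypothetical, and it assumes that every looped event starts with "Loop" followed by a number.

```python
import re

import pandas as pd

# Assumption: a loop prefix is the word "Loop" followed by one or more digits.
LOOP_RE = re.compile(r"^(Loop \d+)\b")


def loop_prefix(event):
    """Return the "Loop n" prefix of an event, or None if it has none."""
    m = LOOP_RE.match(event)
    return m.group(1) if m else None


def split_by_loop(events):
    """Return one list per loop prefix; unprefixed events stay in place."""
    # collect the distinct prefixes in order of first appearance
    prefixes = []
    for ev in events:
        p = loop_prefix(ev)
        if p is not None and p not in prefixes:
            prefixes.append(p)
    if not prefixes:
        return [list(events)]
    # for each prefix keep, in original order, its own events
    # plus every event that has no loop prefix at all
    return [[ev for ev in events if loop_prefix(ev) in (None, p)]
            for p in prefixes]


df = pd.DataFrame({
    "Name": ["ID1", "ID2"],
    "Events": [
        ["Event C", "Loop 1 - A", "Loop 1 - B", "Loop 2 - A", "Loop 2 - B"],
        ["Loop 1 - A", "Event C", "Loop 1 - D", "Loop 2 - A", "Loop 2 - E",
         "Loop 2 - C", "Loop 3 - A", "Loop 3 - B"],
    ],
})

# each cell becomes a list of lists; explode() unpacks only the outer level
out = (df.assign(Events=df["Events"].map(split_by_loop))
         .explode("Events", ignore_index=True))
print(out)
```

Because the splitting happens per row before explode(), nothing is carried over from one Name to the next, and multi-digit loop numbers such as "Loop 10" are treated as their own prefix.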


 