How do I iterate through a Python data frame where I occasionally need to replace two rows with one row, in-place?
The "in-place" part is what I struggle with; creating a brand-new data frame is solved (code provided at the end). My specific issue is that my imported data occasionally splits one column's string into two substrings, placing the first substring on one row alongside its other columns of data, and placing the second substring on the following row with NaN values for its other columns.
This is what the data frame should look like:
Actor Color Number
0 Amy Adams red 1
1 Bill Burr orange 2
2 Courtney Cox yellow 3
3 Danny DeVito green 4
4 Emilio Estevez blue 5
This is what my imported data frame initially looks like, where "Courtney Cox" and "Emilio Estevez" have each been split across two rows. I've provided the code to create this data frame. (Don't worry about the shift from integer to float; it's irrelevant here.)
Actor Color Number
0 Amy Adams red 1.0
1 Bill Burr orange 2.0
2 Courtney yellow 3.0
3 Cox NaN NaN
4 Danny DeVito green 4.0
5 Emilio blue 5.0
6 Estevez NaN NaN
import pandas as pd
import numpy as np

bad_df = pd.DataFrame({'Actor': ['Amy Adams','Bill Burr','Courtney','Cox','Danny DeVito','Emilio','Estevez'],
                       'Color': ['red','orange','yellow',np.nan,'green','blue',np.nan],
                       'Number': [1,2,3,np.nan,4,5,np.nan]})
I do have access to the correct list for the Actor column.
actor_list = ['Amy Adams','Bill Burr','Courtney Cox','Danny DeVito','Emilio Estevez']
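(For what it's worth, `isin` against that list flags exactly the broken rows, since any Actor value that isn't a complete name from actor_list must be half of a split name. This snippet is just my illustration of the problem, not part of the fix:)

```python
import pandas as pd
import numpy as np

bad_df = pd.DataFrame({'Actor': ['Amy Adams','Bill Burr','Courtney','Cox','Danny DeVito','Emilio','Estevez'],
                       'Color': ['red','orange','yellow',np.nan,'green','blue',np.nan],
                       'Number': [1,2,3,np.nan,4,5,np.nan]})
actor_list = ['Amy Adams','Bill Burr','Courtney Cox','Danny DeVito','Emilio Estevez']

# Any Actor value that is not a complete name from the list is a fragment
fragments = ~bad_df['Actor'].isin(actor_list)
print(bad_df.loc[fragments, 'Actor'].tolist())  # ['Courtney', 'Cox', 'Emilio', 'Estevez']
```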
My data frames are actually pretty small, so copying the data frame or creating a separate data frame isn't a problem, but it seems like I should be able to perform my fix in-place.
Here's my current approach (iteratively building a new data frame), but it seems sloppy. I iterate through a zip where each element consists of a row's index, that row's Actor string, and the next row's Actor string. However, I have to handle the last row outside the loop so I don't look for a "next row" that doesn't exist.
new_df = pd.DataFrame()
for a1idx, a1, a2 in zip(bad_df.iloc[:-1, 0].index, bad_df.iloc[:-1, 0], bad_df.iloc[1:, 0]):
    if a1 in actor_list:  # First and last name are in this row
        new_df = pd.concat([new_df, bad_df.iloc[[a1idx], :]])  # Add row
    elif a1 + ' ' + a2 in actor_list:  # First and last name are in consecutive rows
        new_df = pd.concat([new_df, bad_df.iloc[[a1idx], :]])  # Add row
        new_df.iloc[-1, 0] = a1 + ' ' + a2  # Correct the name in that row
    # If neither condition is met, we're (inefficiently) looking at a row holding
    # just a last name, which was dealt with in the previous iteration
if bad_df.iloc[-1, 0] in actor_list:  # Check the very last row of the data frame
    new_df = pd.concat([new_df, bad_df.iloc[[-1], :]])  # Add row
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
Is there a way to do this in-place? Would that be a better way?
import pandas as pd
import numpy as np

bad_df = pd.DataFrame({'Actor': ['Amy Adams','Bill Burr','Courtney','Cox','Danny DeVito','Emilio','Estevez'],
                       'Color': ['red','orange','yellow',np.nan,'green','blue',np.nan],
                       'Number': [1,2,3,np.nan,4,5,np.nan]})
actor_list = ['Amy Adams','Bill Burr','Courtney Cox','Danny DeVito','Emilio Estevez']
nan_index = bad_df['Color'].isna()                       # rows holding only a last name
bad_df.loc[nan_index, 'last_names'] = bad_df['Actor'][nan_index]
bad_df['last_names'] = bad_df['last_names'].shift(-1)    # move each last name up one row
mask = pd.Series(nan_index).shift(-1, fill_value=False)  # rows holding only a first name
bad_df.loc[mask, 'Actor'] = bad_df['Actor'].str.cat(bad_df['last_names'], sep=' ')
bad_df.drop('last_names', axis=1, inplace=True)
bad_df = bad_df[~nan_index]                              # discard the leftover last-name rows
print(bad_df)
Output:
Actor Color Number
0 Amy Adams red 1.0
1 Bill Burr orange 2.0
2 Courtney Cox yellow 3.0
4 Danny DeVito green 4.0
5 Emilio Estevez blue 5.0
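For comparison, here is a sketch of a fully in-place variant of the same shift idea: it mutates bad_df with `drop(..., inplace=True)` instead of boolean-filtering into a new frame. (This is just an illustration, not necessarily better.)

```python
import pandas as pd
import numpy as np

bad_df = pd.DataFrame({'Actor': ['Amy Adams','Bill Burr','Courtney','Cox','Danny DeVito','Emilio','Estevez'],
                       'Color': ['red','orange','yellow',np.nan,'green','blue',np.nan],
                       'Number': [1,2,3,np.nan,4,5,np.nan]})

# Fragment rows are the ones whose other columns are all NaN
is_fragment = bad_df['Color'].isna()
# The row directly above each fragment holds the matching first name
first_half = is_fragment.shift(-1, fill_value=False)
# Glue each fragment onto the Actor value of the row above it
bad_df.loc[first_half, 'Actor'] += ' ' + bad_df['Actor'].shift(-1)
# Drop the fragment rows without creating a new frame
bad_df.drop(bad_df.index[is_fragment], inplace=True)
print(bad_df)
```

Like the approach above, this keeps the original index (0, 1, 2, 4, 5); a final `bad_df.reset_index(drop=True, inplace=True)` tidies that up if needed.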