
How do I iterate through a Python data frame where I occasionally need to replace two rows with one row, in-place?

The "in-place" part is what I struggle with; creating a brand new data frame is solved (my code is at the end). My specific issue is that my imported data occasionally splits one column's string into two substrings. The first substring stays on its row with that row's other columns, and the second substring lands on the following row with NaN in every other column.

This is what the data frame should look like:

            Actor   Color  Number
0       Amy Adams     red       1
1       Bill Burr  orange       2
2    Courtney Cox  yellow       3
3    Danny DeVito   green       4
4  Emilio Estevez    blue       5

This is what my imported data frame initially looks like, where "Courtney Cox" and "Emilio Estevez" have each been split across two rows. The code to create this data frame follows the table. (Don't worry about the shift from integer to float; it's irrelevant.)

          Actor   Color  Number
0     Amy Adams     red     1.0
1     Bill Burr  orange     2.0
2      Courtney  yellow     3.0
3           Cox     NaN     NaN
4  Danny DeVito   green     4.0
5        Emilio    blue     5.0
6       Estevez     NaN     NaN

bad_df = pd.DataFrame({'Actor': ['Amy Adams','Bill Burr','Courtney','Cox','Danny DeVito','Emilio','Estevez'],
                       'Color':['red','orange','yellow',np.nan,'green','blue',np.nan],
                       'Number':[1,2,3,np.nan,4,5,np.nan]})

I do have access to the correct list for the Actor column.

actor_list = ['Amy Adams','Bill Burr','Courtney Cox','Danny DeVito','Emilio Estevez']

My data frames are actually pretty small, so copying the data frame or creating a separate one isn't a problem, but it seems like I should be able to perform my fix in-place.

Here's my current approach (iteratively creating a new data frame), but it seems sloppy. I iterate through a zip where each element consists of a row's index, that row's Actor string, and the next row's Actor string. However, I have to handle the last row outside of the loop so I don't look for a "next row" that doesn't exist.

# DataFrame.append was removed in pandas 2.0, so collect the rows in a list
# and build the new data frame once at the end.
rows = []
for a1idx, a1, a2 in zip(bad_df.index[:-1], bad_df.iloc[:-1,0], bad_df.iloc[1:,0]):
    if a1 in actor_list: # First and last name are in this row
        rows.append(bad_df.iloc[a1idx,:]) # Add row
    elif a1 + ' ' + a2 in actor_list: # First and last name are in consecutive rows
        row = bad_df.iloc[a1idx,:].copy() # Copy so bad_df is left untouched
        row.iloc[0] = a1 + ' ' + a2 # Correct name in row
        rows.append(row) # Add row
    # If neither of the above conditions is met, we're inefficiently looking at
    # a row with just a last name, which was dealt with in the previous iteration
if bad_df.iloc[-1,0] in actor_list: # Check very last row of data frame
    rows.append(bad_df.iloc[-1,:]) # Add row
new_df = pd.DataFrame(rows)
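
The pairing of each row with its successor, described above, can also be expressed without special-casing the last row. The following is only a sketch under the question's assumptions (a name is split across at most two consecutive rows, and `actor_list` is complete); the names `next_actor`, `joined`, `whole`, and `split_start` are made up for illustration. `shift(-1)` lines each Actor value up with the one below it, and the last row simply pairs with NaN.

```python
import numpy as np
import pandas as pd

bad_df = pd.DataFrame({'Actor': ['Amy Adams','Bill Burr','Courtney','Cox','Danny DeVito','Emilio','Estevez'],
                       'Color': ['red','orange','yellow',np.nan,'green','blue',np.nan],
                       'Number': [1,2,3,np.nan,4,5,np.nan]})
actor_list = ['Amy Adams','Bill Burr','Courtney Cox','Danny DeVito','Emilio Estevez']

# Actor value of the following row; the last row gets NaN instead of raising.
next_actor = bad_df['Actor'].shift(-1)

# Candidate full name for every row: this row's text plus the next row's text.
# str.cat leaves NaN wherever either side is missing.
joined = bad_df['Actor'].str.cat(next_actor, sep=' ')

# Rows whose Actor value is already a complete, known name.
whole = bad_df['Actor'].isin(actor_list)

# Rows holding the first half of a name that continues on the next row.
split_start = ~whole & joined.isin(actor_list)

print(pd.DataFrame({'Actor': bad_df['Actor'], 'whole': whole, 'split_start': split_start}))
```

Rows that are neither `whole` nor `split_start` are the leftover last-name rows.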

Is there a way to do this in-place?

Would this be a better way? It treats every row with NaN in Color as a leftover last name and joins it onto the row above.

import numpy as np
import pandas as pd

bad_df = pd.DataFrame({'Actor': ['Amy Adams','Bill Burr','Courtney','Cox','Danny DeVito','Emilio','Estevez'],
                       'Color':['red','orange','yellow',np.nan,'green','blue',np.nan],
                       'Number':[1,2,3,np.nan,4,5,np.nan]})

actor_list = ['Amy Adams','Bill Burr','Courtney Cox','Danny DeVito','Emilio Estevez']

nan_index = bad_df['Color'].isna()
bad_df.loc[nan_index, 'last_names'] = bad_df['Actor'][nan_index]
bad_df['last_names'] = bad_df['last_names'].shift(-1)
mask = pd.Series(nan_index).shift(-1, fill_value=False)
bad_df.loc[mask, 'Actor'] = bad_df['Actor'].str.cat(bad_df['last_names'], sep=' ')
bad_df.drop('last_names', axis=1, inplace=True)
bad_df = bad_df[~nan_index]

print(bad_df)

Output:

            Actor   Color  Number
0       Amy Adams     red     1.0
1       Bill Burr  orange     2.0
2    Courtney Cox  yellow     3.0
4    Danny DeVito   green     4.0
5  Emilio Estevez    blue     5.0
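
The final filtering step above rebinds `bad_df` to a new object and leaves gaps in the index (0, 1, 2, 4, 5). If mutating the existing object matters, a variant along the following lines might work. It is a sketch that reuses the same assumption (NaN in Color marks a leftover row); `is_fragment`, `has_fragment_below`, and `alias` are illustrative names. Note that `inplace=True` changes the object the name refers to, but pandas may still allocate new data internally, so this is about object identity rather than memory savings.

```python
import numpy as np
import pandas as pd

bad_df = pd.DataFrame({'Actor': ['Amy Adams','Bill Burr','Courtney','Cox','Danny DeVito','Emilio','Estevez'],
                       'Color': ['red','orange','yellow',np.nan,'green','blue',np.nan],
                       'Number': [1,2,3,np.nan,4,5,np.nan]})
actor_list = ['Amy Adams','Bill Burr','Courtney Cox','Danny DeVito','Emilio Estevez']

alias = bad_df  # second reference, to show that the same object is modified

# Assumption: a row with NaN in Color holds only the tail of the name above it.
is_fragment = bad_df['Color'].isna()
has_fragment_below = is_fragment.shift(-1, fill_value=False)

# Join each Actor value with the one below it, then keep only the joins we need.
joined = bad_df['Actor'].str.cat(bad_df['Actor'].shift(-1), sep=' ')
bad_df.loc[has_fragment_below, 'Actor'] = joined[has_fragment_below]

# Remove the leftover rows and renumber, without rebinding the name bad_df.
bad_df.drop(index=bad_df.index[is_fragment.to_numpy()], inplace=True)
bad_df.reset_index(drop=True, inplace=True)

print(alias)  # the alias sees the repaired rows as well
```

Since the correct names are available, comparing `bad_df['Actor'].tolist()` with `actor_list` afterwards is a cheap sanity check.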

