Python pandas appending data to next column based on conditions

I'm working on a dataframe with around 200k records that looks like this (information replaced with random text):

ID                Description         
1                 Eg.1
2                 Desc.2
3                 Desc.3
80                 
aaa
output
500                
c                   
d
e
f
input
100              Desc.100
200              Desc.200

I have set it up in a pandas dataframe and was thinking I could do something like:

for x in df['ID'] :
    if type(df['ID'][x]) == str:
        df['Description'][x-1] += ' ' + df['ID'][x].values       

The idea was to append the stray text in ID to the description of the row above (below is the outcome I want to get):

ID                Description         
1                 Eg.1
2                 Desc.2
3                 Desc.3
80                aaa output
500               c d e f input         
100               Desc.100

Only numeric values are kept in the ID column, and all stray text is appended to the description of the preceding valid ID. (Another issue is that the number of stray text rows under an ID ranges from 1 to 10 in some cases.)

I'm a bit stuck, since x in the above code is the value found in df['ID'] (a string) rather than a row position. Any thoughts on how this could be accomplished in a relatively fast way across the 200k+ records?

Thanks!

Here's an idea on how to do it in pandas:

I read your example from the clipboard:

import pandas as pd
import numpy as np
df = pd.read_clipboard()

First, I copied the string IDs into Description for the rows where ID is a string, because that is where they belong. I use str(x).isnumeric() so that every cell is treated as a string, even if it isn't one. If some cells were imported as numbers and others as strings, calling .isnumeric() directly would raise an error on the number-typed cells.

df.loc[df['ID'].apply(lambda x: not str(x).isnumeric()), 'Description'] = df['ID']

Then I emptied the ID in those rows only:

df.loc[df['ID'].apply(lambda x: not str(x).isnumeric()), 'ID'] = np.nan

I filled the now-empty IDs with the ID from the previous row:

df['ID'] = df['ID'].ffill()

The first row of each of these groups (the numeric ID row itself) still has an empty description, so I drop it and group the rest:

df_result = df.dropna().groupby('ID', sort=False).aggregate(lambda x: ' '.join(x))

print (df_result)
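
Since the clipboard step cannot be reproduced from the post alone, here is a minimal self-contained sketch of the same steps. The sample frame is made up for illustration (it is not the asker's data), and it deliberately mixes int and str cells in ID to show why the test goes through str(x):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: numeric IDs arrive as ints, stray text as strings.
df = pd.DataFrame({
    'ID': [1, 2, 80, 'aaa', 'output', 500, 'c', 'd', 100],
    'Description': ['Eg.1', 'Desc.2', np.nan, np.nan, np.nan,
                    np.nan, np.nan, np.nan, 'Desc.100'],
})

# str(x) makes the test safe for both int and str cells.
is_text = df['ID'].apply(lambda x: not str(x).isnumeric())

df.loc[is_text, 'Description'] = df.loc[is_text, 'ID']  # move text into Description
df.loc[is_text, 'ID'] = np.nan                          # blank the misplaced IDs
df['ID'] = df['ID'].ffill()                             # inherit the previous numeric ID

# Rows 80 and 500 have no description of their own, so dropna() removes
# them and only the text rows remain in those groups.
result = (df.dropna()
            .groupby('ID', sort=False)['Description']
            .agg(' '.join))
print(result)
```

Note that dropna() would also discard a numeric row that has no description and no text rows after it, so this sketch assumes every empty description is followed by at least one text row.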

Something to consider: if the broken data is not yet in a dataframe but in a file, I'd probably write code that goes through the file line by line and writes the fixed lines into a corrected file. This would not require all 200k lines to be in memory at the same time, and it would make the process easier, because you only have to run the fix once.
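
A rough sketch of that line-by-line idea is below. The file layout is an assumption (whitespace-separated, with the ID as the first token of a valid row), and fix_lines is a made-up helper name, not something from pandas:

```python
from typing import Iterable, Iterator, Tuple

def fix_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (id, description) pairs, folding non-numeric lines into the previous ID."""
    current_id = None
    parts = []
    for raw in lines:
        tokens = raw.split(maxsplit=1)
        if not tokens:
            continue  # skip blank lines
        if tokens[0].isdigit():
            # a new record starts: emit the one collected so far
            if current_id is not None:
                yield current_id, ' '.join(parts)
            current_id = tokens[0]
            parts = [tokens[1].strip()] if len(tokens) > 1 else []
        elif current_id is not None:
            # stray text: it belongs to the description of the last numeric ID
            parts.append(raw.strip())
    if current_id is not None:
        yield current_id, ' '.join(parts)

# Hypothetical sample lines standing in for the file contents.
sample = ["1 Eg.1", "80", "aaa", "output", "500", "c", "d", "100 Desc.100"]
for record in fix_lines(sample):
    print(record)
```

Because the function only needs an iterable of strings, an open file handle can be passed in place of the sample list, and only one record is held in memory at a time. Text that appears before the first numeric ID (such as a header line) is skipped.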

You can keep only numeric values in 'ID' by assigning the non-numeric ID text to the description. Then forward-fill the ID, group by it, and join the descriptions.

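# move non-numeric IDs into Description; numeric rows keep their own description
df['Description'] = df.apply(lambda x: x['Description'] if x['ID'].isdigit() else x['ID'], axis=1).fillna('')
# blank out non-numeric IDs, then forward-fill from the last numeric ID
df['ID'] = df['ID'].apply(lambda x: x if x.isdigit() else np.nan).ffill()
# join each group's descriptions; strip() drops the space left by an empty first description
df = df.groupby('ID', sort=False)['Description'].apply(lambda x: ' '.join(x).strip()).reset_index()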

Out:

   ID   Description
0   1   Eg.1
1   2   Desc.2
2   3   Desc.3
3   80  aaa output
4   500 c d e f input
5   100 Desc.100
6   200 Desc.200
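
With 200k rows, the row-wise apply calls may become the slow part. As a hedged alternative, the same idea can probably be written with vectorized string methods and Series.where. The sample frame is illustrative and assumes every ID is already a string:

```python
import pandas as pd

# Hypothetical sample; all IDs are strings here.
df = pd.DataFrame({
    'ID': ['1', '80', 'aaa', 'output', '500', 'c', '100'],
    'Description': ['Eg.1', None, None, None, None, None, 'Desc.100'],
})

is_num = df['ID'].str.isdigit()

# Text rows take their ID as description; numeric rows keep their own (or '').
desc = df['Description'].where(is_num, df['ID']).fillna('')
# Non-numeric IDs become NaN and inherit the last numeric ID above them.
ids = df['ID'].where(is_num).ffill()

out = (desc.groupby(ids, sort=False)
           .agg(lambda s: ' '.join(s).strip())
           .rename_axis('ID')
           .reset_index(name='Description'))
print(out)
```

The strip() call removes the leading space that would otherwise come from joining an empty description with the text rows that follow it.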

This uses numpy almost exclusively. It is faster than the pandas groupby methods, even though the code is longer. Repeated numeric values in the ID column are OK (as the code stands, all numeric rows are returned whether or not they are duplicated).

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3', '80', 'aaa',
                           'output', '500', 'c', 'd',
                           'e', 'f', 'input', '100', '200'],
                   'Description': ['Eg.1', 'Desc.2', 'Desc.3',
                                   '', '', '', '', '', '', '',
                                   '', '', 'Desc.100', 'Desc.200']})

IDs = df.ID.values

# numeric test function for ID column
def isnumeric(s):
    try:
        float(s)
        return 1
    except ValueError:
        return 0

# find the rows which are numeric and mark with 1 (vs 0)
nums = np.frompyfunc(isnumeric, 1, 1)(IDs).astype(int)

# make another array, which marks
# str IDs with a 1 (opposite of nums)
strs = 1 - nums

# make arrays to hold shifted arrays of strs and nums
nums_copy = np.empty_like(nums)
strs_copy = np.empty_like(strs)

# make an array of nums shifted fwd 1
nums_copy[0] = 1
nums_copy[1:] = nums[:-1]

# make an array of strs shifted back 1
strs_copy[-1] = 0
strs_copy[:-1] = strs[1:]

# make arrays to detect where str and num
# ID segments begin and end
str_idx = strs + nums_copy
num_idx = nums + strs_copy

# find indexes of start and end of ID str segments
starts = np.where(str_idx == 2)[0]
ends = np.where(str_idx == 0)[0]

# make a continuous array of IDs which
# were marked as strings
txt = IDs[np.where(strs)[0]]
# split that array into string segments which will
# become a combined string row value
txt_arrs = np.split(txt, np.cumsum(ends - starts)[:-1])
# join the string segment arrays
txt_arrs = [' '.join(x) for x in txt_arrs]

# find the row indexes which will contain combined strings
combo_str_locs = np.where(num_idx == 2)[0][:len(txt_arrs)]
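# put the combined strings into the Description column
# at the proper indexes (np.put does not accept a pandas
# Series in recent pandas versions, so assign positionally)
df.iloc[combo_str_locs, df.columns.get_loc('Description')] = txt_arrs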
# slice the original dataframe to retain only numeric
# ID rows
df = df.iloc[np.where(nums == 1)[0]]

# If a new index is desired >> df.reset_index(inplace=True, drop=True) 
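
The remark about repeated IDs matters for any groupby keyed on the ID value, since two rows with the same ID would be merged into one group. A possible pandas workaround, sketched here on made-up data, is to group by a running count of numeric rows instead. It assumes string IDs and that a numeric row followed by text rows has an empty description:

```python
import pandas as pd

# Hypothetical sample in which the ID '7' occurs twice.
df = pd.DataFrame({
    'ID': ['7', 'a', 'b', '7', 'c', '9'],
    'Description': ['', '', '', '', '', 'Desc.9'],
})

is_num = df['ID'].str.isdigit()
# Running count of numeric rows: every numeric row opens a new block,
# so two rows with the same ID value still land in different blocks.
block = is_num.cumsum()

# Text collected per block (blocks without text rows are simply absent).
text = df.loc[~is_num, 'ID'].groupby(block[~is_num]).agg(' '.join)

out = df[is_num].copy()
# Keep an existing description; otherwise use the text gathered for the block.
out['Description'] = out['Description'].mask(
    out['Description'].eq(''), block[is_num].map(text)
).fillna('')
print(out)
```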

Another approach is shown below.

Input data:

import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': ['1', '2', '3', '80', 'aaa', 'output', '500', 'c', 'd', 'e', 'f', 'input', '100', '200'],
                   'Description': ['Eg.1', 'Desc.2', 'Desc.3', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'Desc.100', 'Desc.200']})

Logic to process the dataframe and get the desired result:

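# flag rows whose ID is numeric
df['IsDigit'] = df['ID'].str.isdigit()
# label consecutive runs of numeric / non-numeric rows
df['Group'] = df['IsDigit'].ne(df['IsDigit'].shift()).cumsum()
# join the text of each non-numeric run into one string
dfG = df[df['IsDigit'] == False].groupby(['Group'])['ID'].apply(lambda x: ' '.join(x))
# keep only the numeric rows
df = df.drop(df[df['IsDigit'] == False].index)
# a numeric row with no description takes the text of the run that follows it
df.loc[df['Description'].isna(), 'Description'] = df[df['Description'].isna()].apply(lambda x: dfG[x['Group'] + 1], axis=1)
df = df.drop(columns=['IsDigit', 'Group']).set_index('ID')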

It produces the following output:

       Description
ID                
1             Eg.1
2           Desc.2
3           Desc.3
80      aaa output
500  c d e f input
100       Desc.100
200       Desc.200
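
The Group column in this answer relies on a common run-labelling idiom: compare each value with the one above it, then take a cumulative sum of the change points. A tiny illustration on a made-up boolean Series:

```python
import pandas as pd

flags = pd.Series([True, True, False, False, True, False, True])

# True wherever the value differs from the row above
# (the first row always differs, because shift() puts NaN there).
changed = flags.ne(flags.shift())
# The cumulative sum of the change points numbers the runs 1, 2, 3, ...
run_id = changed.cumsum()
print(run_id.tolist())  # [1, 1, 2, 2, 3, 4, 5]
```

Because runs of numeric and non-numeric rows alternate, the text that follows a numeric row in run g sits in run g + 1, which is why the lookup uses x['Group'] + 1.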

