
Merge Rows of Dataframe based on condition

I have a csv file with only one column, "notes". I want to merge rows of the data-frame based on some condition.

import pandas as pd

Input_data={'notes':
            ['aaa','bbb','*','hello','**','my name','is xyz',
             '(1)','this is','temp','name',
             '(2)','BTW','how to','solve this',
             '(3)','with python','I don’t want this to be added ',
             'I don’t want this to be added ']}

df_in = pd.DataFrame(Input_data) 

The input looks like this:

[image: sample input]

Desired output:

output_Data={'notes':
             ['aaa','bbb','*hello','**my name is xyz',
              '(1) this is temp name',
              '(2) BTW how to solve this',
              '(3) with python','I don’t want this to be added ',
              'I don’t want this to be added ']}
df_out=pd.DataFrame(output_Data) 

Each row that contains either "*" or "(number)" should be merged with the rows below it. So the output will look like:

[image: output snapshot]

Other rows which cannot be merged should be left as they are. Also, for the last marker row there is no proper way to know how far the merge should extend, so let's say only the one next row is added. I solved this, but my solution is very long. Is there a simpler way?

df=pd.DataFrame(Input_data)
notes=[];temp=[];flag='';value='';c=0;chk_star='yes'
for i,row in df.iterrows():
    row[0]=str(row[0])
    if '*' in row[0].strip()[:5] and chk_star=='yes':   
        value=row[0].strip()
        temp=temp+[value]
        value=''
        continue

    if '(' in row[0].strip()[:5]:
        chk_star='no'
        temp=temp+[value]
        value='';c=0
        flag='continue'
        value=row[0].strip()
    if flag=='continue' and '(' not in row[0][:5] : 
        value=value+row[0]
        c=c+1
    if c>4:
        temp=temp+[value] 
        print("111", value, temp)
        break
if '' in temp:
    temp.remove('')
df=pd.DataFrame({'notes':temp})     

You can use a mask to avoid looping over every row:

import pandas as pd

df = pd.DataFrame({'row':['aaa','bbb','*','hello','**','my name','is xyz',
         '(1)','this is ','temp ','name',
         '(2)','BTW ','how to ','solve this',
         '(3)','with python ','I don’t want this to be added ',
         'I don’t want this to be added ']})

special = ['*', '**']
for i in range(11):
    special.append('({})'.format(i))

# We find the indexes where we will have to merge
index_to_merge = df[df['row'].isin(special)].index.values
for idx, val in enumerate(index_to_merge):
    if idx != len(index_to_merge)-1:
        # Merge every row between this marker and the next one
        df.loc[val, 'row'] += ' ' + df.loc[val+1:index_to_merge[idx+1]-1, 'row'].values.sum()
    else:
        # Last marker: merge every row that follows it
        df.loc[val, 'row'] += ' ' + df.loc[val+1:, 'row'].values.sum()

# We keep only the marker rows (this also drops 'aaa' and 'bbb', which precede the first marker)
df = df.drop([x for x in range(len(df)) if x not in index_to_merge])

Out:

        row
2   * hello
4   ** my nameis xyz
7   (1) this is temp name
11  (2) BTW how to solve this
15  (3) with python I don’t want this to be added ..
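The output above no longer contains 'aaa' and 'bbb', because the final drop removes every row that is not a marker. Below is a minimal sketch of one way to keep the rows that precede the first marker. It assumes a default integer index and at least one marker row, and it uses a shortened sample; the names `marker_idx`, `absorbed` and `result` are illustrative and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'row': ['aaa', 'bbb', '*', 'hello', '**', 'my name', 'is xyz']})
special = ['*', '**']

# Index labels of the marker rows
marker_idx = df.index[df['row'].isin(special)]
first_marker = marker_idx[0]  # assumption: at least one marker exists

# Rows that get absorbed into a marker: everything after the first
# marker that is not itself a marker
absorbed = [i for i in df.index if i > first_marker and i not in marker_idx]

for pos, start in enumerate(marker_idx):
    # The block ends just before the next marker, or at the end of the frame
    end = marker_idx[pos + 1] if pos + 1 < len(marker_idx) else len(df)
    pieces = df.loc[start + 1:end - 1, 'row'].tolist()
    df.loc[start, 'row'] = ' '.join([df.loc[start, 'row']] + pieces)

result = df.drop(absorbed)
print(result['row'].tolist())
```

Using `' '.join` instead of summing the values also avoids a failure when two markers are adjacent and there is nothing to merge between them.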

You could also convert your column into a numpy array and use numpy functions to simplify what you did. First you can use np.where and np.isin to find the indexes where you will have to merge. That way you don't have to iterate over your whole array using a for loop.

Then you can do the merges on the corresponding indexes. Finally, you can delete the values that have been merged. Here is what it could look like:

import numpy as np

list_to_merge = np.array(['aaa','bbb','*','hello','**','my name','is xyz',
             '(1)','this is','temp','name',
             '(2)','BTW','how to','solve this',
             '(3)','with python','I don’t want this to be added ',
             'I don’t want this to be added '])
special = ['*', '**']
for i in range(11):
    special.append('({})'.format(i))

ix = np.isin(list_to_merge, special)
rows_to_merge = np.where(ix)[0]

# We merge each marker row with the single row that follows it
for index_to_merge in rows_to_merge:
    # Check that we are not trying to merge with an out-of-bounds value
    if index_to_merge!=len(list_to_merge)-1:
        list_to_merge[index_to_merge] = list_to_merge[index_to_merge] + ' ' + list_to_merge[index_to_merge+1]

# We delete the rows that have just been used to merge
# (skipping any index past the end, in case the last row is a marker):
rows_to_delete = rows_to_merge +1
rows_to_delete = rows_to_delete[rows_to_delete < len(list_to_merge)]
list_to_merge = np.delete(list_to_merge, rows_to_delete)

Out:

['aaa', 'bbb', '* hello', '** my name', 'is xyz', '(1) this is',
       'temp', 'name', '(2) BTW', 'how to', 'solve this',
       '(3) with python', 'I don’t want this to be added ',
       'I don’t want this to be added ']
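The numpy version above merges only the single row that follows each marker. If every row up to the next marker should be merged instead, one possible sketch is to split the array at the marker positions with np.split and join each chunk. This is an illustration on a shortened sample, not the answer's code; the names `arr`, `chunks` and `merged` are assumptions, and note that the last chunk absorbs everything after the last marker.

```python
import numpy as np

rows = ['aaa', 'bbb', '*', 'hello', '**', 'my name', 'is xyz',
        '(1)', 'this is', 'temp', 'name']
special = ['*', '**'] + ['({})'.format(i) for i in range(11)]

arr = np.array(rows)
# Positions of the marker rows
marker_pos = np.where(np.isin(arr, special))[0]

# Split so that every chunk after the first starts with a marker
chunks = np.split(arr, marker_pos)

# The first chunk holds the rows before any marker and is kept as-is
merged = chunks[0].tolist()
for chunk in chunks[1:]:
    merged.append(' '.join(chunk))

print(merged)
```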

The solution below recognises special markers such as *, ** and (number) at the start of a sentence and merges the rows that follow each one, except the last row.

import pandas as pd
import re
df = pd.DataFrame({'row':['aaa','bbb','*','hello','**','my name','is xyz',
             '(1)','this is','temp','name',
             '(2)','BTW','how to','solve this',
             '(3)','with python','I don’t want this to be added ',
             'I don’t want this to be added ']})



pattern = r"^\(\d+\)|^\*+" #Pattern to identify string starting with (number),*,**.

#print(df)
#Selecting index based on the above pattern
selected_index = df[df["row"].str.contains(re.compile(pattern))].index.values
delete_index = []
for index in selected_index:
    i=1
    #Merging row until next selected index found and add merged rows to delete_index list
    while(index+i not in selected_index and index+i < len(df)-1):
        df.at[index, 'row'] += ' ' + df.at[index+i, 'row']
        delete_index.append(index+i)
        i+=1


df.drop(delete_index,inplace=True)
#print(df)

Output:

    row
0   aaa
1   bbb
2   * hello
4   ** my name is xyz
7   (1) this is temp name
11  (2) BTW how to solve this
15  (3) with python I don’t want this to be added
18  I don’t want this to be added

You can reset the index if you want, using df.reset_index().
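As a quick check of what the pattern in this answer matches, and of what resetting the index does, here is a small sketch. The sample strings and the tiny frame are made up for illustration; `drop=True` is an addition here that discards the old index instead of keeping it as a column.

```python
import re
import pandas as pd

# Same pattern as in the answer: "(number)" or one or more "*" at the start
pattern = re.compile(r"^\(\d+\)|^\*+")

samples = ['*', '**', '(1)', '(12) text', 'aaa', 'a (1)']
flags = [bool(pattern.search(s)) for s in samples]
print(flags)

# After dropping merged rows the index has gaps; reset_index renumbers it
df = pd.DataFrame({'row': ['aaa', '* hello', '(1) this is']}, index=[0, 2, 7])
df = df.reset_index(drop=True)
print(df.index.tolist())
```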

I think it is easier when you design your logic to separate df_in into 3 parts: top, middle and bottom. Keep top and bottom intact while joining the middle part. Finally, concat the 3 parts together into df_out.

First, create the m1 and m2 masks to separate df_in into 3 parts.

m1 = df_in.notes.str.strip().str.contains(r'^\*+|\(\d+\)$').cummax()
m2 =  ~df_in.notes.str.strip().str.contains(r'^I don’t want this to be added$')
top = df_in[~m1].notes
middle = df_in[m1 & m2].notes
bottom = df_in[~m2].notes
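To see why cummax is used for m1, the sketch below applies the same regex to a shortened, made-up series. The raw match is True only on marker rows, while the cumulative maximum stays True from the first marker onward, which is what separates "top" from everything after it.

```python
import pandas as pd

notes = pd.Series(['aaa', 'bbb', '*', 'hello', '(1)', 'this is'])

# True only on the marker rows themselves
is_marker = notes.str.contains(r'^\*+|\(\d+\)$')
# True from the first marker row to the end
m1 = is_marker.cummax()

print(is_marker.tolist())
print(m1.tolist())
print(notes[~m1].tolist())  # the "top" part
```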

Next, create groupby_mask to group rows, then groupby and join:

groupby_mask = middle.str.strip().str.contains(r'^\*+|\(\d+\)$').cumsum()
middle_join = middle.groupby(groupby_mask).agg(' '.join)

Out[3110]:
notes
1                      * hello
2            ** my name is xyz
3        (1) this is temp name
4    (2) BTW how to solve this
5              (3) with python
Name: notes, dtype: object
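The grouping step relies on cumsum turning the boolean marker mask into a group number that increases by one at every marker. A small sketch on a shortened, made-up "middle" series shows the intermediate values:

```python
import pandas as pd

middle = pd.Series(['*', 'hello', '**', 'my name', 'is xyz', '(1)', 'this is'])

# Each marker row starts a new group; following rows share its number
group_id = middle.str.contains(r'^\*+|\(\d+\)$').cumsum()
print(group_id.tolist())

# Join the rows of each group with a single space
joined = middle.groupby(group_id).agg(' '.join)
print(joined.tolist())
```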

Finally, use pd.concat to concat top, middle_join and bottom:

df_final = pd.concat([top, middle_join, bottom], ignore_index=True).to_frame()

Out[3114]:
                            notes
0                             aaa
1                             bbb
2                         * hello
3               ** my name is xyz
4           (1) this is temp name
5       (2) BTW how to solve this
6                 (3) with python
7  I don’t want this to be added
8  I don’t want this to be added
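For reuse, the three steps can be wrapped in a single function. The following is only a sketch under stated assumptions: the function name `merge_marked_rows` and its `marker` and `tail` parameters are hypothetical, the tail pattern is still hard-coded to this sample's text as in the answer, and the bottom part is restricted to rows after the first marker.

```python
import pandas as pd

def merge_marked_rows(notes,
                      marker=r'^\*+|\(\d+\)$',
                      tail=r'^I don’t want this to be added$'):
    """Join each marker row with the rows below it (hypothetical helper)."""
    stripped = notes.str.strip()
    is_marker = stripped.str.contains(marker)
    started = is_marker.cummax()           # True from the first marker on
    is_tail = stripped.str.contains(tail)  # rows that must stay untouched

    top = notes[~started]
    middle = notes[started & ~is_tail]
    bottom = notes[started & is_tail]

    # Number the groups: each marker row opens a new one
    groups = is_marker[middle.index].cumsum()
    middle_join = middle.groupby(groups).agg(' '.join)

    return pd.concat([top, middle_join, bottom],
                     ignore_index=True).to_frame('notes')

df_in = pd.DataFrame({'notes': [
    'aaa', 'bbb', '*', 'hello', '**', 'my name', 'is xyz',
    '(1)', 'this is', 'temp', 'name',
    '(2)', 'BTW', 'how to', 'solve this',
    '(3)', 'with python', 'I don’t want this to be added ',
    'I don’t want this to be added ']})

df_final = merge_marked_rows(df_in['notes'])
print(df_final)
```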

Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site's URL or the original address.
