Merging rows of a DataFrame based on a condition

Question

I have a CSV file with only one column, "notes". I want to merge rows of the DataFrame based on some conditions.

import pandas as pd

Input_data={'notes':
            ['aaa','bbb','*','hello','**','my name','is xyz',
             '(1)','this is','temp','name',
             '(2)','BTW','how to','solve this',
             '(3)','with python','I don’t want this to be added ',
             'I don’t want this to be added ']}

df_in = pd.DataFrame(Input_data)

The input looks like the above.

Expected output:

output_Data={'notes':
             ['aaa','bbb','*hello','**my name is xyz',
              '(1) this is temp name',
              '(2) BTW how to solve this',
              '(3) with python','I don’t want this to be added ',
              'I don’t want this to be added ']}
df_out=pd.DataFrame(output_Data)

I want to merge rows into the row above them when that row contains "*" or "(number)", so that the output looks like df_out above.

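Other rows that cannot be merged should be kept as they are. Also, for the last marker, there is no proper way to know how far the merge should extend, so let's say only the single next row is added. I solved this, but my solution is very long. Is there a simpler way?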

df=pd.DataFrame(Input_data)
notes=[];temp=[];flag='';value='';c=0;chk_star='yes'
for i,row in df.iterrows():
    row[0]=str(row[0])
    if '*' in row[0].strip()[:5] and chk_star=='yes':   
        value=row[0].strip()
        temp=temp+[value]
        value=''
        continue

    if '(' in row[0].strip()[:5]:
        chk_star='no'
        temp=temp+[value]
        value='';c=0
        flag='continue'
        value=row[0].strip()
    if flag=='continue' and '(' not in row[0][:5] : 
        value=value+row[0]
        c=c+1
    if c>4:
        temp=temp+[value] 
        print("111", value, temp)
        break
if '' in temp:
    temp.remove('')
df=pd.DataFrame({'notes':temp})

Answer 1

You can use a mask to avoid looping over every row:

import numpy as np
import pandas as pd

df = pd.DataFrame({'row':['aaa','bbb','*','hello','**','my name','is xyz',
         '(1)','this is ','temp ','name',
         '(2)','BTW ','how to ','solve this',
         '(3)','with python ','I don’t want this to be added ',
         'I don’t want this to be added ']})

special = ['*', '**']
for i in range(11):
    special.append('({})'.format(i))

# We find the indexes where we will have to merge
index_to_merge = df[df['row'].isin(special)].index.values
for idx, val in enumerate(index_to_merge):
    if idx != len(index_to_merge)-1:
        df.loc[val, 'row'] += ' ' + df.loc[val+1:index_to_merge[idx+1]-1, 'row'].values.sum()
    else:
        df.loc[val, 'row'] += ' ' + df.loc[val+1:, 'row'].values.sum()

# We delete the rows that we just used to merge
df.drop([x for x in np.array(range(len(df))) if x not in index_to_merge])

Out:

        row
2   * hello
4   ** my nameis xyz
7   (1) this is temp name
11  (2) BTW how to solve this
15  (3) with python I don’t want this to be added ..
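
Note that the drop above keeps only the marker rows, so 'aaa' and 'bbb' disappear as well. The following is a hedged sketch, not the answerer's code, of one way to keep the rows before the first marker: delete only the rows that were actually absorbed. The names `is_marker`, `markers` and `absorbed` are illustrative, and the sketch assumes the default 0..n-1 integer index.

```python
import pandas as pd

df = pd.DataFrame({'row': ['aaa', 'bbb', '*', 'hello', '**', 'my name', 'is xyz',
                           '(1)', 'this is', 'temp', 'name']})

# Rows that consist only of a marker: one or more '*', or '(number)'.
is_marker = df['row'].str.fullmatch(r'\*+|\(\d+\)')
markers = list(df.index[is_marker.to_numpy()])

absorbed = []  # labels of the rows folded into a marker row
for pos, start in enumerate(markers):
    # Merge up to (but not including) the next marker, or to the end of the frame.
    stop = markers[pos + 1] if pos + 1 < len(markers) else len(df)
    following = list(range(start + 1, stop))
    if following:
        df.loc[start, 'row'] += ' ' + ' '.join(df.loc[following, 'row'])
        absorbed.extend(following)

# Drop only the absorbed rows, so unmarked rows before the first marker survive.
result = df.drop(absorbed).reset_index(drop=True)
print(result['row'].tolist())
```

Joining with ' '.join also avoids results such as "my nameis xyz", which come from concatenating strings that have no trailing space.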

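You can also convert the column to a numpy array and use numpy functions to simplify the operation. First, you can use np.where and np.isin to find the indices where a merge is needed. That way you do not have to iterate over the whole array with a for loop.

Then you can do the merges at the corresponding indices. Finally, you can delete the values that have been merged. Here is what it looks like: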

import numpy as np

list_to_merge = np.array(['aaa','bbb','*','hello','**','my name','is xyz',
             '(1)','this is','temp','name',
             '(2)','BTW','how to','solve this',
             '(3)','with python','I don’t want this to be added ',
             'I don’t want this to be added '])
special = ['*', '**']
for i in range(11):
    special.append('({})'.format(i))

ix = np.isin(list_to_merge, special)
rows_to_merge = np.where(ix)[0]

# We merge the rows
for index_to_merge in np.where(ix)[0]:
    # Check that we are not trying to merge with an out-of-bounds value
    if index_to_merge!=len(list_to_merge)-1:
        list_to_merge[index_to_merge] = list_to_merge[index_to_merge] + ' ' + list_to_merge[index_to_merge+1]

# We delete the rows that have just been used to merge:
rows_to_delete = rows_to_merge +1
list_to_merge = np.delete(list_to_merge, rows_to_delete)

Out:

['aaa', 'bbb', '* hello', '** my name', 'is xyz', '(1) this is',
       'temp', 'name', '(2) BTW', 'how to', 'solve this',
       '(3) with python', 'I don’t want this to be added ',
       'I don’t want this to be added ']
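
One caveat with this approach, shown here as a small sketch with made-up data: a numpy array built from strings gets a fixed-width dtype, so assigning a longer merged string back into it is silently truncated. In the example above the merged strings happen to fit within the longest element, but with other data they may not. Passing `dtype=object` is one possible way around this.

```python
import numpy as np

# Fixed-width unicode array: every item can hold at most 7 characters here.
fixed = np.array(['(1)', 'this is'])
fixed[0] = fixed[0] + ' ' + fixed[1]   # '(1) this is' has 11 characters
print(repr(fixed[0]))                  # the stored value is cut to 7 characters

# Object array: items are ordinary Python strings of any length.
flexible = np.array(['(1)', 'this is'], dtype=object)
flexible[0] = flexible[0] + ' ' + flexible[1]
print(repr(flexible[0]))
```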

Answer 2

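The solution below identifies special markers at the start of a string, such as *, ** and (number), and merges the subsequent rows into the marker row, leaving the last row out.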

import pandas as pd
import re
df = pd.DataFrame({'row':['aaa','bbb','*','hello','**','my name','is xyz',
             '(1)','this is','temp','name',
             '(2)','BTW','how to','solve this',
             '(3)','with python','I don’t want this to be added ',
             'I don’t want this to be added ']})



pattern = r"^\(\d+\)|^\*+" # Pattern to identify strings starting with (number), * or **.

#print(df)
#Selecting index based on the above pattern
selected_index = df[df["row"].str.contains(re.compile(pattern))].index.values
delete_index = []
for index in selected_index:
    i=1
    #Merging row until next selected index found and add merged rows to delete_index list
    while(index+i not in selected_index and index+i < len(df)-1):
        df.at[index, 'row'] += ' ' + df.at[index+i, 'row']
        delete_index.append(index+i)
        i+=1


df.drop(delete_index,inplace=True)
#print(df)

Output:

    row
0   aaa
1   bbb
2   * hello
4   ** my name is xyz
7   (1) this is temp name
11  (2) BTW how to solve this
15  (3) with python I don’t want this to be added
18  I don’t want this to be added

You can reset the index if needed, using df.reset_index().
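
As a small illustration with made-up data, this is roughly how the gaps left by the dropped rows can be closed. Passing `drop=True` discards the old labels; without it, `reset_index` keeps them as an extra `index` column.

```python
import pandas as pd

# Index labels with gaps, as left behind after dropping merged rows.
df = pd.DataFrame({'row': ['aaa', '* hello', '** my name is xyz']}, index=[0, 2, 4])

# drop=True renumbers the rows 0..n-1 without keeping the old labels.
clean = df.reset_index(drop=True)
print(clean.index.tolist())
```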

Answer 3

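I think the logic is easier to design if df_in is split into 3 parts: top, middle and bottom. Keep the top and bottom intact while joining the middle part. Finally, concatenate the 3 parts into df_out.

First, create the masks m1 and m2 to split df_in into 3 parts.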

m1 = df_in.notes.str.strip().str.contains(r'^\*+|\(\d+\)$').cummax()
m2 =  ~df_in.notes.str.strip().str.contains(r'^I don’t want this to be added$')
top = df_in[~m1].notes
middle = df_in[m1 & m2].notes
bottom = df_in[~m2].notes

Next, create groupby_mask to group the rows, then use groupby and join:

groupby_mask = middle.str.strip().str.contains(r'^\*+|\(\d+\)$').cumsum()
middle_join = middle.groupby(groupby_mask).agg(' '.join)

Out[3110]:
notes
1                      * hello
2            ** my name is xyz
3        (1) this is temp name
4    (2) BTW how to solve this
5              (3) with python
Name: notes, dtype: object
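
The two cumulative operations used above can be seen in isolation in the following sketch, which uses a shortened, made-up series. `cummax` on a boolean mask turns True at the first marker and stays True, and `cumsum` yields a group number that increases at every marker row.

```python
import pandas as pd

notes = pd.Series(['aaa', '*', 'hello', '(1)', 'this is', 'temp'])
is_marker = notes.str.contains(r'^\*+|\(\d+\)$')

started = is_marker.cummax()    # False until the first marker, True afterwards
group_id = is_marker.cumsum()   # increases by 1 at every marker row

print(started.tolist())
print(group_id.tolist())
```

Rows that share a `group_id` are the ones that `groupby(...).agg(' '.join)` concatenates.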

Finally, use pd.concat to concatenate top, middle_join and bottom:

df_final = pd.concat([top, middle_join, bottom], ignore_index=True).to_frame()

Out[3114]:
                            notes
0                             aaa
1                             bbb
2                         * hello
3               ** my name is xyz
4           (1) this is temp name
5       (2) BTW how to solve this
6                 (3) with python
7  I don’t want this to be added
8  I don’t want this to be added
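
The m2 mask above relies on knowing the exact text of the trailing rows. As a hedged variation, the sketch below instead follows the rule stated in the question, namely that the last marker absorbs only one following row. The helper `merge_marked_rows` and its `last_take` parameter are hypothetical names introduced for this sketch, and the sample data is made up.

```python
import pandas as pd

def merge_marked_rows(notes, pattern=r'^\*+|\(\d+\)$', last_take=1):
    """Join each marker row with the rows that follow it.

    `last_take` is an assumption of this sketch: it caps how many rows the
    final marker absorbs, since nothing marks where its block ends.
    """
    notes = notes.str.strip().reset_index(drop=True)
    is_marker = notes.str.contains(pattern)
    group_id = is_marker.cumsum()
    # Position of each row inside its block (0 for the marker row itself).
    offset = notes.groupby(group_id).cumcount()
    # Rows left alone: those before the first marker, and those past the cap
    # in the last block.
    loose = (group_id == 0) | ((group_id == group_id.max()) & (offset > last_take))
    # A new group starts at every marker and at every loose row.
    key = (is_marker | loose).cumsum()
    return notes.groupby(key).agg(' '.join).reset_index(drop=True)

notes = pd.Series(['aaa', 'bbb', '*', 'hello', '**', 'my name', 'is xyz',
                   '(1)', 'this is', 'temp', 'name',
                   '(3)', 'with python', 'keep me', 'keep me too'])
merged = merge_marked_rows(notes)
print(merged.tolist())
```

Because every loose row gets its own group key, the original row order is preserved and no pd.concat step is needed.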

Answer 1
0 2019-05-31 09:45:38

Answer 2
0 2019-05-31 10:56:27

Answer 3
0 2019-05-31 23:03:06
