Regex search/replace on columns with Python pandas

Question

The following is a small example of a .csv file I'm trying to do some data manipulation on. Each "comment" column has sub-fields of its own, separated by semicolons ("date;user;comment"). My goal is to prepend "gp-" to the user.

Original:

issue_key,summary,comment,comment,comment,comment,resolution
ABC-1234,summary1,"03/11/2021 12:18;user1;a text comment","03/10/2021 11:18;user2,a text comment",,,Unresolved
ABC-4321,summary2,"03/08/2021 12:10;user7;a text comment","03/10/2021 11:18;user5,a text comment",,,Unresolved
ABC-2214,summary3,"03/09/2021 12:20;user9;a text comment",,"03/10/2021 11:18;user3,a text comment",,Unresolved

What I'd like it to be transformed to:

issue_key,summary,comment,comment,comment,comment,resolution
ABC-1234,summary1,"03/11/2021 12:18;gp-user1;a text comment","03/10/2021 11:18;gp-user2,a text comment",,,Unresolved
ABC-4321,summary2,"03/08/2021 12:10;gp-user7;a text comment","03/10/2021 11:18;gp-user5,a text comment",,,Unresolved
ABC-2214,summary3,"03/09/2021 12:20;gp-user9;a text comment",,"03/10/2021 11:18;gp-user3,a text comment",,Unresolved

The code I have so far. I think I'm close-ish:

with open(destination_filename) as f:
    orig_header = f.readline()
orig_header = orig_header.split(",")
orig_header[-1] = orig_header[-1].strip()
csv_data = pd.read_csv(destination_filename)
cols = csv_data.columns[csv_data.columns.str[:7]=='Comment']
csv_data[cols] = csv_data[cols].apply(lambda x: re.sub(r'(\d+\/\d+\/\d\d\d\d \d+:\d+);(\S+);(.*)', r'\1;gp-\2;\3', str(x)))
csv_data.to_csv(f"{destination_filename}", index = False, header=orig_header)
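Two details in the attempt above are worth noting. DataFrame.apply hands the lambda a whole column (a Series), so str(x) stringifies the entire column rather than one cell, and the column test compares against 'Comment' while the sample header uses lowercase 'comment'. A minimal sketch of a pandas version under those observations follows. The function name prefix_users, the PATTERN constant, and the in-memory SAMPLE are illustrative assumptions, and the pattern only anchors on the date so that entries such as "user2,a text comment" are also covered:

```python
import csv
import io

import pandas as pd

# Two rows of the sample data from the question, held in memory for illustration
SAMPLE = '''issue_key,summary,comment,comment,comment,comment,resolution
ABC-1234,summary1,"03/11/2021 12:18;user1;a text comment","03/10/2021 11:18;user2,a text comment",,,Unresolved
ABC-2214,summary3,"03/09/2021 12:20;user9;a text comment",,"03/10/2021 11:18;user3,a text comment",,Unresolved
'''

# Leading "date time;" part of a comment cell; the user follows the first ';'
PATTERN = r'^(\d+/\d+/\d{4} \d+:\d+);'


def prefix_users(csv_text, prefix='gp-'):
    """Return csv_text with `prefix` inserted before the user in every comment column."""
    # pandas renames duplicate columns (comment, comment.1, ...), so keep the original header
    header = next(csv.reader(io.StringIO(csv_text)))
    # Read everything as text and keep empty cells as '' instead of NaN
    df = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)
    cols = df.columns[df.columns.str.startswith('comment')]
    # Apply the regex cell by cell through the vectorised .str accessor
    df[cols] = df[cols].apply(
        lambda s: s.str.replace(PATTERN, rf'\1;{prefix}', regex=True)
    )
    return df.to_csv(index=False, header=header)


if __name__ == '__main__':
    print(prefix_users(SAMPLE))
```

For a file on disk, the same steps would apply with a path passed to pd.read_csv and to_csv in place of the in-memory strings.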

Answer 1

One approach would be to just use the built-in csv library. It can also be used to parse each comment field as a ;-separated CSV row.

For example:

import io
import csv

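def replace_user(entry):
    # Empty cells are returned unchanged
    if entry:
        # Parse the cell as a single ';'-separated CSV row
        values = next(csv.reader(io.StringIO(entry, newline=''), delimiter=';'))
        values[1] = f'gp-{values[1]}'  # the second field holds the user
        entry = ';'.join(values)
    return entry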


with open('input.csv', newline='') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)
    csv_output.writerow(next(csv_input)) # copy the header
    
    for row in csv_input:
        row[2:6] = [replace_user(v) for v in row[2:6]]
        csv_output.writerow(row)

Giving you an output.csv containing:

issue_key,summary,comment,comment,comment,comment,resolution
ABC-1234,summary1,03/11/2021 12:18;gp-user1;a text comment,"03/10/2021 11:18;gp-user2,a text comment",,,Unresolved
ABC-4321,summary2,03/08/2021 12:10;gp-user7;a text comment,"03/10/2021 11:18;gp-user5,a text comment",,,Unresolved
ABC-2214,summary3,03/09/2021 12:20;gp-user9;a text comment,,"03/10/2021 11:18;gp-user3,a text comment",,Unresolved

If comments can also contain quotes or newlines, an additional csv.writer() could be used instead of the join().
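As a rough sketch of that variation, the version of replace_user below writes the fields back through a second csv.writer, so a comment containing ';', quotes, or newlines is quoted again instead of being joined verbatim. The prefix parameter and the in-memory buffer are illustrative choices, not part of the original answer:

```python
import csv
import io


def replace_user(entry, prefix='gp-'):
    """Prefix the user (second ';'-separated field) of one comment cell."""
    if not entry:
        return entry  # empty cells stay empty
    # Parse the cell as a single ';'-separated CSV row
    values = next(csv.reader(io.StringIO(entry, newline=''), delimiter=';'))
    values[1] = f'{prefix}{values[1]}'
    # Serialise with csv.writer so special characters are quoted as needed
    buffer = io.StringIO(newline='')
    csv.writer(buffer, delimiter=';').writerow(values)
    # Drop the row terminator that writerow() appends
    return buffer.getvalue().rstrip('\r\n')


if __name__ == '__main__':
    print(replace_user('03/11/2021 12:18;user1;"a ""quoted""; comment"'))
```

With a plain join(), the last field in that example would come back as a "quoted"; comment without its surrounding quotes, and the embedded ';' would then look like a field separator.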

Solution 1
0 2021-03-12 10:57:47
