Replace a pattern in a CSV file with Python

I have multiple CSV files that can represent the same thing in several ways. For instance, 15 years can be written as age: 15, age (years): 15, or age: 15 years (these are all the patterns I've seen so far). I'd like to replace all of those with 15 years. I know how to do it when I know the actual age or the column number, but the age is different for each occurrence and the column is not fixed. The CSV files could look like this:

CSV1:

h1,h2,h3
A1,age:15,hh
B3,age:10,fg

Desired CSV1:

h1,h2,h3
A1,15 years,hh
B3,10 years,fg

Whenever it's just age: 15, it's definitely years and not months or any other unit.

Use re.sub as shown below:

re.sub(r'(,|^)(?:age\s*(?:\(years\))?:\s*(\d+)\s*(?:years)?)(?=,|$)',
       r'\1\2 years', string)


Example:

import re
import csv
with open('file') as f:
    reader = csv.reader(f)
    for i in reader:
        print(re.sub(r'(,|^)(?:age\s*(?:\(years\))?:\s*(\d+)\s*(?:years)?)(?=,|$)', r'\1\2 years', ','.join(i)))

Output:

h1,h2,h3
A1,15 years,hh
B3,10 years,fg

Or, with a looser pattern:

for i in reader:
    print(re.sub(r'(,|^)[^,\n]*age\s*:[^,\n]*\b(\d+)\b[^,\n]*', r'\1\2 years', ','.join(i)))
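As a quick sanity check, the sketch below runs both patterns over the three spellings from the question. The names STRICT and LOOSE and the sample rows are made up for illustration; the patterns themselves are copied from the code above.

```python
import re

# Patterns copied from the answer above; the names STRICT and LOOSE are illustrative.
STRICT = re.compile(r'(,|^)(?:age\s*(?:\(years\))?:\s*(\d+)\s*(?:years)?)(?=,|$)')
LOOSE = re.compile(r'(,|^)[^,\n]*age\s*:[^,\n]*\b(\d+)\b[^,\n]*')

# Sample rows (made up) covering the three spellings mentioned in the question.
rows = [
    "A1,age:15,hh",
    "A1,age (years): 15,hh",
    "A1,age: 15 years,hh",
]

strict_out = [STRICT.sub(r'\1\2 years', row) for row in rows]
loose_out = [LOOSE.sub(r'\1\2 years', row) for row in rows]

for before, strict, loose in zip(rows, strict_out, loose_out):
    print(f"{before!r:28} strict={strict!r:20} loose={loose!r}")
```

The first pattern rewrites all three rows to A1,15 years,hh. The second only matches when age is followed directly by a colon, so it leaves the age (years): 15 row unchanged. It also rewrites a cell such as age: 15 months to 15 years, so it is only safe when no other units appear in the data.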

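Use the translate table methods. The code below is written for Python 3, where maketrans is a static method of str rather than a function in the string module.

import csv
from string import ascii_letters

# characters to strip from the age field: all ASCII letters plus the colon
delete = ascii_letters + ":"
tran = str.maketrans("", "", delete)

with open("infile.csv", newline="") as infile, open("output.csv", "w", newline="") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    header = next(reader, None)
    if header is not None:
        writer.writerow(header)  # copy the header row through unchanged
    for row in reader:
        # assuming the age is in the second field
        row[1] = row[1].translate(tran).strip() + " years"
        writer.writerow(row)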

I generally prefer str.translate over regex where applicable, as it's easier to follow and debug.
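To see what a deletion table does to each spelling from the question, here is a small sketch in Python 3 syntax (str.maketrans with a third argument listing the characters to delete). The sample cells and the names letters_only and digits_only are assumptions for illustration, and the second table is a variation that is not part of the answer.

```python
from string import ascii_letters, punctuation, whitespace

# Table 1: delete ASCII letters and ":" (the same characters the answer removes).
letters_only = str.maketrans("", "", ascii_letters + ":")
# Table 2 (a variation, not from the answer): also delete punctuation and whitespace.
digits_only = str.maketrans("", "", ascii_letters + punctuation + whitespace)

cells = ["age:15", "age (years): 15", "age: 15 years"]

results_a = [cell.translate(letters_only) + " years" for cell in cells]
results_b = [cell.translate(digits_only) + " years" for cell in cells]

for cell, a, b in zip(cells, results_a, results_b):
    print(repr(cell), "->", repr(a), "|", repr(b))
```

With the first table, age (years): 15 keeps its parentheses and spaces, while the second table yields 15 years for all three cells. Either table rewrites whatever cell it is applied to, so this approach only fits when the age column is known in advance.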

It's a guessing game, but if the rule is that you want to convert anything that contains the word "year" or "age" and some decimal number, this should do:

import csv
import os
import re

_is_age_search = re.compile(r"year|age", re.IGNORECASE).search
_find_num_search = re.compile(r"(\d+)").search

csv_filenames = []  # fill in with the names of the input CSV files
outdir = '/some/dir'
for filename in csv_filenames:
    with open(filename, newline='') as f_in, \
            open(os.path.join(outdir, filename), 'w', newline='') as f_out:
        writer = csv.writer(f_out)
        for row in csv.reader(f_in):
            for i, val in enumerate(row):
                # only touch cells that mention "year" or "age"
                if _is_age_search(val):
                    search = _find_num_search(val)
                    if search:
                        # group(1) is a string, so format with %s rather than %d
                        row[i] = "%s years" % search.group(1)
            writer.writerow(row)
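
The same cell-by-cell idea can be tried on in-memory data before pointing it at real files. This is a minimal sketch, assuming the sample CSV text below (made up from the question's example) and a hypothetical helper named normalize_row; it uses io.StringIO in place of files on disk.

```python
import csv
import io
import re

is_age = re.compile(r"year|age", re.IGNORECASE).search
find_num = re.compile(r"\d+").search

def normalize_row(row):
    """Return a copy of row with age-like cells rewritten as '<n> years'."""
    out = []
    for val in row:
        # Look for a number only in cells that mention "year" or "age".
        num = find_num(val) if is_age(val) else None
        out.append(f"{num.group()} years" if num else val)
    return out

# Made-up sample covering the three spellings from the question.
sample = "h1,h2,h3\nA1,age:15,hh\nB3,age (years): 10,fg\nC2,age: 7 years,zz\n"

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
for row in csv.reader(io.StringIO(sample)):
    writer.writerow(normalize_row(row))

result = buf.getvalue()
print(result)
```

Because the check is a plain substring search, any cell that contains "age" plus a digit is rewritten, so a value like stage 2 would become 2 years. Tightening the first pattern (for example with word boundaries) may be worthwhile if such cells can occur.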
