
Using a Python dictionary to find/replace strings in a CSV with multiple strings to replace per cell

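Please note that this is a revised and refined version of my original query, hopefully clearer than my first attempt. I am new to programming and am trying to create a script that performs a series of specific find-and-replace operations on a CSV, using another CSV table as the correction guide (e.g. chikn becomes chicken, bcon becomes bacon).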

So in the simple case:
chikn,1,a
bcon,2,b
egs,3,c

becomes
chicken,1,a
bacon,2,b
eggs,3,c

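So far, with the code below, I have built a dictionary from the input CSV and can apply most of the corrections to the target CSV as expected in the simple case. The real challenge is that the actual dataset often has 1-3 entries per cell (separated by a common delimiter, ":"), and many of them contain spaces (i.e. they are phrases rather than single words). Building on the previous example with an updated dictionary, this would be:

Starts as:
chk sandwich:egs,1,a
bcon,2,b
bcon:egs,3,c

Should end as:
chicken sandwich:eggs,1,a
bacon,2,b
bacon:eggs,3,c

Instead, my current output drops the latter part and prints:
chicken sandwich,1,a
bacon,2,b
bacon,3,c

Code: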

#!/usr/bin/env python
"""A script for finding and replacing values in CSV files.

"""

import csv
import re
import sys


def main(args):
    """Execute the transformation script.

    Args:
        args (list of `str`): The command line arguments.

    """
    transform(args[1], args[2], create_reps(args[3]), int(args[4]))


def transform(infile, outfile, reps, column):
    """Write a new CSV file with replaced text.

    Args:
        infile (str): the sheet of original text with errors
        outfile (str): the sheet with the revised text with corrections in place of errors
        reps (dict): dictionary of error word and corrected word
        column (int): the column (0 based) the word revisions will take place in

    """
    with open(infile) as csvfile:
        with open(outfile, 'w') as w:
            spamreader = csv.reader(csvfile)
            spamwriter = csv.writer(w)
            for row in spamreader:
                row[column] = new_replace_all(row[column], reps)
                spamwriter.writerow(row)


def create_reps(infile):
    """Create reps object to use as reference dictionary for transform.

    Args:
        infile (str): The sheet of original and corrected words used to
            generate the dictionary

    Returns:
        (dict): a dictionary listing the error words and their
            corrections

    """
    reps = {}
    with open(infile) as csvfile:
        dictreader = csv.reader(csvfile)
        for row in dictreader:
            reps[row[0]] = row[1]

    return reps


# Original version, kept for reference:
# def replace_all(text, reps):
#     """Iterate through `reps` and replace key => value in `text`.
#
#     Args:
#         text (str): The text to search and replace.
#         reps (dict): Search for `key` and replace with `value`.
#
#     Returns:
#         (str): The string with the replacements.
#
#     """
#     last = text
#     for i, j in reps.items():
#         text = text.replace(i, j)
#         if last != text:
#             return text

def new_replace_all(text, reps):
    """Updated Version: Do a single-pass replacement from a dictionary"""
    pattern = re.compile(r'\b(' + '|'.join(reps.keys()) + r')\b')
    return pattern.sub(lambda x: reps[x.group()], text)

if __name__ == "__main__":
    main(sys.argv)

Thanks in advance for your time and support. I look forward to your guidance!

Best.

---------------- Update 4/5/18 ----------------

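With HFBrowing's great support, I have been able to modify this code to work with the sample dataset originally provided. In my actual application, however, I found that it still breaks when exposed to some of the more complex string matches in the dataset. I welcome any suggestions on how to resolve this, and have provided some examples and the error below.

Ideally, items within a given cell that are linked by "|" would stay together, and items within a given cell that are linked by ":" would be treated as separate strings and replaced individually.

So, if:
"A|first" = "A1" and "B|first" = "B1"
then
"A|first:B|first" should be converted to "A1:B1".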
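
The rule described above (split on ":", keep "|" items intact, replace each item separately) can be sketched as a per-item dictionary lookup. This is only a minimal illustration, assuming each item must match a dictionary key exactly; the function name `replace_items` and the `sep` parameter are hypothetical and do not appear in the original script.

```python
def replace_items(cell, reps, sep=":"):
    """Replace each sep-delimited item in a cell via exact dictionary lookup."""
    items = cell.split(sep)
    # dict.get falls back to the original item when no correction exists,
    # so a "|" inside an item is never treated specially.
    return sep.join(reps.get(item.strip(), item) for item in items)


reps = {"A|first": "A1", "B|first": "B1"}
print(replace_items("A|first:B|first", reps))  # A1:B1
```

Because the cell is split before any replacement happens, a corrected value that itself contains ":" is not split or replaced a second time.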

Using this more complex string data, I have provided a sample, the expected output, the current output, and the error I receive.

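Sample dictionary
Error Word,Correct Word
Actuarial Science,Accounting:Actuarial Science
Anthropology,Anthropology:General
Undeclared,Undetermined
Information Technology and Administrative Management|Administrative Management Specialization,Information Technology and Administrative Management:Administrative Management Specialization
Biology,Biology

Sample input
Major,ID,Last
Actuarial Science,111,Smith
Anthropology,222,Bob
Anthropology:Actuarial Science,333,Johnson
Information Technology and Administrative Management|Administrative Management Specialization,444,Frank
555,Undeclared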

Current output error:

Traceback (most recent call last):
  File "myscript3.py", line 89, in <module>
    main(sys.argv)
  File "myscript3.py", line 21, in main
    transform(args[1], args[2], create_reps(args[3]), int(args[4]))
  File "myscript3.py", line 41, in transform
    row[column] = new_replace_all(row[column], reps)
  File "myscript3.py", line 68, in new_replace_all
    return pattern.sub(lambda x: reps[x.group()], text)
  File "myscript3.py", line 68, in <lambda>
    return pattern.sub(lambda x: reps[x.group()], text)
KeyError: 'Information Technology and Administrative Management'
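
A likely cause of this KeyError is that the dictionary keys are joined into the regular expression without escaping, so a "|" inside a key is read as regex alternation and only part of the key is matched. The sketch below reproduces the effect with the short "A|first" key from the example above; it is an illustration of the mechanism, not a trace of the actual dataset.

```python
import re

reps = {"A|first": "A1"}

# Unescaped: the "|" inside the key splits it into two alternatives,
# "A" and "first", so only a fragment of the key is matched.
unescaped = re.compile(r'\b(' + '|'.join(reps.keys()) + r')\b')
match = unescaped.search("A|first")
print(match.group())          # A
print(match.group() in reps)  # False, so reps[match.group()] raises KeyError

# Escaped: re.escape makes the whole key a literal string.
escaped = re.compile('|'.join(re.escape(k) for k in reps))
print(escaped.search("A|first").group())  # A|first
```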

Current output csv
Major,ID,Last
Accounting:Actuarial Science,111,Sumeri
Anthropology:General,222,Nelson
Anthropology:General:Accounting:Actuarial Science,333,Newman

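---------------- Update 4/6/18: Solved ----------------

Hi all,

Thank you for your support. At a colleague's suggestion, I modified the original "replace_all" code as shown below. This now seems to work as expected in my context.

Thanks again for your time and support!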

#!/usr/bin/env python
"""A script for finding and replacing values in CSV files.

Example::
    ./myscript school-data.csv outfile-data.csv replacements.csv 4

"""

import csv
import sys


def main(args):
    """Execute the transformation script.

    Args:
        args (list of `str`): The command line arguments.

    """
    transform(args[1], args[2], create_reps(args[3]), int(args[4]))


def transform(infile, outfile, reps, column):
    """Write a new CSV file with replaced text.

    Args:
        infile (str): the sheet of original text with errors
        outfile (str): the sheet with the revised text with corrections in
            place of errors
        reps (dict): dictionary of error word and corrected word
        column (int): the column (0 based) the word revisions will take place
            in

    """
    with open(infile) as csvfile:
        with open(outfile, 'w') as w:
            spamreader = csv.reader(csvfile)
            spamwriter = csv.writer(w)
            for row in spamreader:
                row[column] = replace_all(row[column], reps)
                spamwriter.writerow(row)


def create_reps(infile):
    """Create reps object to use as reference dictionary for transform.

    Args:
        infile (str): The sheet of original and corrected words used to
            generate the dictionary

    Returns:
        (dict): a dictionary listing the error words and their
            corrections

    """
    reps = {}
    with open(infile) as csvfile:
        dictreader = csv.reader(csvfile)
        for row in dictreader:
            reps[row[0]] = row[1]

    return reps


def replace_all(text, reps):
    """Iterate through `reps` and replace key => value in `text`.

    Args:
        text (str): The text to search and replace.
        reps (dict): Search for `key` and replace with `value`.

    Returns:
        (str): The string with the replacements.

    """
    last = text
    for i, j in reps.items():
        text = text.replace(i, j)
        #if last != text:
        #    return text
    return text

if __name__ == "__main__":
    main(sys.argv)
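
To illustrate why dropping the early return matters, here is a small comparison of the two loop shapes. It is a sketch under stated assumptions: `replace_all_early_return` is a hypothetical name for the earlier variant, the tiny `reps` dictionary is made up for the demonstration, and the output shown is not claimed to match the exact output reported in the question.

```python
def replace_all_early_return(text, reps):
    """Earlier variant: returns as soon as one replacement changes the text."""
    last = text
    for i, j in reps.items():
        text = text.replace(i, j)
        if last != text:
            return text
    # Falls through and implicitly returns None when nothing matched.


def replace_all(text, reps):
    """Revised variant: applies every replacement before returning."""
    for i, j in reps.items():
        text = text.replace(i, j)
    return text


reps = {"bcon": "bacon", "egs": "eggs"}
print(replace_all_early_return("bcon:egs", reps))  # bacon:egs
print(replace_all("bcon:egs", reps))               # bacon:eggs
```

With the early return, only the first matching key is applied to a cell, and a cell with no match comes back as None rather than unchanged.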

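I actually couldn't get your code example to do any replacing at all, so I am sure there is something different about how your CSVs are structured compared to what I was working with. Still, I think the problem is in your replace_all() function, because replacing text sequentially can be tricky. Here is the solution from that linked question, adapted to your function. Does this solve it for you?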

import re


def new_replace_all(text, reps):
    """Do a single-pass replacement from a dictionary"""
    pattern = re.compile(r'\b(' + '|'.join(reps.keys()) + r')\b')
    return pattern.sub(lambda x: reps[x.group()], text)
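
If the keys can contain regex metacharacters such as "|", one possible variation on this single-pass idea is to escape each key and try longer keys first. This is a hedged sketch rather than the code from the linked question: `escaped_replace_all` is a hypothetical name, and the word-boundary anchors are dropped here, so keys may also match inside longer words.

```python
import re


def escaped_replace_all(text, reps):
    """Single-pass replacement that treats every key as literal text."""
    if not reps:
        return text  # an empty pattern would match everywhere
    # Longest keys first so a longer phrase wins over a shorter key it contains.
    keys = sorted(reps, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: reps[m.group()], text)


reps = {
    "Actuarial Science": "Accounting:Actuarial Science",
    "Anthropology": "Anthropology:General",
}
print(escaped_replace_all("Anthropology:Actuarial Science", reps))
# Anthropology:General:Accounting:Actuarial Science
```

Because the text is scanned once, a corrected value that happens to contain another key is not replaced again, which is the main advantage over chained `str.replace` calls.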
