Using a Python dictionary to find/replace strings in a CSV with multiple strings to replace per cell
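Note: this is a revised and refined version of my original question, which I hope is clearer than my first attempt. I'm new to programming and am trying to create a script that performs a series of specific find-and-replace operations on a CSV, using another CSV table as the correction guide (e.g. chikn becomes chicken, bcon becomes bacon).
So in the simple case: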
chikn,1,a
bcon,2,b
egs,3,c
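becomes
chicken,1,a
bacon,2,b
eggs,3,c
So far, with the code below, I have built a dictionary from the input CSV and can apply most of the corrections to the target CSV as expected in the simple case. The real challenge, however, is that the actual dataset usually has 1-3 entries per cell (separated by a common delimiter, ":"), and many of them contain spaces (i.e. they are phrases rather than single words). Building on the previous example with an updated dictionary, this would be:
Starting with:
chk sandwich:egs,1,a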
bcon,2,b
Bcon:egs,3,c
Should end with:
chicken sandwich:eggs,1,a
bacon,2,b
bacon:eggs,3,c
Instead, my current output drops the latter part and prints
chicken sandwich,1,a
bacon,2,b
bacon,3,c
Code:
#!/usr/bin/env python
"""A script for finding and replacing values in CSV files.
"""
import csv
import re
import sys


def main(args):
    """Execute the transformation script.

    Args:
        args (list of `str`): The command line arguments.
    """
    transform(args[1], args[2], create_reps(args[3]), int(args[4]))


def transform(infile, outfile, reps, column):
    """Write a new CSV file with replaced text.

    Args:
        infile (str): the sheet of original text with errors
        outfile (str): the sheet with the revised text with corrections in place of errors
        reps (dict): dictionary of error word and corrected word
        column (int): the column (0 based) the word revisions will take place in
    """
    with open(infile) as csvfile:
        with open(outfile, 'w') as w:
            spamreader = csv.reader(csvfile)
            spamwriter = csv.writer(w)
            for row in spamreader:
                row[column] = new_replace_all(row[column], reps)
                spamwriter.writerow(row)


def create_reps(infile):
    """Create reps object to use as reference dictionary for transform.

    Args:
        infile (str): The sheet of original and corrected words used to
            generate dictionary
    Returns:
        (dict): a dictionary listing the error words and their
            corrections
    """
    reps = {}
    with open(infile) as csvfile:
        dictreader = csv.reader(csvfile)
        for row in dictreader:
            reps[row[0]] = row[1]
    return reps


# def replace_all(text, reps):
#     """Original Version: Iterate through `reps` and replace key => value in `text`.
#
#     Args:
#         text (str): The text to search and replace.
#         reps (dict): Search for `key` and replace with `value`
#     Returns:
#         (str): The string with the replacements.
#     """
#     last = text
#     for i, j in reps.items():
#         text = text.replace(i, j)
#         if last != text:
#             return text


def new_replace_all(text, reps):
    """Updated Version: Do a single-pass replacement from a dictionary"""
    pattern = re.compile(r'\b(' + '|'.join(reps.keys()) + r')\b')
    return pattern.sub(lambda x: reps[x.group()], text)


if __name__ == "__main__":
    main(sys.argv)
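Thanks in advance for your time and support. I look forward to your guidance!
Best.
---------------- UPDATE 4/5/18 ----------------
With great support from HFBrowing, I have been able to modify this code to work with the sample dataset originally provided. In my actual application, however, I found that it still crashes when exposed to some of the more complex string matches in the dataset. I welcome any suggestions on how to fix this, and have provided some examples and the error below.
Ideally, items in a given cell that are linked by "|" would stay together, while items in a given cell that are linked by ":" would be treated as separate strings and replaced individually.
So if:
"A|first" = "A1" and "B|first" = "B1"
then
"A|first:B|first" should be converted to "A1:B1".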
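One way to sketch the rule described above, assuming every ":"-separated entry must match a dictionary key exactly. The helper name `replace_entries` and its `sep` parameter are hypothetical and not part of the script in this question:

```python
def replace_entries(cell, reps, sep=":"):
    """Replace each sep-delimited entry of a cell via exact dictionary lookup.

    Entries that are not keys of `reps` are left unchanged. The "|" character
    is not treated specially, so "A|first" is looked up as one whole key.
    """
    entries = cell.split(sep)
    return sep.join(reps.get(entry.strip(), entry) for entry in entries)


# Hypothetical dictionary taken from the example in the text.
reps = {"A|first": "A1", "B|first": "B1"}
print(replace_entries("A|first:B|first", reps))  # A1:B1
```

Because this relies on plain dictionary lookups rather than a regular expression, characters such as "|" in the keys need no escaping. The trade-off is that it only replaces whole entries, not words inside a longer entry.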
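Using this more complex string data, I have provided a sample, the expected output, the current output, and the error I received.
Sample dictionary
Error Word,Correct Word
Actuarial Science,Accounting:Actuarial Science
Anthropology,Anthropology:General
Undeclared,Undecided
Information Technology and Administrative Management|Administrative Management Specialization,Information Technology and Administrative Management:Administrative Management Specialization
Biology,Biology
Sample input
Major,ID,Last
Actuarial Science,111,Smith
Anthropology,222,Bob
Anthropology:Actuarial Science,333,Johnson
Information Technology and Administrative Management|Administrative Management Specialization,444,Frank
555,Undeclared
Current output error: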
Traceback (most recent call last):
  File "myscript3.py", line 89, in <module>
    main(sys.argv)
  File "myscript3.py", line 21, in main
    transform(args[1], args[2], create_reps(args[3]), int(args[4]))
  File "myscript3.py", line 41, in transform
    row[column] = new_replace_all(row[column], reps)
  File "myscript3.py", line 68, in new_replace_all
    return pattern.sub(lambda x: reps[x.group()], text)
  File "myscript3.py", line 68, in <lambda>
    return pattern.sub(lambda x: reps[x.group()], text)
KeyError: 'Information Technology and Administrative Management'
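The KeyError is consistent with the "|" inside a dictionary key being read as regex alternation, so the pattern matches only the part before the pipe, which is not itself a key. The following is a minimal sketch of that behaviour and of escaping the keys with `re.escape`. It assumes a one-entry dictionary modeled on the sample row, and the exact key text is an assumption that may not match the real data:

```python
import re

# Hypothetical one-entry dictionary modeled on the sample row above.
key = ("Information Technology and Administrative Management"
       "|Administrative Management Specialization")
reps = {key: key.replace("|", ":")}
text = key

# Unescaped: the "|" inside the key splits the pattern into two alternatives,
# so the match is only the text before the pipe, which is not a key of reps.
unescaped = re.compile(r'\b(' + '|'.join(reps.keys()) + r')\b')
partial = unescaped.search(text).group()
print(partial in reps)  # False, so reps[partial] would raise KeyError

# Escaped: re.escape makes "|" match a literal pipe character, so the whole
# key is matched and the dictionary lookup succeeds.
escaped = re.compile(r'\b(' + '|'.join(re.escape(k) for k in reps) + r')\b')
print(escaped.sub(lambda m: reps[m.group()], text))
```

Escaping only addresses special characters in the keys; it does not change how ":"-separated entries are handled.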
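Current output CSV
Major,ID,Last
Accounting:Actuarial Science,111,Sumeri
Anthropology:General,222,Nelson
Anthropology:General;Accounting:Actuarial Science,333,Newman
----------------------- UPDATE 4/6/18: SOLVED -----------------------
Hi all,
Thank you all for your support. At a colleague's suggestion, I modified the original "replace_all" code as follows. It now seems to work as expected in my context.
Thanks again for your time and support!
Code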
#!/usr/bin/env python
"""A script for finding and replacing values in CSV files.

Example::
    ./myscript school-data.csv outfile-data.csv replacements.csv 4
"""
import csv
import sys


def main(args):
    """Execute the transformation script.

    Args:
        args (list of `str`): The command line arguments.
    """
    transform(args[1], args[2], create_reps(args[3]), int(args[4]))


def transform(infile, outfile, reps, column):
    """Write a new CSV file with replaced text.

    Args:
        infile (str): the sheet of original text with errors
        outfile (str): the sheet with the revised text with corrections in
            place of errors
        reps (dict): dictionary of error word and corrected word
        column (int): the column (0 based) the word revisions will take place
            in
    """
    with open(infile) as csvfile:
        with open(outfile, 'w') as w:
            spamreader = csv.reader(csvfile)
            spamwriter = csv.writer(w)
            for row in spamreader:
                row[column] = replace_all(row[column], reps)
                spamwriter.writerow(row)


def create_reps(infile):
    """Create reps object to use as reference dictionary for transform.

    Args:
        infile (str): The sheet of original and corrected words used to
            generate dictionary
    Returns:
        (dict): a dictionary listing the error words and their
            corrections
    """
    reps = {}
    with open(infile) as csvfile:
        dictreader = csv.reader(csvfile)
        for row in dictreader:
            reps[row[0]] = row[1]
    return reps


def replace_all(text, reps):
    """Iterate through `reps` and replace key => value in `text`.

    Args:
        text (str): The text to search and replace.
        reps (dict): Search for `key` and replace with `value`
    Returns:
        (str): The string with the replacements.
    """
    last = text
    for i, j in reps.items():
        text = text.replace(i, j)
        # if last != text:
        #     return text
    return text


if __name__ == "__main__":
    main(sys.argv)
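Actually, I couldn't get your code sample to work for replacing anything, so I'm sure the structure of my test CSVs differs from yours. Still, I think the problem lies in your replace_all() function, because replacing text sequentially can be tricky. Here is the solution from that linked question, adapted to your function. Does this solve the problem for you?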
import re

def new_replace_all(text, reps):
    """Do a single-pass replacement from a dictionary"""
    # One alternation pattern built from all keys; \b limits matches to whole words.
    pattern = re.compile(r'\b(' + '|'.join(reps.keys()) + r')\b')
    return pattern.sub(lambda x: reps[x.group()], text)
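To illustrate why sequential replacement can be tricky, here is a small sketch comparing the two approaches. The dictionary is made up for the illustration, and `sequential_replace` and `single_pass_replace` are hypothetical names. The single-pass variant also escapes the keys and tries longer keys first, which the function above does not do:

```python
import re


def sequential_replace(text, reps):
    """Apply each replacement in turn; later keys also see earlier output."""
    for old, new in reps.items():
        text = text.replace(old, new)
    return text


def single_pass_replace(text, reps):
    """Replace all keys in one pass over the original text only."""
    # Longest keys first, so a long key wins over a shorter key it contains.
    keys = sorted(reps, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: reps[m.group()], text)


# Made-up dictionary: the replacement for "Biology" contains the key "Science".
# Relies on insertion-ordered dicts (Python 3.7+).
reps = {"Biology": "Science:Biology", "Science": "Science:General"}

print(sequential_replace("Biology", reps))   # Science:General:Biology
print(single_pass_replace("Biology", reps))  # Science:Biology
```

In the sequential version the result depends on the order of the dictionary entries, because the output of one replacement is fed into the next. The single-pass version never re-scans replaced text.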