简体   繁体   English

Python字符串文字,正则表达式和sed

[英]Python string literals, regex and sed

I am new to Python. 我是Python的新手。

I am importing a series of files into a sqlite3 database using a python script. 我正在使用python脚本将一系列文件导入sqlite3数据库。 Some of the raw files have spurious ^M characters that split the records into multiple lines. 一些原始文件具有伪造的^M字符,这些字符将记录分成多行。

The following sed command correctly removes the ^M and joins the two lines, creating a valid record. 下面的sed命令正确删除^M并将两行合并,从而创建有效记录。

sed -i '/^M^M$/ {s/^M//g;N;s/\\n//};' <file>

The ^M in the above are created with the CTRL+V CTRL+M sequence. 上面的^M是使用CTRL+V CTRL+M序列创建的。

The Python lines for the sed call are: sed调用的Python行是:

cmd = "sed -i '/\^M\^M$/ {s/\^M//g; N; s/\n////g; };' %s" % (file)
os.system(cmd)

I have attempted a variety of escape sequences (including the triple ''') in Python and get parsing errors, including unterminated address regex , unterminated 's' command and unknown option to 's' , and without escaping the ^M I get a hard-stop parsing error of SyntaxError: EOL while scanning string literal 我在Python中尝试了各种转义序列(包括三元组“'')并获得解析错误,包括unterminated address regexunterminated 's' commandunknown option to 's' ,而没有转义^M我得到了SyntaxError: EOL while scanning string literalSyntaxError: EOL while scanning string literal硬停止分析错误

How can I either 我怎么能

a) Encode the sed call so that it will execute properly when called with os.system(cmd) a)对sed调用进行编码,以便在使用os.system(cmd)调用时可以正确执行

or 要么

b) Perform the equivalent substitution in python directly (probably preferable, but I would want to be able to perform multiple types of corrections in one pass, not one pass per correction type). b)直接在python中执行等效替换(可能是更好的选择,但我希望能够一次通过多种校正类型,而不是每种校正类型一次通过)。

Thank you. 谢谢。

^M character is Carriage Return (CR) . ^M字符为Carriage Return (CR) It's the '\\r' character in python. 这是python中的'\\r'字符。

So, I guess, this should work fine: 因此,我想这应该可以正常工作:

cmd = "sed -i '/\r\r$/ {s/\r//g; N; s/\\n////g; };' %s" % (file)
os.system(cmd)

It would be much easier, particularly since you say you have multiple substitutions to perform, to do this entirely in Python. 完全用Python做到这一点要容易得多,特别是因为您说要执行多个替换。 The carriage return character is "\\r" . 回车符为"\\r"

Untested code for the task is as follows: 该任务的未经测试的代码如下:

replacements = (("\r", ""),
                ("one", "two"),
                ("three", "four"))
with open(filename, "r") as fin, open(filename+".new", "w") as fout:
    data = fin.read()
    for t1, t2 in replacements:
        data = data.replace(t1, t2)
    fout.write(data)

It then remains as an exercise for the reader to rename the output file to overwrite the input file. 然后,作为练习,读者可以重命名输出文件以覆盖输入文件。 note, by the way, that this code is explicitly designed to work with text files. 请注意,顺便说一句,该代码是专门设计用于文本文件的。 In Python 3 that would make a difference. 在Python 3中会有所作为。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM