Conditional Search and Replace in a file Python

I have a large text file (over 10 MB) that needs a conditional search and replace. I want to replace every instance of "a" in the file with "ā" if the character after the "a" is "r", "m", "n" or "u".

For example: Input file

Hamro sano ghar holata.

Output file

Hāmro sāno ghār holata.

EDIT

Thanks guys, it seems to work well. But it doesn't seem to work with non-Latin characters such as Indic scripts.

Working script for Latin chars:

#!/usr/bin/env python
#-*- coding: utf-8 -*-
import re
input = "Hamro sano ghar holata."
regex = re.compile(ur'a([rmnu])')
print regex.sub(ur'ā\1', input)

Script1 (for Devanagari) NOT WORKING

#!/usr/bin/env python
#-*- coding: utf-8 -*-
import re
input ="संगम"
regex = re.compile(ur'ं([कखगघ])')
print regex.sub(r'ङ्\1', input)

Script2 (added unicode stuff) NOT WORKING

#!/usr/bin/env python
#-*- coding: utf-8 -*-
import re
input =u"संगम"
regex = re.compile(ur'ं([कखगघ])', re.UNICODE)
print regex.sub(r'ङ्\1', input)

Expected output: ं is replaced with ङ् because ग follows it, i.e. सङ्गम

You need a simple regular expression here. Something like this?

>>> import re
>>> input = "Hamro sano ghar holata."
>>> regex = re.compile(ur'a([rmnu])') # the part in parens is remembered
>>> print regex.sub(ur'ā\1', input) # replace by ā plus remembered part
Hāmro sāno ghār holata.

Edit:

Some background, first:

This is a much tougher task with Devanāgarī (देवनागरी), not because of the encoding, but because the rules for combining the glyphs are extremely complicated (at least, by the standards of Latin script). I'm writing this answer on Chrome, for example, which still can't compose the Devanāgarī for "Devanāgarī" correctly (it gets the diacritical mark for 'e' in the wrong place, and it does the same with the diphthong 'ai').

The ways these glyphs are combined by a text rendering engine are called 'ligatures', and for Devanāgarī they're very complicated from a technical point of view. If you add the further enormous complications introduced by संधि (saṃdhi; again, Chrome's rendering gets the bindu that represents the anusvāra in the wrong place), then you can see that what you're trying to do here can quickly get extremely difficult.

Having said all that, if your problem is limited to this simple case, then I think it can be done cleanly.

>>> import re
>>> inputString = u"संगम"
>>> regex = re.compile(ur'\u0902(?=[कखगघ])')
>>> print regex.sub(ur'ङ\u094d', inputString)
सङ्गम

In the regexes I've replaced the anusvāra and the virāma (Hindi: halant) with their Unicode-escaped values, for clarity. Given the way the ligatures work, it's possible this will miss some cases, but I've switched my example to using the lookahead, as in @Kabie's example (which is probably a better choice anyway), to mitigate this as far as possible.
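The `ur''` prefix used above only exists in Python 2. As a rough sketch, and assuming Python 3 (where every `str` is already Unicode), the same substitution could be written with all the Devanāgarī code points escaped. The constant names and the `replace_anusvara` helper are made up for illustration; they are not from the original answer.

```python
import re

# Code points spelled out so the pattern does not depend on how the
# source file is rendered or saved (names here are illustrative only).
ANUSVARA = "\u0902"  # ं
VIRAMA = "\u094d"    # ् (halant)
NGA = "\u0919"       # ङ

# Anusvāra, but only when the next character is one of क ख ग घ.
# The lookahead checks that character without consuming it.
pattern = re.compile(ANUSVARA + "(?=[\u0915\u0916\u0917\u0918])")


def replace_anusvara(text):
    """Replace anusvāra with ङ + virāma before क, ख, ग or घ."""
    return pattern.sub(NGA + VIRAMA, text)


print(replace_anusvara("संगम"))
```

An anusvāra followed by any other character is left alone, so this covers only the simple case discussed above, not saṃdhi in general.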

re.sub(r'a(?=[rmnu])',r'ā',"Hamro sano ghar holata.")
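For comparison, here is a small sketch (assuming Python 3) that puts the capture-group form and the lookahead form side by side. For this particular pattern they should give the same result, because the character that follows the "a" can never itself be the start of another match; the variable names are just for illustration.

```python
import re

text = "Hamro sano ghar holata."

# Capture group: the following letter is consumed by the match,
# so it has to be put back with \1 in the replacement.
with_group = re.sub(r"a([rmnu])", r"ā\1", text)

# Lookahead: the following letter is only inspected, not consumed,
# so the replacement is just the new character.
with_lookahead = re.sub(r"a(?=[rmnu])", "ā", text)

print(with_group)
print(with_lookahead)
```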

For your large text file, you should keep the original untouched, replace the characters, and write a new file with the updated lines. You should read just a chunk at a time, not the whole file. (Although on a modern computer you could just slurp the whole 10 MB in one go.)

An easy way to do this is to use the file object as an iterator; this returns one line from the file at a time.

import re
pat = re.compile(ur'a([rmnu])') # pre-compile regex pattern for speed

f = open("corrected_file.txt", "wb")

for line in open("big_file_10mb.txt", "rb"):
    line = pat.sub(ur'ā\1', line)
    f.write(line)

f.close()

If you want to slurp the whole file in one go, you can use the .read() method:

f = open("big_file_10mb.txt", "rb")
s = f.read()  # read entire file contents
f.close()
s = pat.sub(ur'ā\1', s)  # replace over entire file contents
f = open("corrected_file.txt", "wb")
f.write(s)  # write entire file contents
f.close()

Don't do it this way unless you have a good reason. The line-oriented version is easy to understand and works much better when files are large compared to the memory available on your computer.

The book Dive Into Python has a chapter explaining regular expressions:

http://diveintopython3.ep.io/regular-expressions.html

You want to read Unicode and replace Unicode characters. You will need to figure out the native encoding of the file, read it in, convert to Unicode, do the substitution, then write it out in the proper encoding. Or you can use the "codecs" module; codecs.open() will give you a file object that converts automatically for you.
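The manual route (decode, substitute, encode) might look roughly like this. It is a sketch assuming Python 3 and UTF-8 input; `convert_bytes` is a hypothetical helper name, and the encoding has to match whatever the file really uses.

```python
import re

pat = re.compile(r"a(?=[rmnu])")


def convert_bytes(raw, encoding="utf-8"):
    """Decode raw bytes, do the substitution on text, re-encode."""
    text = raw.decode(encoding)   # bytes -> Unicode text
    text = pat.sub("ā", text)     # substitution works on characters
    return text.encode(encoding)  # Unicode text -> bytes


print(convert_bytes("Hamro sano ghar holata.".encode("utf-8")).decode("utf-8"))
```

Doing the substitution on decoded text rather than on raw bytes means the regex sees whole characters, which matters as soon as the input contains anything outside ASCII.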

Here is the Unicode "how-to" document for Python:

http://docs.python.org/howto/unicode.html

So, let's assume that the text file you want to read is encoded in UTF-8. I think this will work for you:

import codecs
import re

pat = re.compile(ur'a([rmnu])') # pre-compile regex pattern for speed

f = codecs.open("corrected_file.txt", mode="wb", encoding="utf-8")

for line in codecs.open("big_file_10mb.txt", mode="rb", encoding="utf-8"):
    line = pat.sub(ur'ā\1', line)
    f.write(line)

f.close()
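The code above is Python 2. In Python 3 the built-in open() takes an encoding argument, so codecs.open() is not needed. A rough equivalent, assuming UTF-8 files, could look like this (`convert_file` is a hypothetical helper name, and the file names in the usage line are the ones from the example above):

```python
import re

pat = re.compile(r"a(?=[rmnu])")  # pre-compile regex pattern for speed


def convert_file(src_path, dst_path, encoding="utf-8"):
    """Copy src_path to dst_path line by line, applying the substitution."""
    # newline="" leaves the original line endings untouched.
    with open(src_path, "r", encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        for line in src:  # one line at a time, so memory use stays small
            dst.write(pat.sub("ā", line))
```

Usage would be something like convert_file("big_file_10mb.txt", "corrected_file.txt"). The with statement closes both files even if an error occurs part-way through.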
