
Regex for Python to change a set of characters

I have a file with Unicode characters in a pattern like this:

a unicode string1 । b unicode string2 ॥ १ ॥ c unicode string3 । d unicode string4 ॥ २ ॥

Here '१' and '२' do not respond to a numerical query because they are Unicode characters. There is a space between '॥' and '२'.

At the moment there is no newline and no break. I want a newline after every alternate '॥' so that I get a pattern like this:

a unicode string1 । b unicode string2 ॥ १ ॥ 
c unicode string3 । d unicode string4 ॥ २ ॥

I tried a few regexes but could not achieve it with my poor knowledge of regex. A sample of my code, which puts a newline after every '॥', is below.

import csv

txt_file = "/path/to/file/file_name.txt"
csv_file = "mycsv.csv"

regex = "॥"

with open(txt_file,'r+') as fr, open('vc','r+') as fw:
    for line in fr:
        fw.write(line.replace(regex,  "॥\n"))

It gives a result like this:

a unicode string1 । b unicode string2 ॥ 
१ ॥ 
c unicode string3 । d unicode string4 ॥ 
२ ॥

Welcome to the confusing world of regex...

I suggest using the re library, which can easily handle what you want to do. For example:

import re

text = "a unicode string1 । b unicode string2 ॥ १ ॥ c unicode string3 । d unicode string4 ॥ २ ॥"

pattern = '(॥ .{1} ॥ )'

new = re.sub(pattern,
             lambda m: m.groups()[0][:-1] + '\n',
             text)
print(new)

>> a unicode string1 । b unicode string2 ॥ १ ॥ 
   c unicode string3 । d unicode string4 ॥ २ ॥

A bit of explanation:

  1. pattern is a regular expression defining the '॥ [any character] ॥' pattern you want to place a newline after. The .{1} means 'any single character', and I've left a space after the second ॥ so that the trailing space is part of the match and doesn't hang around at the start of the next line. The whole pattern is placed in parentheses, identifying it as a single regex 'group'.
  2. This pattern is used in re.sub, which replaces all instances of it in the given string. In this case, you want to replace it with what was originally there, plus a newline marker. This happens in the lambda function.
  3. The lambda function replaces the matched group with itself ( m.groups()[0] ), after trimming off the trailing space ( [:-1] ) and adding a newline character ( + '\n' ).

There might be a simpler way of doing this that doesn't involve using groups... but this works!
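One possibly simpler variant, as a sketch: re.sub also accepts a replacement string containing a backreference, so the lambda can be dropped. This still uses a group, and it assumes, as in the sample, that each verse number is a single character between the two '॥' marks and is followed by a space. The name result is only illustrative.

```python
import re

text = "a unicode string1 । b unicode string2 ॥ १ ॥ c unicode string3 । d unicode string4 ॥ २ ॥"

# Capture '॥ <one character> ॥' and match (but do not capture) the space after it.
pattern = r'(॥ . ॥) '

# \1 re-inserts the captured group; the matched trailing space is replaced by a newline.
result = re.sub(pattern, r'\1\n', text)
print(result)
```

Because the space sits outside the group, nothing has to be trimmed afterwards.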

This happens because your code finds every instance of '॥' and puts a newline after it. You may want to rewrite your loop to look for a more specific pattern.

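regex = '॥ १ ॥'

# Read the whole file as UTF-8 so the Unicode characters decode correctly
with open("newTextFile.txt", "r", encoding="utf-8") as txt_file:
    rawFileString = txt_file.read()

# Put a newline after this specific marker only
rawFileString = rawFileString.replace(regex, '॥ १ ॥\n')

print(rawFileString)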

From here you get the new lines and can write this string to a new file, etc.

Note: this works because there is a pattern in your text file. If you have something more complicated, you may need to do several replacements or other modifications to the text to get the result you want.
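As a rough sketch of how the several replacements could be folded into one, the literal '॥ १ ॥' can be swapped for a regex that accepts any verse number. This assumes Python 3, where \d in a str pattern matches Unicode decimal digits such as '१' and '२'. The helper name break_after_verse_numbers and the sample string are hypothetical.

```python
import re

def break_after_verse_numbers(text):
    # Hypothetical helper: treats '॥ <digits> ॥' as the end of a verse.
    # \d+ matches one or more Unicode decimal digits, including '१' and '२'.
    # The optional trailing space is consumed so the next line does not start with it.
    return re.sub(r'(॥ \d+ ॥) ?', r'\1\n', text)

sample = "a ॥ १ ॥ b ॥ २ ॥ c ॥ १२ ॥"
print(break_after_verse_numbers(sample))
```

Unlike the single str.replace call, this covers every verse number in one pass, including multi-digit ones.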

Edit: Although this method can get messy, you can avoid very complicated regexes by building substrings from the indexes at which find locates the delimiter.

Given the way your file appears to be patterned, this may work for you:

# Read the whole file as UTF-8 so each '॥' is a single character in the string
with open("newTextFile.txt", "r", encoding="utf-8") as txt_file:
    rawFileString = txt_file.read()

startOfText = 0
delimiter = '॥'

# Locate the first pair of delimiters
instance1 = rawFileString.find(delimiter)
instance2 = rawFileString.find(delimiter, instance1 + 1)

# Stop as soon as a full pair of delimiters can no longer be found,
# instead of looping a fixed number of times
while instance1 != -1 and instance2 != -1:
    # Slice up to and including the second delimiter of the pair
    substring = rawFileString[startOfText:instance2 + len(delimiter)]
    print(substring)
    # Skip the delimiter and the space that follows it
    startOfText = instance2 + len(delimiter) + 1
    instance1 = rawFileString.find(delimiter, startOfText)
    instance2 = rawFileString.find(delimiter, instance1 + 1)

There is another way to solve this, relying on the fact that a newline is needed exactly where "॥ " is followed by an alphabetic character.

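import re

s = r'unicode string1 । b unicode string2 ॥ १ ॥ c unicode string3 । d unicode string4 ॥ २ ॥'
# Split where '॥ ' is followed by a letter; the lookahead keeps that letter in the next item
occurrences = re.split(r'॥ (?=[a-z])', s)
for item in occurrences[:-1]:
    # re.split drops the matched '॥', so put it back
    print(item.strip() + " ॥")
print(occurrences[-1].strip())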
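The original request, a newline after every second '॥', can also be expressed directly, without depending on what comes between or after the marks. The following is only a sketch under the assumption that '॥' never appears inside the verse text itself. The variable names are illustrative.

```python
import re

s = "a unicode string1 । b unicode string2 ॥ १ ॥ c unicode string3 । d unicode string4 ॥ २ ॥"

# (?:[^॥]*॥){2} matches text up to and including two consecutive '॥' marks;
# the optional space after the second mark is matched outside the group and dropped.
lines = re.sub(r'((?:[^॥]*॥){2}) ?', r'\1\n', s).splitlines()

for line in lines:
    print(line)
```

Any text left after the last complete pair of marks is kept unchanged at the end of the final line.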
