
Python regex to insert a newline after every second '॥'

I have a file of Unicode text that follows a pattern like this:

a unicode string1 । b unicode string2 ॥ १ ॥ c unicode string3 । d unicode string4 ॥ २ ॥

Here '१' and '२' are Unicode characters (Devanagari digits), so they do not respond to a numerical query. There is a space between '॥' and '२'.

The text contains no newlines or other breaks. I want a newline after every alternate '॥', so that the result looks like this:

a unicode string1 । b unicode string2 ॥ १ ॥ 
c unicode string3 । d unicode string4 ॥ २ ॥

I tried a few regexes but could not get it to work with my limited knowledge of regex. A sample of my code, which adds a newline after every '॥', is below.

import csv

txt_file = "/path/to/file/file_name.txt"
csv_file = "mycsv.csv"

regex = "॥"

with open(txt_file,'r+') as fr, open('vc','r+') as fw:
    for line in fr:
        fw.write(line.replace(regex,  "॥\n"))

It gives a result like this:

a unicode string1 । b unicode string2 ॥ 
१ ॥ 
c unicode string3 । d unicode string4 ॥ 
२ ॥

Welcome to the confusing world of regex...

I suggest using the re module, which can easily handle what you want to do. For example:

import re

text = "a unicode string1 । b unicode string2 ॥ १ ॥ c unicode string3 । d unicode string4 ॥ २ ॥"

pattern = '(॥ .{1} ॥ )'

new = re.sub(pattern,
             lambda m: m.groups()[0][:-1] + '\n',
             text)
print(new)

>> a unicode string1 । b unicode string2 ॥ १ ॥ 
   c unicode string3 । d unicode string4 ॥ २ ॥

A bit of explanation:

  1. pattern is a regular expression defining the '॥ [any character] ॥' sequence you want to place a newline after. The .{1} means 'any single character' (a plain . would do the same). The pattern also includes the space after the second ॥, so that the space is replaced by the \n and doesn't hang around at the start of the next line. The whole pattern is placed in parentheses, which makes it a single regex 'group'.
  2. This pattern is used in re.sub, which replaces all instances of it in the given string. In this case, you want to replace it with what was originally there, plus a newline character. This happens in the lambda function.
  3. The lambda function replaces the matched group with itself ( m.groups()[0] ), after trimming off the trailing space ( [:-1] ) and adding a newline character ( + '\n' ).

There might be a simpler way of doing this that doesn't involve using groups... but this works!
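One possibly simpler variant, sketched here under the assumption that the verse number between the two '॥' marks never contains a space: a backreference in the replacement string can do the same job as the lambda. The name add_breaks is made up for this illustration and is not from the answer above.

```python
import re

def add_breaks(text):
    # Hypothetical helper: match '॥ <number> ॥' plus the space that follows,
    # then put back the captured marker (\1) with a newline instead of the space.
    return re.sub(r'(॥ \S+ ॥) ', r'\1\n', text)

text = "a unicode string1 । b unicode string2 ॥ १ ॥ c unicode string3 । d unicode string4 ॥ २ ॥"
print(add_breaks(text))
```

The final '॥ २ ॥' is left alone because no space follows it, so no trailing newline is added.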

Your code finds every instance of '॥' and puts a newline after each one. You may want to rewrite your loop to search for a more specific string.

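# Plain string to search for (str.replace does not use regex).
marker = '॥ १ ॥'

# Assumes the file is UTF-8 encoded.
with open("newTextFile.txt", "r", encoding="utf-8") as txt_file:
    rawFileString = txt_file.read()

rawFileString = rawFileString.replace(marker, marker + '\n')

print(rawFileString)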

From here the string contains the new lines, and you can write it to a new file, etc.

Note: this works because your text file follows a pattern, but a literal string like this only covers one verse number. If you have something more complicated you may need to do several replacements or other modifications to the text to get the result you want.
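To cover every verse number with a single replacement, one option is a regex using \d, which in Python 3 str patterns also matches Devanagari digits such as '१'. The sketch below assumes UTF-8 files. The names break_after_verse_numbers and convert_file, and any paths passed to them, are placeholders and not part of the question.

```python
import re

# In Python 3, \d in a str pattern matches any Unicode decimal digit,
# so it covers Devanagari digits such as '१' and '२'.
VERSE_END = re.compile(r'(॥ \d+ ॥) ?')

def break_after_verse_numbers(text):
    # Hypothetical helper: end the line after every '॥ <number> ॥' marker,
    # swallowing the single space that may follow it.
    return VERSE_END.sub(r'\1\n', text)

def convert_file(src_path, dst_path):
    # Assumes both files are UTF-8; the paths are supplied by the caller.
    with open(src_path, encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        dst.write(break_after_verse_numbers(src.read()))

sample = "a string1 । b string2 ॥ १ ॥ c string3 । d string4 ॥ २ ॥"
print(break_after_verse_numbers(sample))
```

Because the trailing space is optional in the pattern, the last marker also gets a newline, so the output ends with a line break.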

Edit: Although this method can get messy, you can avoid very complicated regex by slicing substrings out of the text at the indexes where find locates the delimiter.

Given the way your file appears to be patterned, this may work for you:

delimiter = '॥'

# Assumes the file is UTF-8 encoded.
with open("newTextFile.txt", "r", encoding="utf-8") as txt_file:
    rawFileString = txt_file.read()

startOfText = 0

# Each output line ends at the second delimiter found from startOfText.
instance1 = rawFileString.find(delimiter, startOfText)
instance2 = rawFileString.find(delimiter, instance1 + 1)

# find returns -1 when nothing is left, so loop until a pair is no longer found.
while instance1 != -1 and instance2 != -1:
    # Slice up to and including the closing delimiter.
    substring = rawFileString[startOfText:instance2 + len(delimiter)]
    print(substring.strip())
    # Continue searching just past the closing delimiter.
    startOfText = instance2 + len(delimiter)
    instance1 = rawFileString.find(delimiter, startOfText)
    instance2 = rawFileString.find(delimiter, instance1 + 1)
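The question's literal request, a newline after every second '॥', can also be sketched without regex or index arithmetic: split on the delimiter and rebuild the text two pieces at a time. This is only an illustration. The name break_every_second is hypothetical, and it assumes the delimiters mostly come in pairs as in the sample.

```python
def break_every_second(text, delimiter='॥'):
    # Hypothetical helper: splitting on the delimiter gives one more piece
    # than there are delimiters in the text.
    pieces = text.split(delimiter)
    n_pairs = (len(pieces) - 1) // 2  # number of complete delimiter pairs

    lines = []
    for k in range(n_pairs):
        first, second = pieces[2 * k], pieces[2 * k + 1]
        # Re-attach both delimiters of the pair and end the line there.
        lines.append((first + delimiter + second + delimiter).strip())

    # Whatever follows the last complete pair is kept as a final line.
    tail = delimiter.join(pieces[2 * n_pairs:]).strip()
    if tail:
        lines.append(tail)
    return '\n'.join(lines)

sample = "a unicode string1 । b unicode string2 ॥ १ ॥ c unicode string3 । d unicode string4 ॥ २ ॥"
print(break_every_second(sample))
```

Unlike the pattern-based approaches, this one does not care what sits between the two delimiters, only that every second one closes a line.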

There is also another way to solve this: note that '॥ ' followed by an alphabetic character always marks a place where a newline should be inserted.

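import re

s = 'unicode string1 । b unicode string2 ॥ १ ॥ c unicode string3 । d unicode string4 ॥ २ ॥'
# Split on the space between '॥' and a lowercase ASCII letter. The lookbehind
# and lookahead keep both neighbours, so no text is lost in the split.
occurrences = re.split(r'(?<=॥) (?=[a-z])', s)
for item in occurrences:
    print(item.strip())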


 