
Removing symbols from a large unicode text file

I have a text file of approximately 2GB that contains Unicode text. I tried to remove all symbols using the following code:

import re
symbols = re.compile(r'[{} &+( )" =!.?.:.. / |  » © : >< #  «  ,] 1 2 3 4 5 6 7 8 9 _ - + ; [ ]  %',flags=re.UNICODE)

with open('/home/corpus/All12.txt','a') as t:
    with open('/home/corpus/All11.txt', 'r') as n:
        data = n.readline()          
        data = symbols.sub(" ", data)          
        t.write(data)

A small file for testing the code:

:621   

"

    :621       "
    :621               :1                ;"
     _            "         :594            :25   4   8   0        :23          "സര്‍ക്കാര്‍ജീവനക്കാരുടെ ശമ്പളം അറിയാന്‍ ഭാര്യമാര്‍ക്ക് അവകാശമുണ്ട്വിവരാവകാശകമ്മീഷന്‍    
    :621            :4   0   3   0  ;"
     _           "         :551             :16        :3  " 

     :12     :70                ;"                  "             "     =""                   "               "     =""                     "            "     ="" +    


     _                       "         :541             :26       :30   45   5   35  " 
 ='                  'ന്യൂഡല്‍ഹി: സര്‍ക്കാര്‍ജീവനക്കാരായ ഭര്‍ത്താക്കന്മാരുടെ ശമ്പളം 

The desired output is ന്യൂഡല്‍ഹി സര്‍ക്കാര്‍ജീവനക്കാരായ ഭര്‍ത്താക്കന്മാരുടെ ശമ്പളം . The code does not work; it freezes my computer.

Can I solve this problem without regular expressions?

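Put every symbol you want to remove inside a single pair of square brackets [] , and escape the characters that are special there: the brackets [ and ] themselves and the backslash \ . The single quote ' also needs a backslash because the pattern is written as a single-quoted string. The regex is r'[-0-9{}&+()"=!.?:/|»©><#«,_+;%\[\]@$*\'\\^`~\n\t]' .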

Demo:

>>> import re
>>> st='1234567890-=[]\\;,./\'!@#$%^&*()_+{}|":<>?//.,`~ajshgasd'
>>> print(re.sub(r'[-0-9{}&+()"=!.?:/|»©><#«,_+;%\[\]@$*\'\\^`~\n\t]','',st))
ajshgasd

On file:

>>> with open('file.txt','r',encoding='utf8') as fp:
...     for line in fp:
...         if line.strip() == '': continue  # skip blank lines
...         print(re.sub(r'[-0-9{}&+()"=!.?:/|»©><#«,_+;%\[\]@$*\'\\^`~]','',line).strip(), end=' ')
... 
    ന്യൂഡല്‍ഹി സര്‍ക്കാര്‍ജീവനക്കാരായ ഭര്‍ത്താക്കന്മാരുടെ ശമ്പളം

For writing output to a file use this code:

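import re

pattern = re.compile(r'[-0-9{}&+()"=!.?:/|»©><#«,_+;%\[\]@$*\'\\^`~]')
# the with statement closes both files even if an error occurs
with open('file.txt','r',encoding='utf8') as fp, open('outfile.txt','w',encoding='utf8') as of:
    for line in fp:
        if line.strip() == '': continue  # skip blank lines
        rline = pattern.sub('',line).strip()  # strip() removes leading and trailing whitespace
        if rline == '': continue # skip lines that are empty after cleaning
        of.write(rline+'\n')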

str.translate can be used instead of re.sub . It takes a mapping from Unicode ordinals to replacements and returns the translated string. If a replacement is None, the character is deleted. str.maketrans can be used to generate the mapping.

In Python 3, also remember to specify the encoding of the files. I used UTF-8 for testing:

#!python3
#coding: utf8
symbols = ' {}&+()"=!.?.:../|»©:><#«,123456789_-+;[]%'
D = str.maketrans('','',symbols)
with open('All12.txt','a',encoding='utf8') as t, open('All11.txt','r',encoding='utf8') as n:
    for line in n:
        t.write(line.translate(D))

Just list whatever symbols you want to delete in symbols .
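To see what the translation table does before running it over the whole file, here is a small in-memory sketch. The sample line is made up for illustration (it only imitates the question's test file); `symbols` is the same string as above.

```python
# Minimal sketch: delete a fixed set of characters with str.translate.
symbols = ' {}&+()"=!.?.:../|»©:><#«,123456789_-+;[]%'

# With three arguments, every character of the third string maps to None,
# which tells translate() to delete it.
table = str.maketrans('', '', symbols)

word = 'ശമ്പളം'                      # text that should survive
sample = ':621 ;" _ ' + word + ': 25'  # made-up line with symbols around it
cleaned = sample.translate(table)
print(cleaned)
```

Because `symbols` begins with a space, the spaces between words are deleted as well. Drop that leading space from `symbols` if the words should stay separated.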

Alternatively, you can read the file in blocks of characters, which will be more efficient than reading over 10 million lines individually. For a 2GB file, that means reading, for example, 20 or so 100MB blocks instead.

#!python3
#coding: utf8
symbols = ' {}&+()"=!.?.:../|»©:><#«,123456789_-+;[]%'
D = str.maketrans('','',symbols)
with open('All12.txt','a',encoding='utf8') as t, open('All11.txt','r',encoding='utf8') as n:
    while True:
        block = n.read(100*1024*1024)
        if not block:
            break
        t.write(block.translate(D))

Ref: str.translate , str.maketrans
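The block-reading loop can also be wrapped in a small function, which makes it easy to try on a tiny file first. This is only a sketch: the function name `strip_symbols`, the temporary file names, and the default block size are illustrative assumptions, not part of the answer above. The output is opened with 'w' rather than 'a' here so that repeated runs do not append to earlier results. Because the files are opened in text mode, `read(n)` returns up to `n` decoded characters, so a block never ends in the middle of a multi-byte UTF-8 sequence.

```python
import os
import tempfile

SYMBOLS = ' {}&+()"=!.?.:../|»©:><#«,123456789_-+;[]%'
TABLE = str.maketrans('', '', SYMBOLS)

def strip_symbols(src_path, dst_path, block_chars=1024 * 1024):
    """Copy src_path to dst_path, deleting every character in SYMBOLS."""
    with open(src_path, 'r', encoding='utf8') as src, \
         open(dst_path, 'w', encoding='utf8') as dst:
        # iter(callable, sentinel) keeps calling read() until it returns ''.
        for block in iter(lambda: src.read(block_chars), ''):
            dst.write(block.translate(TABLE))

# Try it on a throwaway file with a deliberately tiny block size.
word = 'ശമ്പളം'
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, 'in.txt')
    dst_path = os.path.join(tmp, 'out.txt')
    with open(src_path, 'w', encoding='utf8') as f:
        f.write(':621 ;" _ ' + word + ' :25\n')
    strip_symbols(src_path, dst_path, block_chars=4)
    with open(dst_path, encoding='utf8') as f:
        result = f.read()
print(repr(result))
```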

The part of the regex after the list of symbols between the first [ and ] makes no sense to me. It will not strip symbols on their own; it will only remove a symbol that is followed by '1 2 3 4 5 6 7 8 9 _ - + ; [ ] %'. In other words, re.sub will not do anything. But anyway, your code runs on 3.4.2, Win7.

import re
symbols = re.compile(r'[{} &+( )" =!.?.:.. / |  » © : >< #  «  ,]'
                     '1 2 3 4 5 6 7 8 9 _ - + ; [ ]  %',flags=re.UNICODE)
text = ('''" :621 " :621 :1 ;" _ " :594 :25 4 8 0 :23'''
        '''"സര്‍ക്കാര്‍ജീവനക്കാരുടെ ശമ്പളം അറിയാന്‍'''
        '''ഭാര്യമാര്‍ക്ക് അവകാശമുണ്ട്വിവരാവകാശകമ്മീഷന്‍\n'''
        ''':621 :4 0 3 0 ;" _ " :551 :16 :3 " ''')
data = symbols.sub(" ", text)          
print(data == text)  # True
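For comparison, a minimal sketch of the same idea with every symbol placed inside a single character class, so that each symbol matches on its own. The exact set of characters (including the digit 0, which the question's list leaves out) and the sample text are assumptions for illustration.

```python
import re

# All characters to remove live inside one [...] class. '[' and ']' are
# escaped, and '-' is placed first so it is a literal hyphen rather than
# a range operator. Spaces are deliberately left out of the class.
symbols = re.compile(r'[-{}&+()"=!.?:/|»©><#«,0-9_;\[\]%]')

word = 'ശമ്പളം'
text = ':621 ;" _ ' + word + ' :25 4 8 0'
data = symbols.sub('', text)
print(data != text)   # True: the substitution now changes the text
print(data.split())   # only the word is left once whitespace is split away
```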

PS. with statements can have multiple clauses (to save indent levels).

with open('/home/corpus/All12.txt','a') as t,\
     open('/home/corpus/All11.txt', 'r') as n:
    ...  # same body as before

[{} &+( )" =!.?.:.. / |  » © : >< #  «  , 1 2 3 4 5 6 7 8 9 _ - + ; \[ \]  %]

Try this. Replace with an empty string. See the demo:

http://regex101.com/r/oE6jJ1/18

import re
# Python 3: the u prefix is unnecessary, and ur'...' is a syntax error
p = re.compile(r'[{} &+( )" =!.?.:.. / | » © : >< # « , 1 2 3 4 5 6 7 8 9 _ - + ; \[ \] %]', re.IGNORECASE | re.UNICODE)
test_str = " :621 \" :621 :1 ;\" _ \" :594 :25 4 8 0 :23 \"സര്‍ക്കാര്‍ജീവനക്കാരുടെ ശമ്പളം അറിയാന്‍ ഭാര്യമാര്‍ക്ക് അവകാശമുണ്ട്വിവരാവകാശകമ്മീഷന്‍\n:621 :4 0 3 0 ;\" _ \" :551 :16 :3"
subst = ""

result = re.sub(p, subst, test_str)

Solution WITHOUT REGEX:

You can use the map function along with a set of symbols you want to remove to accomplish this.

def removeSymbols(text,symbols):
    return "".join(map(lambda x: "" if x in symbols else x,text))

>>> string = '''" :621 \" :621 :1 ;\" _ \" :594 :25 4 8 0 :23 \"സര്‍ക്കാര്‍ജീവനക്കാരുടെ ശമ്പളം അറിയാന്‍ ഭാര്യമാരക് അവകാശമുണ്ട്വിവരാവകാശകമ്മീഷന്‍\n:621 :4 0 3 0 ;\" _ \" :551 :16 :3"'''    

>>> symbols = set('[{} &+( )" =!.?.:.. / |  » © : >< #  «  ,] 1 2 3 4 5 6 7 8 9 _ - + ; [ ]  %')

>>> cleanString = removeSymbols(string,symbols)

>>> print(cleanString)

0സര്‍ക്കാര്‍ജീവനക്കാരുടെശമ്പളംഅറിയാന്‍ഭാര്യമാരക്അവകാശമുണ്ട്വിവരാവകാശകമ്മീഷന്‍
00

I think your regular expression is not correct, and it can also be simplified. For example, the sub-expression [{} &+( )" =!.?.:.. / | » © : >< # « ,] can be simplified to [ !"#&()+,./:<=>?{|}©«»] by keeping each character only once. This works because [] denotes a set of characters. Take a look at the chapter "Regular expression operations" in the Python documentation: https://docs.python.org/3.4/library/re.html

In the title of your message, you wrote: "Removing symbols from a large unicode text file", so I think that you have a set of characters you want to remove from your file.

To simplify your set of symbols, you can try:

>>> symbols = "".join(frozenset(r'[{} &+( )" =!.?.:.. / |  » © : >< #  «  ,] 1 2 3 4 5 6 7 8 9 _ - + ; [ ]  %'))
>>> print(symbols)
! #"%&)(+-,/.132547698»:=<?>[];_|©{}«

That way you can simply write:

symbols = '! #"%&)(+-,/.132547698»:=<?>[];_|©{}«'
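One caveat with the set-based approach: the iteration order of a set is arbitrary, so the printed string can come out in a different order from one run to the next. A minimal variation, assuming a stable order is wanted, is to sort the characters before joining them (the name `raw` is introduced here only for illustration):

```python
raw = r'[{} &+( )" =!.?.:.. / |  » © : >< #  «  ,] 1 2 3 4 5 6 7 8 9 _ - + ; [ ]  %'

# A set drops the duplicates; sorted() then fixes the order by code point.
symbols = "".join(sorted(set(raw)))
print(symbols)
print(len(raw), '->', len(symbols))
```

The resulting string contains the same characters as the one above, only in a reproducible order.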

Note for readers: it is not obvious, but all the strings here are Unicode strings. I think the author uses Python 3. For Python 2.7 users, the best way is to declare the "utf8" source encoding and use the u"" syntax, like this:

# -*- coding: utf8 -*-
symbols = u'! #"%&)(+-,/.132547698»:=<?>[];_|©{}«'

Alternatively, you can import unicode_literals, and drop the "u" prefix:

# -*- coding: utf8 -*-
from __future__ import unicode_literals
symbols = '! #"%&)(+-,/.132547698»:=<?>[];_|©{}«'

If you want to write a regular expression that matches one symbol, you have to escape the characters with special meanings (for example, "[" should be escaped as "\["). The best way is to use the re.escape function.

>>> import re
>>> symbols = '! #"%&)(+-,/.132547698»:=<?>[];_|©{}«'
>>> regex = "[{0}]".format(re.escape(symbols))
>>> print(regex)
[\!\ \#\"\%\&\)\(\+\-\,\/\.132547698\»\:\=\<\?\>\[\]\;\_\|\©\{\}\«]
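The escaped output shown above comes from an older Python 3 release, where re.escape put a backslash before every non-alphanumeric character. Since Python 3.7 it escapes only characters that can be special in a regular expression, so the printed pattern looks different there, but it should match the same characters. A minimal check, assuming only the `symbols` string from above:

```python
import re

symbols = '! #"%&)(+-,/.132547698»:=<?>[];_|©{}«'
regex = re.compile("[{0}]".format(re.escape(symbols)))

# Every character of `symbols` should match the class on its own ...
all_match = all(regex.fullmatch(ch) for ch in symbols)
# ... and characters outside it, such as '0' or a letter, should not.
others_rejected = regex.fullmatch('0') is None and regex.fullmatch('a') is None
print(all_match, others_rejected)
```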

Give it a try:

import re

symbols = '! #"%&)(+-,/.132547698»:=<?>[];_|©{}«'
regex = "[{0}]+".format(re.escape(symbols))

example = '''" :621 " :621 :1 ;" _ " :594 :25 4 8 0 :23 "സര്‍ക്കാര്‍ജീവനക്കാരുടെ ശമ്പളം അറിയാന്‍ ഭാര്യമാര്‍ക്ക് അവകാശമുണ്ട്വിവരാവകാശകമ്മീഷന്‍
:621 :4 0 3 0 ;" _ " :551 :16 :3 "'''

# flags must be passed by keyword: the fourth positional argument of re.sub is count
print(re.sub(regex, "", example, flags=re.UNICODE))

Note that zero isn't in the symbols set but the space is, so the result will be:

'''0സര്‍ക്കാര്‍ജീവനക്കാരുടെശമ്പളംഅറിയാന്‍ഭാര്യമാര്‍ക്ക്അവകാശമുണ്ട്വിവരാവകാശകമ്മീഷന്‍
00'''

I think the correct symbols set is: !#"%&)(+-,/.0132547698»:=<?>[];_|©{}« . Then you can strip each line to remove leading and trailing whitespace.

So this code snippet should work for you:

import re

symbols = '!#"%&)(+-,/.0132547698»:=<?>[];_|©{}«'
regex = "[{0}]+".format(re.escape(symbols))
sub_symbols = re.compile(regex, re.UNICODE).sub

with open('/home/corpus/All12.txt', 'a', encoding='utf8') as t:
    with open('/home/corpus/All11.txt', 'r', encoding='utf8') as n:
        for line in n:  # loop over every line; a single readline() reads only the first
            data = sub_symbols("", line).strip()
            if data:  # skip lines that are empty after cleaning
                t.write(data + '\n')

Have you considered decoding the unicode such as:

line = line.decode('utf_8')

then re-encoding to let's say... ascii while ignoring characters it doesn't know such as:

line = line.encode('ascii', 'ignore')

Not sure that's any faster or better. Regular expressions are slow, but I don't know empirically that this is better. It's pretty easy though ;)

Probably O(2n) complexity (combined), but a long regular expression might be just as bad.

UPDATE: This is wrong as pointed out below.
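A short sketch of why the ASCII round-trip does not fit this task: with errors='ignore', every non-ASCII character is dropped, which includes the Malayalam text that should be kept, while the ASCII symbols and digits survive. The sample line is made up for illustration, and the code is written for Python 3, where a str is encoded directly.

```python
word = 'ശമ്പളം'                      # the kind of text that should be kept
line = ':621 ;" _ ' + word + ' :25'  # made-up line with symbols around it

# Encoding to ASCII with errors='ignore' silently drops every non-ASCII character.
ascii_only = line.encode('ascii', 'ignore').decode('ascii')
print(repr(ascii_only))
```

The result keeps exactly the characters that were supposed to go and loses the text itself, the opposite of the desired output.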


 