
How to remove all special characters in a large French text file

Given a large French text file (>200GB), encoded in UTF-8 and normalised to Unicode NFC, I want to remove all special characters except accented/unaccented alphabetical letters, numbers and punctuation, using Python, Bash or whichever method is fastest. Previously, I did this task manually by scanning the text for special characters that I did not want and removing them by character code, like this:

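import re

def remove_special_chars(text):
    # Remove each unwanted character by its code point.
    text = re.sub(chr(65533), '', text)  # U+FFFD replacement character
    text = re.sub(chr(9658), '', text)   # ►
    text = re.sub(chr(9660), '', text)   # ▼
    text = re.sub(chr(169), '', text)    # ©
    return text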

� (char code 65533) ► (char code 9658) ▼ (char code 9660) © (char code 169) etc.

However, for a large text file, it is no longer possible to do it that way. Therefore, I am thinking of removing all of the special characters by checking whether each character is an (accented/unaccented) alphabetical letter, a number or a punctuation mark, and removing it if it is not. I tried the following, but the command does not run.

grep -P -v '[^a-zA-Z0-9 àâäèéêëîïôœùûüÿçÀÂÄÈÉÊËÎÏÔŒÙÛÜŸÇ!"#\$%&\'\(\)\*\+,\\-\./:;<=>\?@\[\]\^_`\{\|\}\~]' file

Could you please help me with this problem? Thank you in advance for your help!

All the characters you want to remove belong to the Unicode category Symbol, Other (So).

In Python, you can install the regex module from PyPI, add

import regex

And then change the contents like this:

text = regex.sub(r'\p{So}+', '', text)

In Linux, you may do that with a Perl one-liner:

perl -i -CSD -Mutf8 -pe 's/\p{So}+//g' file

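The -i option modifies the file in place. -CSD makes Perl treat the standard streams and the files it opens as UTF-8, and -Mutf8 does the same for the script text itself; they are there because I believe your file is UTF-8 encoded.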
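If installing the third-party regex module is not an option, roughly the same So filter can be sketched with the standard library only. This is a swapped-in technique (a str.translate deletion table built from unicodedata), not the regex call above; the names SO_TABLE, strip_so and strip_so_file and the file paths are hypothetical. The set of So characters depends on the Unicode version shipped with your Python, so it may differ slightly from what Perl or regex match.

```python
import sys
import unicodedata

# Map every code point in category "So" (Symbol, other) to None so that
# str.translate deletes it. The table is built once by scanning all of Unicode.
SO_TABLE = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "So"
}


def strip_so(text: str) -> str:
    """Return text with every "So" character removed."""
    return text.translate(SO_TABLE)


def strip_so_file(src_path: str, dst_path: str) -> None:
    """Stream src_path to dst_path line by line (both paths are placeholders).

    errors="replace" turns undecodable bytes into U+FFFD, which is itself
    in category "So" and is therefore removed as well.
    """
    with open(src_path, encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(strip_so(line))
```

Reading line by line keeps memory use flat regardless of file size, which matters more than raw speed for a 200GB input.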

I am assuming that your text uses the French Canadian code page, which is cp863. One "hacky" method that avoids regex is the following.

# this ignores any characters that are not in the standard french character page
text = "abcdeefghijkàâäèéêëîïôœùûüÿçÀÂÄÈÉÊËÎÏÔ►�▼©".encode("cp863", "ignore")
print(text.decode('cp863'))

# outputs:
# abcdeefghijkàâèéêëîïôùûüçÀÂÈÉÊËÎÏÔ
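The output above already shows that some legitimate French letters (ä, œ, ÿ) are lost along with the symbols. Before relying on this trick, it may be worth checking which of the letters you care about the codec cannot represent. A minimal sketch, where the helper name unencodable is hypothetical and the letter list is the one from the question:

```python
# Accented letters taken from the whitelist in the question.
french_letters = "àâäèéêëîïôœùûüÿçÀÂÄÈÉÊËÎÏÔŒÙÛÜŸÇ"


def unencodable(chars: str, codec: str = "cp863") -> str:
    """Return the characters of `chars` that `codec` silently drops."""
    # Round-trip through the codec; "ignore" discards what cannot be encoded.
    kept = chars.encode(codec, "ignore").decode(codec)
    return "".join(ch for ch in chars if ch not in kept)


# Prints the letters that would disappear from the text.
print(unencodable(french_letters))
```

Any letter printed here would be deleted from the corpus by the encode/decode round trip.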

I would use the unicodedata module, which is part of the standard library, so it should already be available on your system.

You should loop over every character ch, call unicodedata.category(ch), and check whether the category is one you want to keep or one you want to discard.

Unicode publishes the General Category values: https://www.unicode.org/reports/tr44/tr44-6.html#General_Category_Values

I would keep L* (letters), N* (numbers), P* (punctuation), and Zs (space). I would change the other Z* categories into a space. I would also change all remaining characters into a space, but save the affected lines to a file, so you can check whether you need to adapt the rules.

Note: you may also restrict or transform other characters (e.g. opening parenthesis into just normal parenthesis, etc.) according to your use case.

Note: the above suggestion will also remove $ (a currency symbol, category Sc); you may adapt it.
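The rules above could be sketched roughly as follows. The names clean_line and clean_file and the three file paths are hypothetical, and keeping newline, carriage return and tab untouched is my own assumption (they are control characters, category Cc, and would otherwise be turned into spaces and flag every line). Because the file is NFC-normalised, accented letters are single code points in L*; in decomposed text the combining accents (M*) would be replaced.

```python
import unicodedata
from typing import Tuple

KEEP_MAJOR = {"L", "N", "P"}   # letters, numbers, punctuation
KEEP_CONTROLS = "\n\r\t"       # assumption: preserve line structure and tabs


def clean_line(line: str) -> Tuple[str, bool]:
    """Return (cleaned line, True if any character outside the rules was replaced)."""
    out = []
    flagged = False
    for ch in line:
        cat = unicodedata.category(ch)
        if cat[0] in KEEP_MAJOR or cat == "Zs" or ch in KEEP_CONTROLS:
            out.append(ch)
        elif cat[0] == "Z":
            out.append(" ")    # line/paragraph separators become a plain space
        else:
            out.append(" ")    # symbols, controls, marks, etc.
            flagged = True
    return "".join(out), flagged


def clean_file(src_path: str, dst_path: str, log_path: str) -> None:
    """Stream the file and save the original of every flagged line for review."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst, \
         open(log_path, "w", encoding="utf-8") as log:
        for line in src:
            cleaned, flagged = clean_line(line)
            dst.write(cleaned)
            if flagged:
                log.write(line)
```

To keep currency symbols such as $, one option is to add a check like cat == "Sc" to the first branch.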
