How to remove all special characters in a large French text file

Question

Given a large French text file (>200 GB), encoded in UTF-8 and normalised to Unicode NFC, I want to remove all special characters except accented and unaccented letters, digits and punctuation, using Python, Bash or whichever method is fastest. Previously, I did this manually: I scanned the text for special characters I didn't want and removed them by character code, like this:

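import re

def remove_special_chars(text):
    # Remove each unwanted character by its code point
    text = re.sub(chr(65533), '', text)  # U+FFFD replacement character
    text = re.sub(chr(9658), '', text)   # ►
    text = re.sub(chr(9660), '', text)   # ▼
    text = re.sub(chr(169), '', text)    # ©

    return text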

� (char code 65533), ► (char code 9658), ▼ (char code 9660), © (char code 169), etc.

However, this is no longer feasible for a large text file. I am therefore thinking of checking each character and removing it unless it is an (accented or unaccented) letter, a digit or a punctuation mark. I tried the following, but the command does not execute.

grep -P -v '[^a-zA-Z0-9 àâäèéêëîïôœùûüÿçÀÂÄÈÉÊËÎÏÔŒÙÛÜŸÇ!"#\$%&\'\(\)\*\+,\\-\./:;<=>\?@\[\]\^_`\{\|\}\~]' file

Could you please help me on this problem? Thank you in advance for your help!

Answer 1

All the characters in your example belong to the Unicode general category So (Symbol, other).

In Python, you can install the PyPI regex module (pip install regex), then add

import regex

And then change the contents like this:

text = regex.sub(r'\p{So}+', '', text)

In Linux, you may do that with a Perl one-liner:

perl -i -CSD -Mutf8 -pe 's/\p{So}+//g' file

The -i option modifies the file in place; -CSD and -Mutf8 are there because I believe your file is UTF-8 encoded.
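For a file of this size, the substitution would probably have to be applied line by line rather than to the whole text at once. Below is a minimal sketch of that idea using only the standard library, where the test unicodedata.category(ch) != "So" stands in for the \p{So} pattern. The function names and the file paths are hypothetical, and a per-character loop in Python is likely to be much slower than the Perl one-liner.

```python
import unicodedata

def strip_symbol_other(text):
    """Drop every character whose Unicode general category is So."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "So")

def clean_file(src_path, dst_path):
    """Stream src_path to dst_path one line at a time (hypothetical paths)."""
    # errors="replace" turns invalid bytes into U+FFFD, which is itself
    # in category So and is therefore removed by strip_symbol_other.
    with open(src_path, encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(strip_symbol_other(line))
```

Because the output goes to a second file, the original stays untouched until the result has been checked.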

Answer 2

I am assuming that your text uses the French Canadian code page, cp863. One "hacky" method that avoids regex is the following. Note that it also drops any letter cp863 lacks, such as ä, œ and ÿ in the output below.

# this ignores any characters that are not in the standard french character page
text = "abcdeefghijkàâäèéêëîïôœùûüÿçÀÂÄÈÉÊËÎÏÔ►�▼©".encode("cp863", "ignore")
print(text.decode('cp863'))

# outputs
abcdeefghijkàâèéêëîïôùûüçÀÂÈÉÊËÎÏÔ
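Before relying on this trick, it may be worth checking which of the letters you care about the code page cannot represent. A small sketch of such a check follows; the helper name unencodable and the sample string are illustrative, not part of the original answer.

```python
def unencodable(chars, codec="cp863"):
    """Return the characters in `chars` that `codec` cannot represent."""
    missing = []
    for ch in chars:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            missing.append(ch)
    return "".join(missing)

# Accented letters taken from the example string above
sample = "àâäèéêëîïôœùûüÿçÀÂÄÈÉÊËÎÏÔ"
print(unencodable(sample))
```

For the sample above this should report ä, œ, ÿ and Ä, which are exactly the letters missing from the output shown in the answer.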

Answer 3

I would use the unicodedata module, which is part of the standard library, so it should already be on your system.

Loop over every character, call unicodedata.category(ch) on it, and check whether its category is one you want to keep or discard.

Unicode publishes the general category values: https://www.unicode.org/reports/tr44/tr44-6.html#General_Category_Values

I would keep L* (letters), N* (numbers), P* (punctuation) and Zs (space). I would change the other Z* characters into a space. I would also change every other character into a space, but save the affected line to a file, so you can check whether the rules need adapting.

Note: you may also restrict or transform other characters (e.g. an opening parenthesis variant into a plain parenthesis), according to your use.

Note: the above suggestion will also remove $ (a currency symbol); you may adapt it.
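The rules above can be sketched roughly as follows. This is only one possible reading of them: the names clean_line and clean_file, the file paths, and the choice to keep the Sc category (so that $ survives) and the newline character are my assumptions, not part of the answer.

```python
import unicodedata

KEEP_MAJOR = {"L", "N", "P"}  # letters, numbers, punctuation
KEEP_EXACT = {"Zs", "Sc"}     # spaces; Sc is an assumed extra that keeps $

def clean_line(line):
    """Return (cleaned, changed); every unwanted character becomes a space."""
    out = []
    changed = False
    for ch in line:
        cat = unicodedata.category(ch)
        # "\n" has category Cc, so it is kept explicitly to preserve lines
        if cat[0] in KEEP_MAJOR or cat in KEEP_EXACT or ch == "\n":
            out.append(ch)
        else:
            out.append(" ")
            changed = True
    return "".join(out), changed

def clean_file(src_path, dst_path, log_path):
    """Stream the file and log every original line that was altered."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst, \
         open(log_path, "w", encoding="utf-8") as log:
        for line in src:
            cleaned, changed = clean_line(line)
            dst.write(cleaned)
            if changed:
                log.write(line)  # original line, for reviewing the rules
```

Note that several ASCII characters from the question's list, such as +, <, =, >, ^, | and ~, are in the S* categories too, so with these rules they would also become spaces unless their categories (Sm, Sk) or the characters themselves are added to the keep sets.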

3 answers

solution1
3 ACCEPTED 2019-08-19 09:45:36

solution2
1 2019-08-19 09:40:45

solution3
1 2019-08-19 14:19:51
