Given a large French text file (>200 GB), encoded in UTF-8 and normalised to Unicode NFC, I want to remove all special characters except accented and unaccented letters, digits and punctuation, using Python, Bash or whichever method is fastest. Previously, I did this manually by scanning the text for unwanted special characters and removing them by character code, like this:
import re

def remove_special_chars(text):
    # Each call deletes one unwanted character, identified by its code point
    text = re.sub(chr(65533), '', text)
    text = re.sub(chr(9658), '', text)
    text = re.sub(chr(9660), '', text)
    text = re.sub(chr(169), '', text)
    return text
� (char code 65533), ► (char code 9658), ▼ (char code 9660), © (char code 169), etc.
However, for a large text file this is no longer feasible. I am therefore thinking of checking each character and removing it unless it is an (accented or unaccented) letter, a digit or a punctuation mark. I tried the following, but the command does not run:
grep -P -v '[^a-zA-Z0-9 àâäèéêëîïôœùûüÿçÀÂÄÈÉÊËÎÏÔŒÙÛÜŸÇ!"#\$%&\'\(\)\*\+,\\-\./:;<=>\?@\[\]\^_`\{\|\}\~]' file
Could you please help me on this problem? Thank you in advance for your help!
All the characters you want to remove belong to the Unicode category "Symbol, other" (So).
In Python, you can install the regex module from PyPI, add
import regex
and then change the function body like this:
text = regex.sub(r'\p{So}+', '', text)
In Linux, you may do that with a Perl one-liner:
perl -i -CSD -Mutf8 -pe 's/\p{So}+//g' file
The -i option modifies the file in place; -CSD -Mutf8 are there because I believe your file is UTF-8 encoded.
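If installing the third-party regex module is not an option, a similar effect can be sketched with the standard library alone. The snippet below is only an illustration: it builds a str.translate table for every code point that unicodedata reports as So, and the names SO_TABLE and remove_other_symbols are made up for this example.

```python
import sys
import unicodedata

# Map every code point in the "Symbol, other" (So) category to None,
# which tells str.translate to delete it. Building the table scans the
# whole Unicode range once, so do it a single time at start-up.
SO_TABLE = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "So"
}

def remove_other_symbols(text):
    """Delete all So characters from text (illustrative helper name)."""
    return text.translate(SO_TABLE)

print(remove_other_symbols("prix \u00a9 2020 \u25ba fin"))
```

The exact set of So characters depends on the Unicode version bundled with your Python, so results may differ slightly from Perl's \p{So}.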
I am assuming that your text uses the French Canadian code page, cp863. One "hacky" method that does not need a regex is the following:
# Encoding with errors="ignore" drops every character that cp863 cannot represent
text = "abcdeefghijkàâäèéêëîïôœùûüÿçÀÂÄÈÉÊËÎÏÔ►�▼©".encode("cp863", "ignore")
print(text.decode('cp863'))
# outputs
abcdeefghijkàâèéêëîïôùûüçÀÂÈÉÊËÎÏÔ
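The output above is missing ä, œ, ÿ and Ä as well as the symbols, so this approach also loses some legitimate French letters. Before relying on it, it may be worth checking which characters of a sample would be dropped. The following is a rough sketch; dropped_by_codec is a hypothetical helper, not part of any library.

```python
def dropped_by_codec(text, codec="cp863"):
    """Return the distinct characters of text that codec cannot encode,
    in the order they first appear."""
    dropped = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            if ch not in dropped:
                dropped.append(ch)
    return dropped

# Lists the characters an encode(..., "ignore") round trip would lose
print(dropped_by_codec("c\u0153ur \u00a9"))
```

Running it on a few megabytes of the real file should show quickly whether the code page is too lossy for your data.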
I would use the unicodedata module, which is part of the standard library, so it should already be on your system.
Loop over every character, call unicodedata.category(ch) on it, and check whether its category is one you want to keep or one you want to discard.
Unicode publishes the general category values: https://www.unicode.org/reports/tr44/tr44-6.html#General_Category_Values
I would keep L* (letters), N* (numbers), P* (punctuation) and Zs (space). I would change the other Z* characters into a space, and I would also change all remaining characters into a space, while saving the affected line to a file so that you can check whether the rules need adapting.
Note: you may also restrict or transform other characters (e.g. map variant opening parentheses to the normal parenthesis) according to your use case.
Note: the above suggestion will also remove $ (a currency symbol, category Sc); you may adapt it.
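The rules above could be sketched roughly as follows. This is a minimal illustration rather than a tuned solution: the function names, the keep_exact parameter and the three file paths are assumptions made for the example. The file is read line by line, so the 200 GB input is never loaded into memory at once.

```python
import unicodedata

KEEP_MAJOR = {"L", "N", "P"}  # letters, numbers, punctuation
KEEP_EXACT = {"Zs"}           # space separators; add "Sc" to keep $ and other currency signs

def clean_line(line, keep_exact=KEEP_EXACT):
    """Replace every character outside the kept categories with a space."""
    out = []
    for ch in line:
        cat = unicodedata.category(ch)
        if cat[0] in KEEP_MAJOR or cat in keep_exact:
            out.append(ch)
        else:
            out.append(" ")
    return "".join(out)

def clean_file(src_path, dst_path, log_path):
    """Stream src_path into dst_path, logging every line that was changed."""
    with open(src_path, encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8") as dst, \
         open(log_path, "w", encoding="utf-8") as log:
        for line in src:
            body = line.rstrip("\n")  # the newline itself is category Cc
            cleaned = clean_line(body)
            if cleaned != body:
                log.write(body + "\n")  # keep the original for later review
            dst.write(cleaned + "\n")

print(clean_line("a\u00a9b"))
```

Be aware that ASCII characters such as +, <, = and > are in category Sm, so they are replaced too unless you add that category to the kept set. A pure-Python loop over 200 GB will be slow; splitting the file into chunks and processing them in parallel is one possible mitigation.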