How do I convert stylized Unicode text to plain text in Python so that I can find a specific word in web-scraping results?

Question

I am trying to scrape text from Instagram and check whether certain keywords appear in a user's bio. Some users write their bio in special fonts, so my search cannot identify the word. How can I remove the font or formatting from the text so that I can search for the word?

import re
test="𝙄𝙣𝙝𝙖𝙡𝙚 𝙩𝙝𝙚 𝙛𝙪𝙩𝙪𝙧𝙚 𝙩𝙝𝙚𝙣 𝙚𝙭𝙝𝙖𝙡𝙚 𝙩𝙝𝙚 𝙥𝙖𝙨𝙩. "


x = re.findall(re.compile('past'), test)
if x:    
    print("TEXT FOUND")
else:
    print("TEXT NOT FOUND")

TEXT NOT FOUND

Another example:

import re
test="ғʀᴇᴇʟᴀɴᴄᴇ ɢʀᴀᴘʜɪᴄ ᴅᴇsɪɢɴᴇʀ"
test=test.lower()

x = re.findall(re.compile('graphic'), test)
if x:    
    print("TEXT FOUND")
else:
    print("TEXT NOT FOUND")

TEXT NOT FOUND

Answer 1

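You can use unicodedata.normalize, which returns the normal form of a Unicode string. The NFKD form replaces compatibility characters, such as the styled mathematical letters in your first example, with their plain equivalents. For that example, see the following code snippet: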

import re
import unicodedata

test="𝙄𝙣𝙝𝙖𝙡𝙚 𝙩𝙝𝙚 𝙛𝙪𝙩𝙪𝙧𝙚 𝙩𝙝𝙚𝙣 𝙚𝙭𝙝𝙖𝙡𝙚 𝙩𝙝𝙚 𝙥𝙖𝙨𝙩. "
 
formatted_test = unicodedata.normalize('NFKD', test).encode('ascii', 'ignore').decode('utf-8')

x = re.findall(re.compile('past'), formatted_test)
if x:    
    print("TEXT FOUND")
else:
    print("TEXT NOT FOUND")

and the output will be:

TEXT FOUND
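The small-capital letters in the question's second example have no compatibility decomposition, so normalization alone leaves them unchanged. One possible workaround is a hand-written translation table, sketched below. The names SMALL_CAPS and plain_text are hypothetical, and the table is an assumption that only covers the look-alike letters appearing in that one bio; a real scraper would need a larger mapping.

```python
import unicodedata

# Hypothetical and incomplete: maps only the look-alike letters seen in the
# example bio to their plain ASCII counterparts.
SMALL_CAPS = str.maketrans("ᴀᴄᴅᴇғɢʜɪʟɴᴘʀ", "acdefghilnpr")

def plain_text(text):
    # NFKC first folds styled compatibility letters (bold, script, ...) to plain ones.
    normalized = unicodedata.normalize("NFKC", text)
    # Then replace the look-alikes that normalization does not touch,
    # and casefold so the search is case-insensitive.
    return normalized.translate(SMALL_CAPS).casefold()

bio = "ғʀᴇᴇʟᴀɴᴄᴇ ɢʀᴀᴘʜɪᴄ ᴅᴇsɪɢɴᴇʀ"
if "graphic" in plain_text(bio):
    print("TEXT FOUND")
else:
    print("TEXT NOT FOUND")
```

Applying NFKC before the table means the same helper also handles text like the first example, since the styled mathematical letters are folded by normalization and the table only has to cover what is left.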

Answer 2

Take care if you are dealing with text in Portuguese (or any language with accented letters). If you have:

string = """𝓿𝓲𝓫𝓻𝓪𝓷𝓽𝓮𝓼 orçamento"""

And you use:

unicodedata.normalize('NFKD', string).encode('ascii', 'ignore').decode('utf-8')

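You will lose the cedilla (ç): orçamento becomes orcamento. This happens because NFKD splits ç into c plus a combining cedilla, and encode('ascii', 'ignore') then discards the combining mark.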

Otherwise, if you use:

unicodedata.normalize('NFKC', string)

You will keep the cedilla.

Note that I changed NFKD to NFKC and also dropped the encode and decode calls.
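The difference between the two approaches can be checked side by side. This is a minimal sketch; the helper names to_ascii and to_plain are hypothetical and only wrap the two calls shown above.

```python
import unicodedata

def to_ascii(text):
    # NFKD splits "ç" into "c" plus a combining cedilla; encoding to ASCII
    # with errors="ignore" then drops the combining mark.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

def to_plain(text):
    # NFKC folds the styled letters to plain ones and keeps accented
    # characters in composed form.
    return unicodedata.normalize("NFKC", text)

string = "𝓿𝓲𝓫𝓻𝓪𝓷𝓽𝓮𝓼 orçamento"
print(to_ascii(string))  # cedilla is lost
print(to_plain(string))  # cedilla is kept
```

Which one to use depends on the keywords being searched: if the keyword list itself is written without accents, the ASCII variant may actually be the more convenient one.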
