
Python: remove non-Latin text lines from a CSV file

I have a CSV file that contains text as strings. Some of the text lines are, for example, in Chinese or Russian.

What I want to do is use Python to count the ASCII and non-ASCII (Unicode) characters in each text line. If ASCII characters make up more than 90% of the line, I want to keep it; otherwise I want to remove it from the CSV.

The idea is to remove all non-Latin languages but keep, for example, the German umlauts, which is why I want to use a ratio rather than a strict ASCII check.

Does anyone have an idea how to solve this?

Thank you very much!

Here is an example of my CSV data:

She wants to ride my BMW the go for a ride in my BMW lol http://t.co/FeoNg48AQZ
RT @YuaElena: Бабушка лаÑково говорит 5-летнему Тёмочке: - Смотри, Темик, вон едет "би-би". - Бог Ñ Ñ‚Ð¾Ð±Ð¾Ð¹, бабка, Ñто-ж BMW 335xi 4x4.

That should give you an idea of what my data looks like.

The Latin-1 range ends at \u00ff, so all you have to do is remove the characters in the range \u0100-\uffff with a regexp and then compare the new line length to the original one.

That said, it might be more useful to use re.sub(r'[\u0100-\uffff]', "?", line) to keep the line and replace all unwanted characters with ?.
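
As a rough sketch of the length comparison described above, assuming Python 3 (where every str is Unicode). The names NON_LATIN1, latin_ratio and keep_line, and the 0.9 default threshold taken from the question, are illustrative choices and not part of the original answer:

```python
import re

# Characters above U+00FF within the Basic Multilingual Plane.
# Assumption: characters beyond U+FFFF (e.g. most emoji) are not matched
# by this range and therefore still count as "kept" characters.
NON_LATIN1 = re.compile("[\u0100-\uffff]")

def latin_ratio(line):
    """Return the share of characters in line that survive the removal."""
    if not line:
        return 1.0  # Nothing to remove from an empty line.
    stripped = NON_LATIN1.sub("", line)
    return len(stripped) / len(line)

def keep_line(line, threshold=0.9):
    """Keep the line if at least `threshold` of its characters are Latin-1."""
    return latin_ratio(line) >= threshold
```

With this sketch, a German line with umlauts keeps a ratio of 1.0 because ü and ß sit below U+00FF, while a mostly Cyrillic line falls well under the threshold.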

Your best bet is probably to use the unicodedata module. The solution is somewhat resource-intensive because it looks up the Unicode name of every character in the string.

import unicodedata

def compute_ratio(input_str):
    '''
    Return the ratio of Latin letters to all non-whitespace characters in input_str.
    '''
    # Python 3: str is already Unicode, so no unicode() conversion is needed.
    input_str = "".join(input_str.split())  # Remove whitespace.
    if not input_str:
        return 0.0  # Avoid dividing by zero on empty or whitespace-only lines.
    num_latin = 0
    for char in input_str:
        # unicodedata.name() returns the default "" for characters without a name.
        if unicodedata.name(char, "").startswith("LATIN"):
            num_latin += 1
    return num_latin / len(input_str)

Here's a usage example with your input data. saved_Output is a list that collects every line that passes the check.

>>> lines = '''She wants to ride my BMW the go for a ride in my BMW lol http://t.co/FeoNg48AQZ
RT @YuaElena: Бабушка лаÑково говорит 5-летнему Тёмочке: - Смотри, Темик, вон едет "би-би". - Бог Ñ Ñ‚Ð¾Ð±Ð¾Ð¹, бабка, Ñто-ж BMW 335xi 4x4.'''
>>> saved_Output = []
>>> for line in lines.split('\n'):
...     if compute_ratio(line) > 0.95:
...         saved_Output.append(line)
...

>>> "\n".join(saved_Output)
''
>>> compute_ratio('She wants to ride my BMW the go for a ride in my BMW lol http://t.co/FeoNg48AQZ')
0.890625
>>> # A ratio of 0.95 seems too high even for your first line.
>>> compute_ratio('this is a long string')
1.0
>>> compute_ratio(u"c'est une longue cha\xeene")
0.95
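
To apply a ratio check like this to the CSV file from the question, something along these lines might work. It is only a sketch for Python 3: the helper names (latin_letter_ratio, filter_rows, filter_csv), the assumption that the text sits in the first column, the UTF-8 encoding, and the 0.8 threshold are all illustrative and would need adjusting to the real file:

```python
import csv
import unicodedata

def latin_letter_ratio(text):
    """Share of non-whitespace characters whose Unicode name starts with LATIN."""
    chars = "".join(text.split())
    if not chars:
        return 0.0
    latin = sum(1 for ch in chars if unicodedata.name(ch, "").startswith("LATIN"))
    return latin / len(chars)

def filter_rows(rows, text_column=0, threshold=0.8):
    """Yield only the non-empty rows whose text column is mostly Latin letters."""
    for row in rows:
        if row and latin_letter_ratio(row[text_column]) >= threshold:
            yield row

def filter_csv(src_path, dst_path, text_column=0, threshold=0.8):
    """Copy src_path to dst_path, dropping rows that fail the ratio check."""
    # Assumption: both files are UTF-8 and every row has the text column.
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        csv.writer(dst).writerows(
            filter_rows(csv.reader(src), text_column, threshold)
        )
```

Because digits, punctuation and URLs do not count as Latin letters, ordinary English tweets can score below 0.9 (the first sample line scores 0.890625), which is why the sketch defaults to a lower threshold.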
