Python remove non-latin textlines in csv

Question

I have a CSV file that contains text strings. Some of the lines are, for example, in Chinese or Russian.

What I want to do is use Python to count the number of Unicode and ASCII characters in each text line. If the ratio of ASCII characters to all characters is over 90%, I want to keep the line; if not, I want to remove it from the CSV.

The idea behind this is to remove all non-Latin languages but keep, for example, the German umlauts, which is why I want to use the ratio approach.

Does anyone have an idea how to solve this task?

Thank you very much!

Here is an example of my CSV data:

She wants to ride my BMW the go for a ride in my BMW lol http://t.co/FeoNg48AQZ
RT @YuaElena: Ð‘Ð°Ð±ÑƒÑˆÐºÐ° Ð»Ð°ÑÐºÐ¾Ð²Ð¾ Ð³Ð¾Ð²Ð¾Ñ€Ð¸Ñ‚ 5-Ð»ÐµÑ‚Ð½ÐµÐ¼Ñƒ Ð¢Ñ‘Ð¼Ð¾Ñ‡ÐºÐµ: - Ð¡Ð¼Ð¾Ñ‚Ñ€Ð¸, Ð¢ÐµÐ¼Ð¸Ðº, Ð²Ð¾Ð½ ÐµÐ´ÐµÑ‚ "Ð±Ð¸-Ð±Ð¸". - Ð‘Ð¾Ð³ Ñ Ñ‚Ð¾Ð±Ð¾Ð¹, Ð±Ð°Ð±ÐºÐ°, ÑÑ‚Ð¾-Ð¶ BMW 335xi 4x4.

That should give you an idea of what my data looks like.

Answer 1

The Latin range ends with \u00ff, so all you have to do is remove the characters in the range \u0100-\uffff using a regexp and then compare the new line length to the original one.

That said, it might be more useful to use re.sub(r'[\u0100-\uffff]', "?", line) to keep the line and replace all unwanted characters with ?.
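
The length comparison could be sketched roughly as follows. This assumes Python 3 and lines that are already decoded to str; the helper name latin_ratio and the sample strings are made up for illustration, and the 0.9 threshold is taken from the question.

```python
import re

# Code points above U+00FF (the end of the Latin-1 Supplement block),
# up to the end of the Basic Multilingual Plane.
NON_LATIN = re.compile('[\u0100-\uffff]')

def latin_ratio(line):
    """Fraction of the line that survives stripping code points above U+00FF."""
    if not line:
        return 0.0  # treat empty lines as non-Latin instead of dividing by zero
    stripped = NON_LATIN.sub('', line)
    return len(stripped) / len(line)

lines = [
    'She wants to ride my BMW',
    'Gr\u00fc\u00dfe aus M\u00fcnchen',                 # umlauts sit below U+0100
    '\u0411\u0430\u0431\u0443\u0448\u043a\u0430 BMW',   # mostly Cyrillic
]
kept = [line for line in lines if latin_ratio(line) > 0.9]
```

Note that the range stops at \uffff, so characters outside the Basic Multilingual Plane (many emoji, for instance) are not matched by this pattern.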

Answer 2

Your best bet is probably to use the unicodedata module. The solution is a bit resource-intensive because it looks up the Unicode name of every character in the string.

import unicodedata

def compute_ratio(input_str):
    '''
    Return the ratio of Latin letters to all non-whitespace characters.
    '''
    input_str = "".join(input_str.split())  # Remove whitespace.
    if not input_str:
        return 0.0  # Avoid dividing by zero on empty or blank lines.
    num_latin = 0
    for char in input_str:
        # unicodedata.name() raises ValueError for unnamed characters
        # unless a default is supplied, so fall back to "".
        if unicodedata.name(char, "").startswith("LATIN"):
            num_latin += 1
    return num_latin / len(input_str)

Here's a usage example with your input data. saved_Output is a list containing all the valid lines.

>>> lines = '''She wants to ride my BMW the go for a ride in my BMW lol http://t.co/FeoNg48AQZ
RT @YuaElena: Ð‘Ð°Ð±ÑƒÑˆÐºÐ° Ð»Ð°ÑÐºÐ¾Ð²Ð¾ Ð³Ð¾Ð²Ð¾Ñ€Ð¸Ñ‚ 5-Ð»ÐµÑ‚Ð½ÐµÐ¼Ñƒ Ð¢Ñ‘Ð¼Ð¾Ñ‡ÐºÐµ: - Ð¡Ð¼Ð¾Ñ‚Ñ€Ð¸, Ð¢ÐµÐ¼Ð¸Ðº, Ð²Ð¾Ð½ ÐµÐ´ÐµÑ‚ "Ð±Ð¸-Ð±Ð¸". - Ð‘Ð¾Ð³ Ñ Ñ‚Ð¾Ð±Ð¾Ð¹, Ð±Ð°Ð±ÐºÐ°, ÑÑ‚Ð¾-Ð¶ BMW 335xi 4x4.'''
>>> saved_Output = []
>>> for line in lines.split('\n'):
...     if compute_ratio(line) > 0.95:
...         saved_Output.append(line)
...

>>> "\n".join(saved_Output)
''
>>> compute_ratio('She wants to ride my BMW the go for a ride in my BMW lol http://t.co/FeoNg48AQZ')
0.890625
>>> # A ratio of 0.95 seems too high even for your first line.
>>> compute_ratio('this is a long string')
1.0
>>> compute_ratio(u"c'est une longue cha\xeene")
0.95
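
To apply the ratio to the CSV file from the question, something along these lines might work. It is only a sketch: it assumes Python 3, a UTF-8 encoded file, and that the text sits in the first column. The name filter_csv, the column index, and the 0.8 threshold are hypothetical choices, and the ratio helper is repeated so the block runs on its own.

```python
import csv
import unicodedata

def compute_ratio(text):
    """Ratio of Latin letters to all non-whitespace characters."""
    text = "".join(text.split())
    if not text:
        return 0.0
    latin = sum(1 for ch in text
                if unicodedata.name(ch, "").startswith("LATIN"))
    return latin / len(text)

def filter_csv(src_path, dst_path, column=0, threshold=0.8, encoding="utf-8"):
    """Copy rows whose text column is mostly Latin; return how many were kept."""
    kept = 0
    with open(src_path, newline="", encoding=encoding) as src, \
         open(dst_path, "w", newline="", encoding=encoding) as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            # Rows that are too short to have the text column are dropped.
            if len(row) > column and compute_ratio(row[column]) > threshold:
                writer.writerow(row)
                kept += 1
    return kept
```

The file has to be decoded with its real encoding. In mis-decoded text like the sample above, Cyrillic shows up as characters such as Ð and Ñ, which Unicode names as Latin letters and which would therefore count toward the ratio.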

solution1
1 2013-09-03 07:35:46

solution2
0 ACCEPTED 2013-09-03 07:54:38
