
Python remove non-latin textlines in csv

I have a CSV file which contains text in the form of strings. Some text lines are, for example, in Chinese or Russian.

What I want to do is use Python to count the number of Unicode and ASCII characters in each text line. If the ratio of ASCII characters to all characters is over 90%, I want to keep the line; if not, I want to remove it from the CSV.

The idea behind this is to remove all non-Latin languages but keep, for example, the German umlauts; that is why I want to use a ratio-based solution.

Does anyone have an idea how to solve this task?

Thank you very much!

Here is an example of my CSV data:

She wants to ride my BMW the go for a ride in my BMW lol http://t.co/FeoNg48AQZ
RT @YuaElena: Бабушка лаÑково говорит 5-летнему Тёмочке: - Смотри, Темик, вон едет "би-би". - Бог Ñ Ñ‚Ð¾Ð±Ð¾Ð¹, бабка, Ñто-ж BMW 335xi 4x4.

So you should have an idea of what my data looks like.

The Latin-1 range ends at \u00ff, so all you have to do is remove characters in the range \u0100-\uffff using a regexp and then compare the new line length to the original one.

That said, it might be more useful to use re.sub(r'[\u0100-\uffff]', "?", line) to keep the line and replace all unwanted characters with ?.
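As a rough sketch of the length comparison described above (Python 3), something like the following could work. The helper name latin1_ratio and the constant NON_LATIN1 are made up for illustration; they are not part of the original answer.

```python
import re

# Matches every character from U+0100 up to U+FFFF, i.e. everything in the
# Basic Multilingual Plane above the Latin-1 block.
NON_LATIN1 = re.compile(r'[\u0100-\uffff]')

def latin1_ratio(line):
    """Return the share of characters that survive removing the non-Latin-1 range.

    Hypothetical helper: name and behaviour on empty input are assumptions.
    """
    if not line:
        return 0.0  # Avoid dividing by zero on empty lines.
    stripped = NON_LATIN1.sub('', line)
    return len(stripped) / len(line)

# German umlauts sit at or below U+00FF, so they are kept.
print(latin1_ratio('Grüße aus München'))
# Cyrillic letters are above U+00FF, so they are removed.
print(latin1_ratio('Бабушка BMW'))
# The replacement variant keeps the line and masks the unwanted characters.
print(NON_LATIN1.sub('?', 'BMW би'))
```

A line would then be kept when latin1_ratio(line) exceeds the chosen threshold, for example the 0.9 from the question. Note that the range stops at U+FFFF, so characters outside the Basic Multilingual Plane (most emoji, for instance) are not matched by this pattern.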

Your best bet is probably to use the unicodedata module. The solution is a bit resource intensive because it checks the Unicode name of each character in the string.

import unicodedata

def compute_ratio(input_str):
    '''
    Return the ratio of Latin letters to all non-whitespace characters
    in input_str (Python 3).
    '''
    num_latin = 0
    input_str = "".join(input_str.split())  # Remove whitespace.
    if not input_str:
        return 0.0  # Avoid dividing by zero on empty or blank lines.
    for char in input_str:
        # unicodedata.name() raises ValueError for characters without a
        # name, so pass a default instead of catching the exception.
        if unicodedata.name(char, "").startswith("LATIN"):
            num_latin += 1
    return num_latin / len(input_str)

Here's a usage example with your input data. saved_Output is a list containing all the lines which are valid.

>>> lines = '''She wants to ride my BMW the go for a ride in my BMW lol http://t.co/FeoNg48AQZ
RT @YuaElena: Бабушка лаÑково говорит 5-летнему Тёмочке: - Смотри, Темик, вон едет "би-би". - Бог Ñ Ñ‚Ð¾Ð±Ð¾Ð¹, бабка, Ñто-ж BMW 335xi 4x4.'''
>>> saved_Output = []
>>> for line in lines.split('\n'):
...     if compute_ratio(line) > 0.95:
...         saved_Output.append(line)
...

>>> "\n".join(saved_Output)
''
>>> compute_ratio('She wants to ride my BMW the go for a ride in my BMW lol http://t.co/FeoNg48AQZ')
0.890625
>>> # A ratio of 0.95 seems too high even for your first line.
>>> compute_ratio('this is a long string')
1.0
>>> compute_ratio(u"c'est une longue cha\xeene")
0.95
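To apply this to the CSV file itself, a minimal sketch might look like the following. It assumes Python 3 and a UTF-8 encoded file. The function name filter_latin_rows, its parameters, and the 0.85 default threshold are all assumptions made for illustration; compute_ratio is repeated in compact form so the block runs on its own.

```python
import csv
import unicodedata

def compute_ratio(input_str):
    """Ratio of Latin letters to all non-whitespace characters."""
    chars = "".join(input_str.split())
    if not chars:
        return 0.0
    num_latin = sum(
        1 for ch in chars if unicodedata.name(ch, "").startswith("LATIN")
    )
    return num_latin / len(chars)

def filter_latin_rows(in_path, out_path, text_column=0, threshold=0.85):
    """Copy rows whose text column has a Latin ratio above `threshold`.

    Hypothetical helper: the paths, column index, and threshold are
    assumptions and need to be adapted to the real data.
    Returns the number of rows written.
    """
    kept = 0
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            # Skip empty rows; keep rows that are mostly Latin letters.
            if row and compute_ratio(row[text_column]) > threshold:
                writer.writerow(row)
                kept += 1
    return kept
```

Writing to a second file rather than editing in place keeps the original data intact while you tune the threshold. URLs, digits, and punctuation all lower the ratio, so a value well below 0.95 is probably needed for tweet-like text.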
