How can I replace Unicode characters with Turkish characters in a text file with Python?

I am working with Twitter data. I collected data from Twitter with the Stream API, and the application's output is a JSON file. I wrote the tweet data to a text file, and now I see Unicode escape sequences instead of Turkish characters. I don't want to do find/replace in Notepad++ by hand. Is there an automatic way in Python to open the txt file, read all of its data, and replace the Unicode escape sequences with Turkish characters?

Here are the Turkish characters and the Unicode escape sequences that I want to replace:

  • ğ - \u011f
  • Ğ - \u011e
  • ı - \u0131
  • İ - \u0130
  • ö - \u00f6
  • Ö - \u00d6
  • ü - \u00fc
  • Ü - \u00dc
  • ş - \u015f
  • Ş - \u015e
  • ç - \u00e7
  • Ç - \u00c7

I tried two different approaches:

#!/usr/bin/env python

# -*- coding: utf-8 -*- 

import re

dosya = open('veri.txt', 'r')

for line in dosya:
    match = re.search(line, "\u011f")
    if (match):
        replace("\u011f", "ğ")

dosya.close()

and:

#!/usr/bin/env python

# -*- coding: utf-8 -*- 

f1 = open('veri.txt', 'r')
f2 = open('veri2.txt', 'w')

for line in f1:
    f2.write=(line.replace('\u011f', 'ğ')) 
    f2.write=(line.replace('\u011e', 'Ğ'))
    f2.write=(line.replace('\u0131', 'ı'))
    f2.write=(line.replace('\u0130', 'İ'))
    f2.write=(line.replace('\u00f6', 'ö'))
    f2.write=(line.replace('\u00d6', 'Ö'))
    f2.write=(line.replace('\u00fc', 'ü'))
    f2.write=(line.replace('\u00dc', 'Ü'))
    f2.write=(line.replace('\u015f', 'ş'))
    f2.write=(line.replace('\u015e', 'Ş'))
    f2.write=(line.replace('\u00e7', 'ç'))
    f2.write=(line.replace('\u00c7', 'Ç'))

f1.close()
f2.close()

Neither of these worked. How can I make it work?

JSON allows both "escaped" and "unescaped" characters. The Twitter API returns only escaped characters so that it can use the ASCII encoding, which increases interoperability. For Turkish characters you need another encoding. The open function opens a file using your current locale encoding, which is probably what your editor expects. If you want the output file to have, e.g., the ISO-8859-9 encoding, you can pass encoding='ISO-8859-9' as an additional parameter to the open function.
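As a rough illustration of the encoding parameter described above, here is a minimal sketch that writes and re-reads Turkish text with an explicit encoding. The sample text and the scratch file name are made up for the demonstration and do not come from the question.

```python
import os
import tempfile

text = 'Yağız Şoför'

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'turkish.txt')

# Write with an explicit encoding instead of relying on the locale default.
with open(path, 'w', encoding='ISO-8859-9') as f:
    f.write(text)

# The same encoding has to be given when reading the file back.
with open(path, 'r', encoding='ISO-8859-9') as f:
    round_trip = f.read()

# Compare the encoded size in bytes under both encodings.
iso_size = len(text.encode('ISO-8859-9'))
utf8_size = len(text.encode('utf-8'))
print(round_trip == text, iso_size, utf8_size)
```

Note that ISO-8859-9 cannot represent every character that may appear in a tweet (emoji, for example), so encoding='utf-8' is likely the safer choice if your editor can read it.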

You can read a file containing a JSON object with the json.load function. This returns a Python object with the escaped characters decoded. Writing it again with json.dump and passing ensure_ascii=False as an argument writes the object back to a file without encoding Turkish characters as escape sequences. An example:

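import json

# Open both files in a with block so they are closed automatically.
with open('input.txt', 'r') as inp, open('output.txt', 'w') as out:
    in_as_obj = json.load(inp)  # escape sequences are decoded here
    # ensure_ascii=False keeps Turkish characters as literal characters.
    json.dump(in_as_obj, out, ensure_ascii=False)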
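To see what ensure_ascii changes without touching any files, the following small sketch compares the two forms in memory. The sample string is only an illustration.

```python
import json

escaped = '"Ya\\u011f\\u0131z"'  # JSON text using \uXXXX escapes (ASCII only)
unescaped = '"Yağız"'            # the same JSON string with literal characters

# Both forms are valid JSON and decode to the same Python string.
decoded_escaped = json.loads(escaped)
decoded_unescaped = json.loads(unescaped)

# ensure_ascii decides which form json.dumps produces.
default_output = json.dumps(decoded_escaped)                      # escaped form
literal_output = json.dumps(decoded_escaped, ensure_ascii=False)  # literal characters

print(default_output)
print(literal_output)
```

The default (ensure_ascii=True) is what produces the \u011f-style sequences seen in the text file.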

Your file isn't really a JSON file, but instead a file containing multiple JSON objects. If each JSON object is on its own line, you can try the following:

import json

with open('input.txt', 'r') as inp, open('output.txt', 'w') as out:
    for line in inp:
        # Copy blank lines through unchanged.
        if not line.strip():
            out.write(line)
            continue
        # Each non-blank line is parsed as one JSON object.
        in_as_obj = json.loads(line)
        json.dump(in_as_obj, out, ensure_ascii=False)
        out.write('\n')
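If some lines are not valid JSON (for example, plain text that merely contains \u011f-style sequences), json.loads will raise an error. One possible fallback, closer to the find/replace idea in the question, is a regular-expression substitution. This is only a sketch: the helper name unescape_unicode is made up for illustration, it handles four-digit escapes only (so emoji written as surrogate pairs are not combined), and it does not distinguish an escaped backslash followed by "u" from a real escape.

```python
import re

# Matches a literal backslash, the letter u, and exactly four hex digits.
ESCAPE_RE = re.compile(r'\\u([0-9a-fA-F]{4})')

def unescape_unicode(text):
    # Replace each literal \uXXXX sequence with the character it names.
    return ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), text)

# Raw string: the backslashes are kept literally, as they would be in the file.
sample = r'Ya\u011f\u0131z \u015eof\u00f6r'
result = unescape_unicode(sample)
print(result)
```

The same function could be applied to each line read from the input file before writing it to the output file.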

But in your case it's probably better to write unescaped JSON to the file in the first place. Try replacing your on_data method with (untested):

import json

def on_data(self, raw_data):
    # Decode the escaped JSON, then print it with literal non-ASCII characters.
    data = json.loads(raw_data)
    print(json.dumps(data, ensure_ascii=False))

You can use this method, which maps Turkish characters to their closest ASCII letters:

# For Turkish Character
translationTable = str.maketrans("ğĞıİöÖüÜşŞçÇ", "gGiIoOuUsScC")

yourText = "Pijamalı Hasta Yağız Şoföre Çabucak Güvendi"
yourText = yourText.translate(translationTable)

print(yourText)
