How can I encode/decode Finnish characters?

I'll explain my problem as best I can:

My program translates all the tweets it collects and then writes them to a JSON file, but as soon as it comes across a Finnish character it raises a "JSONDecodeError", as shown below:

Traceback (most recent call last):
  File "C:\Users\TheoLC\Desktop\python\twitter_search\collect+200tw.py", line 54, in <module>
    tweet.text = translator.translate(str(tweet.text), src='fi', dest='en')
  File "C:\Users\TheoLC\AppData\Local\Programs\Python\Python37\lib\site-packages\googletrans\client.py", line 172, in translate
    data = self._translate(text, dest, src)
  File "C:\Users\TheoLC\AppData\Local\Programs\Python\Python37\lib\site-packages\googletrans\client.py", line 81, in _translate
    data = utils.format_json(r.text)
  File "C:\Users\TheoLC\AppData\Local\Programs\Python\Python37\lib\site-packages\googletrans\utils.py", line 62, in format_json
    converted = legacy_format_json(original)
  File "C:\Users\TheoLC\AppData\Local\Programs\Python\Python37\lib\site-packages\googletrans\utils.py", line 54, in legacy_format_json
    converted = json.loads(text)
  File "C:\Users\TheoLC\AppData\Local\Programs\Python\Python37\lib\json\__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "C:\Users\TheoLC\AppData\Local\Programs\Python\Python37\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\TheoLC\AppData\Local\Programs\Python\Python37\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
>>> 

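The entire code:

import json

from googletrans import Translator
from nltk.tokenize import word_tokenize
from textblob import TextBlob

# `api` is assumed to be an authenticated tweepy.API instance created elsewhere.
translator = Translator()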

search_word = input("subject ? \n")

search_word = TextBlob(search_word)

search_word_finnish = translator.translate(str(search_word), dest='fi')

search_word_french = translator.translate(str(search_word), dest='fr')

print("Mot en finnois : " + str(search_word_finnish.text) + " \n")
print("Mot en français : " + str(search_word_french.text) + " \n")


searched_tweets = []

taille = input("nb de tweets ?")

# Integer division: `count` must be a whole number of tweets.
new_tweets_en = api.search(search_word, count=int(taille)//3)
new_tweets_fi = api.search(search_word_finnish.text, count=int(taille)//3)
new_tweets_fr = api.search(search_word_french.text, count=int(taille)//3)


print("j'ai trouver ", len(new_tweets_en), "tweets en anglais")
print("j'ai trouver ", len(new_tweets_fi), "tweets en finnois")
print("j'ai trouver ", len(new_tweets_fr), "tweets en français")

if not new_tweets_en and not new_tweets_fr and not new_tweets_fi:
    print("pas de tweets trouves")




for tweet in new_tweets_fi:
    # translate() returns a Translated object; keep only its text.
    tweet.text = translator.translate(str(tweet.text), src='fi', dest='en').text

for tweet in new_tweets_fr:
    tweet.text = translator.translate(str(tweet.text), src='fr', dest='en').text



new_tweets = new_tweets_en + new_tweets_fr + new_tweets_fi
searched_tweets.extend(new_tweets)


with open("%s_tweets.json" % search_word, 'a', encoding='utf-8') as f:
    for tweet in new_tweets:
        json.dump(tweet._json, f, indent=4, ensure_ascii=False)

for tweet in new_tweets:
    tweet_token = word_tokenize(str(tweet.text))
    print(u'Tweet tokenize : ' + str(tweet_token))
    print("\n")

Thanks in advance to anyone who can help.

Your file is not a valid JSON file, nor is it line-delimited JSON; it is just a series of JSON objects written one after another with no separator. You can read it with a small helper function:

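import json

def json_stream(badjson):
    """Yield each JSON value from a string of concatenated JSON documents."""
    offset = 0
    while True:
        try:
            # Succeeds only when the remainder holds exactly one document.
            data = json.loads(badjson[offset :])
            yield data
            break
        except json.JSONDecodeError as e:
            # e.pos is where the "extra data" starts, i.e. where the first document ends.
            yield json.loads(badjson[offset : offset + e.pos])
            offset += e.pos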

Given such a string of concatenated JSON, this generator yields the deserialized objects one by one.
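
As an alternative sketch (not the helper above), the standard library's json.JSONDecoder.raw_decode returns both the parsed value and the index where parsing stopped, so the string can be walked in one pass without re-parsing the remainder each time. The function name iter_json_values and the sample string are made up for illustration:

```python
import json


def iter_json_values(text):
    """Yield each JSON value from a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # Returns the decoded value and the index just past it.
        value, pos = decoder.raw_decode(text, pos)
        yield value


# Two objects written back to back, as json.dump in a loop would produce.
sample = '{"text": "hyvää päivää"}{\n    "text": "kiitos"\n}\n'
values = list(iter_json_values(sample))
print(values)
```

Finnish characters need no special handling here, since the input is already a Python str.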

If, as it seems, you need a repaired file, you can use the same helper to write one:

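# Pass encoding explicitly: the default on Windows is not UTF-8.
with open("fixed.jsonl", "w", encoding="utf-8") as fw:
    with open("bad.json", encoding="utf-8") as fr:
        for data in json_stream(fr.read()):
            # One JSON document per line; keep non-ASCII characters readable.
            fw.write(json.dumps(data, ensure_ascii=False))
            fw.write("\n")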
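
To avoid producing the concatenated file in the first place, one option is to write one compact JSON object per line when saving the tweets. A minimal sketch, assuming plain dicts stand in for the tweet._json objects from the question; the records list and the file name are made up for illustration:

```python
import json
import os
import tempfile

# Hypothetical stand-ins for tweet._json.
records = [
    {"id": 1, "text": "Hyvää huomenta"},
    {"id": 2, "text": "Ça va très bien"},
]

path = os.path.join(tempfile.mkdtemp(), "tweets.jsonl")

with open(path, "w", encoding="utf-8") as f:
    for record in records:
        # No indent: each object has to stay on a single line.
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Reading back is then one json.loads call per line.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)
```

Opening the file in append mode ('a'), as the question does, also works with this layout, because every line is a complete document on its own.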
