File encoding from English text to UTF-8

Question

How do I convert Non-ISO extended-ASCII English text, with CRLF line terminators, to UTF-8 in Python?

Answer 1

Extending Jishiyu's answer, you can use uchardet to detect the character set. For example:

iconv -f `uchardet a_strange_file.txt` -t UTF-8 -o the_output_file.txt a_strange_file.txt

Note, however, that this does not do the job in Python.
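For a Python-only route, here is a minimal sketch that guesses the encoding by trial decoding. It is not uchardet and does no statistical detection. The function name guess_decode and the candidate list (utf-8, cp1252, latin-1) are assumptions chosen for illustration; latin-1 accepts every byte, so it acts as a last-resort fallback.

```python
from typing import Iterable, Tuple


def guess_decode(
    data: bytes,
    candidates: Iterable[str] = ("utf-8", "cp1252", "latin-1"),
) -> Tuple[str, str]:
    """Return (text, encoding) for the first candidate that decodes cleanly.

    The candidate order is an assumption: strict encodings go first,
    and latin-1 (which never fails) goes last.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


if __name__ == "__main__":
    text, used = guess_decode(b"caf\xe9\r\n")
    print(used, repr(text))
```

A clean decode only shows that the bytes are valid in that encoding, not that the guess is right, so check the output by eye when the source encoding is unknown.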

Answer 2

I think the Linux commands unix2dos, dos2unix, and iconv will be helpful.

For example:

iconv -f latin-1 -t UTF-8 latin.txt >utf8.txt
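A rough Python counterpart of the iconv call, with an optional dos2unix-style step, might look like the sketch below. The function name convert_file and its parameters are hypothetical, and latin-1 is only assumed as the source encoding because the command above uses it.

```python
def convert_file(src, dst, src_encoding="latin-1", normalize_newlines=False):
    """Re-encode src to UTF-8 in dst, optionally turning CRLF into LF."""
    # newline="" turns off Python's newline translation, so CRLF line
    # endings pass through untouched unless we rewrite them ourselves.
    with open(src, "r", encoding=src_encoding, newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            if normalize_newlines and line.endswith("\r\n"):
                line = line[:-2] + "\n"
            fout.write(line)


if __name__ == "__main__":
    import sys

    # Usage: python convert.py latin.txt utf8.txt
    convert_file(sys.argv[1], sys.argv[2])
```

Leaving normalize_newlines off keeps the CRLF terminators, which matches what iconv alone does.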

Answer 3

If you read your input file as a raw byte stream, you can decode it from its source encoding and then encode the result as UTF-8. See this blog post with some Python 3 examples.
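The two steps can be sketched in memory as follows. The helper name to_utf8 is hypothetical, and cp1252 is an assumption about the source encoding (a common one for text that the file utility reports as "Non-ISO extended-ASCII"); substitute whichever encoding the file really uses.

```python
def to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode raw bytes from source_encoding, then encode them as UTF-8."""
    text = raw.decode(source_encoding)  # bytes -> str (Unicode)
    return text.encode("utf-8")         # str -> UTF-8 bytes


if __name__ == "__main__":
    # 0x92 is a curly apostrophe and 0xb0 a degree sign in cp1252.
    raw = b"It\x92s 5\xb0C\r\n"
    print(to_utf8(raw))
```

The CRLF terminator is plain ASCII, so it passes through unchanged.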

Answer 4

I have created an automated conversion script using the enca library. I use it on my NAS to convert subtitles to UTF-8, but it could be used for any automated conversion.

Feel free to use it :)

EDIT:

#!/bin/bash
LANGUAGE=czech
TO=utf8
CONVERT="enca -L $LANGUAGE -x $TO"

# Find and convert
find ./ -type f -name "*.srt" | while IFS= read -r fn; do
    # First encoding name reported by enca that is on the skip list, if any
    IS_TARGET=$(enca "${fn}" | grep -Eow -m 1 'UTF-8|Unrecognized|KOI8-CS2|7bit ASCII|UCS-2|Macintosh Central European')

    if [ "$IS_TARGET" != "UTF-8" ] &&
       [ "$IS_TARGET" != "UCS-2" ] &&
       [ "$IS_TARGET" != "Macintosh Central European" ] &&
       [ "$IS_TARGET" != "Unrecognized" ] &&
       [ "$IS_TARGET" != "7bit ASCII" ] &&
       [ "$IS_TARGET" != "KOI8-CS2" ]; then

        echo "${fn} ---- Will be converted!"
        # optional backup of original srt
        # cp "${fn}" "${fn}.bak"
        $CONVERT "${fn}"
    fi

done
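Since the question asks for Python, a loose counterpart of the script could look like the sketch below. It does not call enca. Instead it treats any file that fails a strict UTF-8 decode as needing conversion. The function name convert_tree and the cp1250 source encoding are assumptions (cp1250 is a plausible guess for Czech subtitles, not something enca confirmed).

```python
from pathlib import Path


def convert_tree(root, pattern="*.srt", source_encoding="cp1250", backup=False):
    """Convert matching files under root to UTF-8 in place.

    Files that already decode as UTF-8 (including pure ASCII) are skipped.
    Returns the list of paths that were rewritten.
    """
    converted = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        raw = path.read_bytes()
        try:
            raw.decode("utf-8")
            continue  # already valid UTF-8, leave it alone
        except UnicodeDecodeError:
            pass
        # Decode first so a failure here leaves the file untouched.
        text = raw.decode(source_encoding)
        if backup:
            path.with_name(path.name + ".bak").write_bytes(raw)
        path.write_bytes(text.encode("utf-8"))
        converted.append(path)
    return converted


if __name__ == "__main__":
    for p in convert_tree("."):
        print(f"{p} ---- converted")
```

Unlike the enca version, this cannot tell one legacy encoding from another, so it only suits directories where every non-UTF-8 file shares the assumed source encoding.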

4 answers

solution1
1 ACCEPTED 2013-12-05 14:31:42

solution2
0 2012-05-01 07:26:46

solution3
0 2012-05-01 08:23:54

solution4
0 2016-08-29 12:27:10
