Python 2.7 build on Sublime Text 3 doesn't print the '\uFFFD' character

The problem.

I'm using a Python 2.7 build system in Sublime Text 3 and have an issue with printed output.
In some cases I get pretty confusing output for '\ufffd', the 'REPLACEMENT CHARACTER'.


For example:

print u'\ufffd' # should be '�' - the 'REPLACEMENT CHARACTER'
print u'\u0061' # should be 'a'
-----------------------------------------------------
[Finished in 0.1s]

After reversing the order:

print u'\u0061' 
print u'\ufffd'
-----------------------------------------------------
a
�
[Finished in 0.1s]

So Sublime can print the '�' character, but for some reason doesn't do so in the first case.
The dependence of the output on the order of the statements seems quite strange.


The problem with the replacement character leads to very unpredictable print behavior in general.
For example, I want to print decoded bytes with error replacement:

cp1251_bytes = '\xe4\xe0' # 'да' in cp1251 
print cp1251_bytes.decode('utf-8', errors='replace')
-----------------------------------------------------
��
[Finished in 0.1s]

Let's replace the bytes:

cp1251_bytes = '\xed\xe5\xf2' # 'нет' in cp1251
print cp1251_bytes.decode('utf-8', errors='replace')
-----------------------------------------------------
[Finished in 0.1s]

And add one more print statement:

cp1251_bytes = '\xed\xe5\xf2' # 'нет' in cp1251 
print cp1251_bytes.decode('cp1251') 
print cp1251_bytes.decode('utf-8', errors='replace')
-----------------------------------------------------
нет
���
[Finished in 0.1s]
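
For reference, the strings these snippets should produce can be checked without relying on the console, by inspecting code points instead of printing the characters themselves. This is a minimal sketch written to run on both Python 2.7 and Python 3; the variable names are illustrative and not from the original post.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

da_bytes = b'\xe4\xe0'        # 'да' in cp1251
net_bytes = b'\xed\xe5\xf2'   # 'нет' in cp1251

# None of these bytes forms a valid UTF-8 sequence here, so each one
# is replaced by U+FFFD when decoding with the 'replace' handler.
da_as_utf8 = da_bytes.decode('utf-8', 'replace')
net_as_utf8 = net_bytes.decode('utf-8', 'replace')
net_as_cp1251 = net_bytes.decode('cp1251')

# The printed lists are pure ASCII, so they do not depend on how the
# console handles multi-byte output.
print([hex(ord(ch)) for ch in da_as_utf8])     # ['0xfffd', '0xfffd']
print([hex(ord(ch)) for ch in net_as_utf8])    # ['0xfffd', '0xfffd', '0xfffd']
print([hex(ord(ch)) for ch in net_as_cp1251])  # ['0x43d', '0x435', '0x442']
```

If these lists come out as shown, the decoding itself is correct and any missing output is lost between the interpreter and the build panel.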

Below is an illustration of some other test cases:

[image]


To summarize, the described print behavior shows the following patterns:

  • it depends on whether the number of '\ufffd' characters in the print statement is even or odd
  • it depends on the order of the print statements
  • it depends on the specific build run


    My questions:

  • Why does this happen?
  • How can I fix the problem?



    My Python 2.7 sublime-build file:

    {
        "cmd": ["C:\\_Anaconda3\\envs\\python27\\python", "-u", "$file"],
        "file_regex": "^[ ]*File \"(...*?)\", line ([0-9]*)",
        "selector": "source.python",
        "env": {"PYTHONIOENCODING": "utf-8"}
    }

    With Python 2.7 installed separately from Anaconda, the behavior is exactly the same.

  • Edit-1 - Using UTF-8 with BOM

    It seems the BOM becomes important on Windows, so you need to use a build config like the one below:

    {   
        "cmd": ["F:\\Python27-14\\python", "-u", "$file"],
        "file_regex": "^[ ]*File \"(...*?)\", line ([0-9]*)",
        "selector": "source.python",
        "env": {
            "PYTHONIOENCODING": "utf_8_sig"
        },
    }
    

    After that it works correctly for me on Windows as well.

    [image: build settings]

    [image: correct output]
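
As a small illustration of what the `utf_8_sig` codec changes at the byte level, the sketch below compares it with plain `utf-8` for the replacement character. It is written to run on both Python 2.7 and Python 3 and only shows the codec itself. How the interpreter applies the codec to `sys.stdout`, and how the Sublime console reacts to the BOM, is not demonstrated here.

```python
from __future__ import print_function
import codecs

text = u'\ufffd'

plain = text.encode('utf-8')         # b'\xef\xbf\xbd'
with_bom = text.encode('utf_8_sig')  # same bytes, prefixed with the UTF-8 BOM

print(repr(plain))
print(repr(with_bom))

# The only difference is the three-byte BOM (EF BB BF) at the start.
print(with_bom == codecs.BOM_UTF8 + plain)  # True
```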

    Original Answer

    I checked the issue and didn't run into it on Python 2.7 with Sublime Text. The only change was that I had to add # -*- coding: utf-8 -*- to the top of the file, which seems to be the missing part in this question.

    # -*- coding: utf-8 -*-
    
    print u'\u0061' # should be 'a'
    print u'\ufffd' # should be '�' - the 'REPLACEMENT CHARACTER'
    

    After that, reversing the order has no impact.

    [image: print 1]

    [image: print 2]

    You can see more details about this required header in

    Why declare unicode by string in python?

    Below is a summary of the above link:

    When you specify # -*- coding: utf-8 -*- , you're telling Python that the source file you've saved is utf-8 . The default for Python 2 is ASCII (for Python 3 it's utf-8 ). This only affects how the interpreter reads the characters in the file.
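
The effect of declaring an encoding can be sketched by decoding the same saved bytes in different ways. This is only an analogy for how the interpreter reads a source file, not the interpreter's actual code path; it runs on both Python 2.7 and Python 3, and the variable names are illustrative.

```python
from __future__ import print_function

# The UTF-8 bytes an editor would save for the two Cyrillic letters
# U+0434 U+0430.
saved_bytes = u'\u0434\u0430'.encode('utf-8')

as_utf8 = saved_bytes.decode('utf-8')      # the intended two characters
as_latin1 = saved_bytes.decode('latin-1')  # four unrelated characters

# ASCII cannot represent these bytes at all.
try:
    saved_bytes.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(len(as_utf8), len(as_latin1), ascii_ok)  # 2 4 False
```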

    I've reproduced your problem and found a solution that works on my platform, at least: remove the -u flag from your cmd build config option.

    I'm not 100% sure why that works, but it seems to be a poor interaction that results from the console interpreting an unbuffered stream of data containing multi-byte characters. Here's what I've found:

    • The -u option switches Python's output to unbuffered.
    • This problem is not at all specific to the replacement character. I've gotten similar behaviour with other characters like "あ" (U+3042).
    • Similar bad results happen with other encodings. Setting "env": {"PYTHONIOENCODING": "utf-16be"} results in print u'\u3042' outputting 0B .

    That last example, with the encoding set to UTF-16BE, illustrates what I think is going on. The console is receiving one byte at a time because the output is unbuffered, so it receives the 0x30 byte first. The console then determines that this is not valid UTF-16BE, decides to fall back to ASCII instead, and thus outputs 0 . It of course receives the next byte right after and follows the same logic to output B .
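
The byte values in that example can be verified directly. This short sketch, which runs on both Python 2.7 and Python 3, shows that U+3042 encodes to 0x30 0x42 in UTF-16BE and that those two bytes read as ASCII are `0B`:

```python
from __future__ import print_function

encoded = u'\u3042'.encode('utf-16-be')

# The two bytes of the UTF-16BE encoding of U+3042.
print([hex(b) for b in bytearray(encoded)])  # ['0x30', '0x42']

# Interpreted one byte at a time as ASCII, they are the characters '0' and 'B'.
print(encoded.decode('ascii'))               # 0B
```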

    With the UTF-8 encoding, the console receives bytes that can't possibly be interpreted as ASCII, so I believe the console does a slightly better job of interpreting the unbuffered stream, but it still runs into the difficulties that your question points out.
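
To make the hypothesis concrete, the sketch below contrasts two ways a consumer could handle a UTF-8 stream that arrives one byte at a time. One decodes every chunk on its own; the other uses an incremental decoder that holds incomplete sequences until the remaining bytes arrive. It only illustrates the mechanism described above and assumes nothing about how the Sublime Text console is actually implemented. It runs on both Python 2.7 and Python 3.

```python
from __future__ import print_function
import codecs

encoded = u'\u3042'.encode('utf-8')  # three bytes: E3 81 82
chunks = [bytes(bytearray([b])) for b in bytearray(encoded)]

# Naive consumer: decode each one-byte chunk independently.
# No single byte is valid UTF-8 on its own, so each becomes U+FFFD.
naive = u''.join(chunk.decode('utf-8', 'replace') for chunk in chunks)

# Stateful consumer: the incremental decoder keeps partial sequences
# and emits the character once its last byte has been seen.
decoder = codecs.getincrementaldecoder('utf-8')(errors='replace')
streamed = u''.join(decoder.decode(chunk) for chunk in chunks)
streamed += decoder.decode(b'', final=True)

print([hex(ord(ch)) for ch in naive])     # ['0xfffd', '0xfffd', '0xfffd']
print([hex(ord(ch)) for ch in streamed])  # ['0x3042']
```

A consumer of the first kind would garble any multi-byte character that is split across writes, which is consistent with removing -u (so that whole characters arrive together) making the problem go away.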
