How do I include unicode strings in Python doctests?

I am working on some code that has to manipulate unicode strings. I am trying to write doctests for it, but am having trouble. The following is a minimal example that illustrates the problem:

# -*- coding: utf-8 -*-
def mylen(word):
  """
  >>> mylen(u"áéíóú")
  5
  """
  return len(word)

print mylen(u"áéíóú")

First we run the code to see the expected output of print mylen(u"áéíóú").

$ python mylen.py
5

Next, we run doctest on it to see the problem.

$ python -m
5
**********************************************************************
File "mylen.py", line 4, in mylen.mylen
Failed example:
    mylen(u"áéíóú")
Expected:
    5
Got:
    10
**********************************************************************
1 items had failures:
   1 of   1 in mylen.mylen
***Test Failed*** 1 failures.

How then can I test that mylen(u"áéíóú") evaluates to 5?

If you want unicode strings, you have to use unicode docstrings! Mind the u!

# -*- coding: utf-8 -*-
def mylen(word):
  u"""        <----- SEE 'u' HERE
  >>> mylen(u"áéíóú")
  5
  """
  return len(word)

print mylen(u"áéíóú")

This will work, as long as the tests pass. For Python 2.x you need yet another hack to make verbose doctest mode work or to get correct tracebacks when tests fail:

if __name__ == "__main__":
    import sys
    reload(sys)
    sys.setdefaultencoding("UTF-8")
    import doctest
    doctest.testmod()

NB! Only ever use setdefaultencoding for debug purposes. I'd accept it for doctest use, but not anywhere in your production code.
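
For what it's worth, the 10 in the failing output matches the number of bytes in the UTF-8 encoding of the five accented characters, which is consistent with the plain docstring being handled as a byte string. The sketch below is not from the answer above. It is written for Python 3, where bytes and text are separate types, only to make the arithmetic visible; the variable names are illustrative.

```python
# Python 3 sketch: compare character count and UTF-8 byte count.
word = "\xe1\xe9\xed\xf3\xfa"        # the five accented vowels, written as escapes

char_count = len(word)               # number of characters in the text string
utf8_bytes = word.encode("utf-8")    # each of these letters needs 2 bytes in UTF-8
byte_count = len(utf8_bytes)

# Reading those bytes back one byte per character (Latin-1) yields a
# string as long as the byte count, not the original character count.
misread = utf8_bytes.decode("latin-1")

print(char_count, byte_count, len(misread))  # prints: 5 10 10
```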

Python 2.6.6 doesn't understand unicode output very well, but this can be fixed using:

  • the already described hack with sys.setdefaultencoding("UTF-8")
  • a unicode docstring (already mentioned above too, thanks a lot)
  • AND a print statement.

In my case, this docstring reports the test as broken:

def beatiful_units(*units):
    u'''Returns nice string like 'erg/(cm² sec)'.

    >>> beatiful_units(('erg', 1), ('cm', -2), ('sec', -1))
    u'erg/(cm² sec)'
    '''

with the "error" message:

Failed example:
    beatiful_units(('erg', 1), ('cm', -2), ('sec', -1))
Expected:
    u'erg/(cm² sec)'
Got:
    u'erg/(cm\xb2 sec)'

Using print we can fix that:

def beatiful_units(*units):
    u'''Returns nice string like 'erg/(cm² sec)'.

    >>> print beatiful_units(('erg', 1), ('cm', -2), ('sec', -1))
    erg/(cm² sec)
    '''
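
The difference between the two docstrings comes down to repr versus print: without print, doctest compares against the repr of the returned value, and in Python 2 the repr of a unicode string escapes non-ASCII characters. As a rough Python 3 illustration (not part of the original answer), the built-in ascii() produces a similarly escaped form, while str() leaves the text readable. The variable names are made up for this sketch.

```python
# Python 3 sketch: escaped representation vs. printable text.
units = "erg/(cm\u00b2 sec)"   # \u00b2 is the superscript two from the docstring

escaped = ascii(units)         # escaped form, comparable to Python 2's repr of a unicode string
readable = str(units)          # the text that print would display

print(escaped)                 # shows the \xb2 escape, wrapped in quotes
print(readable)                # shows the superscript two itself
```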

This appears to be a known and as yet unresolved issue in Python. See open issues here and here.

Not surprisingly, it can be modified to work OK in Python 3 since all strings are Unicode there:

def mylen(word):
  """
  >>> mylen("áéíóú")
  5
  """
  return len(word)

print(mylen("áéíóú"))
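
One way to check the Python 3 version without the command line is to run the docstring through doctest's own parser and runner. This is only a sketch: the variable names are mine, and the print call from the listing above is left out so that the block prints nothing but the summary counts.

```python
import doctest


def mylen(word):
    """
    >>> mylen("áéíóú")
    5
    """
    return len(word)


# Parse the docstring into a DocTest object and run it with the standard runner.
parser = doctest.DocTestParser()
test = parser.get_doctest(mylen.__doc__, {"mylen": mylen}, "mylen", None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)

# results carries the number of failed and attempted examples.
print(results.failed, results.attempted)
```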

My solution was to escape the unicode characters, like u'\\xe1\\xe9\\xed\\xf3\\xfa'. It wasn't as easy to read, but my tests only had a few non-ASCII characters, so in those cases I put the description to the side as a comment, like "# n with tilde".
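
A sketch of how the escaping approach might look for the question's mylen example. It assumes a raw docstring (the r prefix), so the backslashes reach doctest unchanged instead of being doubled by hand; that choice, the comment text, and the runner code are my assumptions, not something the answer above spells out.

```python
import doctest


def mylen(word):
    r"""
    >>> mylen(u"\xe1\xe9\xed\xf3\xfa")  # a, e, i, o, u with acute accents
    5
    """
    return len(word)


# Run the docstring example through doctest's parser and runner.
parser = doctest.DocTestParser()
test = parser.get_doctest(mylen.__doc__, {"mylen": mylen}, "mylen", None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)

print(results.failed, results.attempted)
```

Because the source file then contains only ASCII, the example no longer depends on how the docstring's bytes are decoded.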

As already mentioned, you need to ensure your docstrings are Unicode.

If you can switch to Python 3, it works automatically there, since the default source encoding is already UTF-8 and the default string type is Unicode.

To achieve the same in Python 2, you need to keep the coding: utf-8 declaration, and in addition either prefix all docstrings with u or simply add

from __future__ import unicode_literals
