How to make the str.splitlines method not split lines on hex characters?

I'm trying to parse the output of the GNU strings utility with str.splitlines(). Here is the raw output from GNU strings:

279304 9k=pN\n 279340 9k=PN\n 279376 9k<LN\n 279412 9k=\x0cN\n 279448 9k<4N\n

When I parse the output with the following code:

import subprocess

process = subprocess.run(['strings', '-o', main_exe], check=True,
                         stdout=subprocess.PIPE, universal_newlines=True)
output = process.stdout
print(output)
lines = output.splitlines()
for line in lines:
    print(line)

I get a result that I don't expect, and it breaks my further parsing:

279304 9k=pN
279340 9k=PN
279376 9k<LN
279412 9k=
          N
279448 9k<4N
279592 9k<hN
279628 9k;TN
279664 9k<$N

Can I somehow tell the splitlines() method not to trigger on \x0c characters?

The desired result should consist of lines that each start with an offset (the 6 digits at the start of each line):

279304 9k=pN
279340 9k=PN
279376 9k<LN
279412 9k=N
279448 9k<4N
279592 9k<hN
279628 9k;TN
279664 9k<$N

I think that you actually get the expected result. Assuming ASCII or any of its derivatives (Latin-x, UTF-8, etc.), '\x0c' is the form feed control character, which happens to be rendered here as a one-line vertical jump.

Said differently, I would bet a coin that the resulting file contains the expected bytes, but that your further processing chokes on the control character.
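
A quick way to tell a display effect from an actual split is to print repr() of each piece, which shows control characters as escapes instead of letting the terminal act on them. The sketch below uses a shortened two-line sample (an assumption, not the full strings output) and compares str.splitlines() with a plain split on '\n':

```python
# Shortened sample of the strings output (illustrative only).
raw = '279412 9k=\x0cN\n279448 9k<4N\n'

by_splitlines = raw.splitlines()   # str.splitlines also breaks on \x0c
by_newline = raw.split('\n')       # breaks on \n only

for piece in by_splitlines:
    print(repr(piece))
for piece in by_newline:
    print(repr(piece))
```

With repr(), the form feed shows up as '\x0c' inside a line when the text was split on '\n' only, whereas str.splitlines() yields a separate 'N' piece.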

The documentation for str.splitlines() says it splits lines on a number of line boundary types, including \x0c. If you only want to split on \n, you can use str.split('\n') instead. Note, however, that if your text ends with \n, the result ends with an empty string, which you may want to drop.

data = '279304 9k=pN\n 279340 9k=PN\n 279376 9k<LN\n 279412 9k=\x0cN\n 279448 9k<4N\n'
lines = data.split('\n')
if lines[-1] == '':
    lines.pop()
print(lines)
for line in lines:
    print(line)

OUTPUT

['279304 9k=pN', ' 279340 9k=PN', ' 279376 9k<LN', ' 279412 9k=\x0cN', ' 279448 9k<4N']
279304 9k=pN
 279340 9k=PN
 279376 9k<LN
 279412 9k=N
 279448 9k<4N

import subprocess

process = subprocess.run(['strings', '-o', main_exe], check=True,
                         stdout=subprocess.PIPE, universal_newlines=True)
# Split on '\n' only and skip empty lines (such as the one after the final '\n')
lines = [line.strip() for line in process.stdout.split('\n') if line]

Remove the call to strip() if you do want to keep the leading whitespace on every line.
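
Once the text is split on '\n' only, each line can be separated into its offset and its string. The following is a minimal sketch, assuming every line has the form "optional spaces, digits, one space, text"; the names OFFSET_LINE and parse_strings_output are hypothetical. The offset is kept as text because its radix depends on how strings was invoked.

```python
import re

# Hypothetical pattern: optional leading spaces, the offset digits,
# one space, then the rest of the line ('.' also matches \x0c).
OFFSET_LINE = re.compile(r'\s*(\d+) (.*)')

def parse_strings_output(output):
    """Return (offset, text) pairs, skipping lines that do not match."""
    entries = []
    for line in output.split('\n'):
        match = OFFSET_LINE.fullmatch(line)
        if match:
            entries.append((match.group(1), match.group(2)))
    return entries

sample = '279304 9k=pN\n 279412 9k=\x0cN\n 279448 9k<4N\n'
entries = parse_strings_output(sample)
```

The empty string left after the final '\n' does not match the pattern, so it is dropped without a separate check.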

Your problem arises from using the splitlines method of Unicode strings, which produces different results than the splitlines method of byte strings.

There is a CPython issue for this problem, open since 2014: "str.splitlines splitting on non-\r\n characters" (Issue #66428, python/cpython).
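
The difference can be seen in a few lines of Python 3. This is only an illustrative sketch: the sample text and the helper name decode_lines are made up for the example, and UTF-8 is assumed as the encoding.

```python
# Sample containing a form feed, a file separator, a Unicode line
# separator, a DOS line ending and a Unix line ending.
text = 'a\x0cb\x1cc\u2028d\r\ne\n'
data = text.encode('utf-8')

str_lines = text.splitlines()     # splits on every one of them
bytes_lines = data.splitlines()   # splits on \r\n and \n only

def decode_lines(raw, encoding='utf-8'):
    """Split bytes on ASCII line endings, then decode each line."""
    return [line.decode(encoding, errors='replace')
            for line in raw.splitlines()]

print(str_lines)
print(decode_lines(data))
```

Applied to the question, this suggests leaving out universal_newlines=True so that process.stdout is a bytes object, and decoding only after the lines have been split.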

Below I have added a portable splitlines function that uses the traditional ASCII line break characters for both Unicode and byte strings and works under both Python 2 and Python 3. A poor man's version for efficiency enthusiasts is also provided.

  • In Python 2, type str is an 8-bit string and Unicode strings have type unicode.
  • In Python 3, type str is a Unicode string and 8-bit strings have type bytes.

Although line splitting itself works the same way in Python 2 and Python 3 for Unicode strings and for 8-bit strings, vanilla code running under Python 3 is more likely to run into trouble with the extended universal newlines approach for Unicode strings.

The following table shows which Python data type employs which splitting method.

Split method   Python 2             Python 3
ASCII          str.splitlines       bytes.splitlines
Unicode        unicode.splitlines   str.splitlines

import re

# True when the native str type is Unicode: only Unicode strings split on form feed
str_is_unicode = len('a\fa'.splitlines()) > 1

def splitlines(string): # ||:fnc:||
    r"""Portable definitive ASCII splitlines function.

    In Python 2, type :class:`str` is an 8-bit string and Unicode strings
    have type :class:`unicode`.

    In Python 3, type :class:`str` is a Unicode string and 8-bit strings
    have type :class:`bytes`.

    Although there is no actual difference in line splitting between
    Python 2 and Python 3 Unicode and 8-bit strings, when running
    vanilla code under Python 3, it is more likely to run into trouble
    with the extended `universal newlines`_ approach for Unicode
    strings.

    The following table shows which Python data type employs which
    splitting method.

    +--------------+---------------------------+---------------------------+
    | Split Method | Python 2                  | Python 3                  |
    +==============+===========================+===========================+
    | ASCII        | `str.splitlines <ssl2_>`_ | `bytes.splitlines`_       |
    +--------------+---------------------------+---------------------------+
    | Unicode      | `unicode.splitlines`_     | `str.splitlines <ssl3_>`_ |
    +--------------+---------------------------+---------------------------+
    
    This function provides a portable and definitive method to apply
    ASCII `universal newlines`_ for line splitting. The reencoding is
    performed to take advantage of splitlines' `universal newlines`_
    approach for Unix, DOS and Macintosh line endings.

    While the poor man's version of simply splitting on \n might seem
    more performant, it falls short when a mixture of Unix, DOS and
    Macintosh line endings is encountered. Just for reference, a
    general implementation is presented, which avoids some common
    pitfalls.

    >>> test_strings = (
    ...     "##\ftrail\n##\n\ndone\n\n\n",
    ...     "##\ftrail\n##\n\ndone\n\n\nxx",
    ...     "##\ftrail\n##\n\ndone\n\nx\n",
    ...     "##\ftrail\r##\r\rdone\r\r\r",
    ...     "##\ftrail\r\n##\r\n\r\ndone\r\n\r\n\r\n")

    The global variable :data:`str_is_unicode` determines portably,
    whether a :class:`str` object is a Unicode string.

    .. code-block:: python

       str_is_unicode = len('a\fa'.splitlines()) > 1

    This makes it possible to define some generic conversion functions:

    >>> if str_is_unicode:
    ...     make_native_str = lambda s, e=None: getattr(s, 'decode', lambda _e: s)(e or 'utf8')
    ...     make_uc_string = make_native_str
    ...     make_u8_string = lambda s, e=None: ((isinstance(s, str) and (s.encode(e or 'utf8'), 1)) or (s, 1))[0]
    ... else:
    ...     make_native_str = lambda s, e=None: ((isinstance(s, unicode) and (s.encode(e or 'utf8'), 1)) or (s, 1))[0]
    ...     make_u8_string =  make_native_str
    ...     make_uc_string = lambda s, e=None: ((not isinstance(s, unicode) and (s.decode('utf8'), 1)) or (s, 1))[0]

    for a portable doctest:

    >>> for test_string in test_strings:
    ...     print('--------------------')
    ...     print(repr(test_string))
    ...     print(repr([make_native_str(_l) for _l in splitlines(make_u8_string(test_string))]))
    ...     print(repr([make_native_str(_l) for _l in poor_mans_splitlines(make_u8_string(test_string))]))
    ...     print([make_native_str(_l) for _l in splitlines(make_uc_string(test_string))])
    ...     print([make_native_str(_l) for _l in poor_mans_splitlines(make_uc_string(test_string))])
    --------------------
    '##\x0ctrail\n##\n\ndone\n\n\n'
    ['##\x0ctrail', '##', '', 'done', '', '']
    ['##\x0ctrail', '##', '', 'done', '', '']
    ['##\x0ctrail', '##', '', 'done', '', '']
    ['##\x0ctrail', '##', '', 'done', '', '']
    --------------------
    '##\x0ctrail\n##\n\ndone\n\n\nxx'
    ['##\x0ctrail', '##', '', 'done', '', '', 'xx']
    ['##\x0ctrail', '##', '', 'done', '', '', 'xx']
    ['##\x0ctrail', '##', '', 'done', '', '', 'xx']
    ['##\x0ctrail', '##', '', 'done', '', '', 'xx']
    --------------------
    '##\x0ctrail\n##\n\ndone\n\nx\n'
    ['##\x0ctrail', '##', '', 'done', '', 'x']
    ['##\x0ctrail', '##', '', 'done', '', 'x']
    ['##\x0ctrail', '##', '', 'done', '', 'x']
    ['##\x0ctrail', '##', '', 'done', '', 'x']
    --------------------
    '##\x0ctrail\r##\r\rdone\r\r\r'
    ['##\x0ctrail', '##', '', 'done', '', '']
    ['##\x0ctrail', '##', '', 'done', '', '']
    ['##\x0ctrail', '##', '', 'done', '', '']
    ['##\x0ctrail', '##', '', 'done', '', '']
    --------------------
    '##\x0ctrail\r\n##\r\n\r\ndone\r\n\r\n\r\n'
    ['##\x0ctrail', '##', '', 'done', '', '']
    ['##\x0ctrail', '##', '', 'done', '', '']
    ['##\x0ctrail', '##', '', 'done', '', '']
    ['##\x0ctrail', '##', '', 'done', '', '']

    For further details see

    - Python 2: `5. Built-in Types - Python 2.7.18 documentation
      <https://docs.python.org/2.7/library/stdtypes.html>`_
    - Python 3: `Built-in Types - Python 3.10.4 documentation
      <https://docs.python.org/3/library/stdtypes.html>`_

    .. _`universal newlines`: https://docs.python.org/3/glossary.html
    .. _`ssl2`: https://docs.python.org/2.7/library/stdtypes.html#str.splitlines
    .. _`unicode.splitlines`: https://docs.python.org/2.7/library/stdtypes.html#unicode.splitlines
    .. _`ssl3`: https://docs.python.org/3/library/stdtypes.html#str.splitlines
    .. _`bytes.splitlines`: https://docs.python.org/3/library/stdtypes.html#bytes.splitlines
    """
    if ((str_is_unicode and isinstance(string, str))
        or (not str_is_unicode and not isinstance(string, str))):
        # unicode string
        u8 = string.encode('utf8')
        lines = u8.splitlines()
        return [l.decode('utf8') for l in lines]
    # byte string
    return string.splitlines()

def poor_mans_splitlines(string):
    r"""
    """
    if str_is_unicode:
        native_uc_type = str
    else:
        native_uc_type = unicode
    if ((str_is_unicode and isinstance(string, str))
        or (not str_is_unicode and isinstance(string, native_uc_type))):
        # unicode string
        sep = '\r\n|\n'
        if not re.search(sep, string):
            sep = '\r'
        else:
            # |:info:|
            # if there is a single newline at the end, `$` matches that newline
            # if there are multiple newlines at the end, `$` matches before the last newline
            string += '\n'
        sep_end = '(' + sep + ')$'
        # prevent additional blank line at end
        string = re.sub(sep_end, '', string)
        return re.split(sep, string)
    # byte string
    return string.splitlines()
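
If only Python 3 and Unicode strings need to be supported, the same idea can be written more compactly. The following is a minimal sketch, not part of the function above; the name ascii_splitlines is hypothetical. It splits on the three traditional ASCII line endings and drops the single empty string produced by a trailing line ending, which mirrors what splitlines does.

```python
import re

# \r\n must come first so that a DOS line ending counts as one break.
_ASCII_NEWLINES = re.compile(r'\r\n|\r|\n')

def ascii_splitlines(text):
    """Split a str on \\r\\n, \\r and \\n only (Python 3 sketch)."""
    lines = _ASCII_NEWLINES.split(text)
    # A trailing line ending leaves one empty string at the end.
    if lines and lines[-1] == '':
        lines.pop()
    return lines

result = ascii_splitlines('279412 9k=\x0cN\r\n279448 9k<4N\n')
print(result)
```

Unlike the poor man's version above, this handles mixed line endings within one string, at the cost of always scanning with a regular expression.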
