
How to make str.splitlines method not to split line by hex characters?

I'm trying to parse the output of the GNU Strings utility with str.splitlines(). Here is the raw output from GNU Strings:

279304 9k=pN\n 279340 9k=PN\n 279376 9k<LN\n 279412 9k=\x0cN\n 279448 9k<4N\n

When I parse the output with the following code:

import subprocess

process = subprocess.run(['strings', '-o', main_exe], check=True,
                        stdout=subprocess.PIPE, universal_newlines=True)
output = process.stdout
print(output)
lines = output.splitlines()
for line in lines:
    print(line)

I get a result I don't expect, and it breaks my further parsing:

279304 9k=pN
279340 9k=PN
279376 9k<LN
279412 9k=
          N
279448 9k<4N
279592 9k<hN
279628 9k;TN
279664 9k<$N

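Can I somehow tell the splitlines() method not to trigger on the \x0c character?

The desired result should have lines that start with an offset (the 6 digits at the beginning of each line):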

279304 9k=pN
279340 9k=PN
279376 9k<LN
279412 9k=N
279448 9k<4N
279592 9k<hN
279628 9k;TN
279664 9k<$N

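I think you actually get the expected result. Assuming ASCII or any of its derivatives (Latin-x, UTF-8, and so on), '\x0c' is the control character FormFeed, which happens to be rendered here as a vertical jump of one line.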

In other words, I'd bet the resulting file contains the expected bytes, but your further processing chokes on the control character.
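A minimal sketch of how to check this, using one line of the question's data as a literal string (the variable names are only illustrative). It shows that the form feed is an ordinary, non-printable control character that is still present in the data:

```python
import unicodedata

# One line of the question's data; \x0c sits between '=' and 'N'.
line = '279412 9k=\x0cN'
ff = '\x0c'

is_form_feed = (ff == '\f')               # \x0c and \f are the same character
category = unicodedata.category(ff)       # Unicode general category of the character
printable = ff.isprintable()              # control characters are not printable

print(is_form_feed, category, printable)

# repr() makes the otherwise invisible character visible again.
print(repr(line))
```

Printing lines through repr() while debugging makes it easier to see whether a control character is still in the data or was lost during splitting.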

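The str.splitlines() documentation says that it splits on a number of line boundary types, including \x0c. If you only want to split explicitly on \n, you can use str.split('\n') instead. Note, however, that if your data ends with \n, you end up with an empty trailing element, which you may want to remove if the last item is an empty string.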

data = '279304 9k=pN\n 279340 9k=PN\n 279376 9k<LN\n 279412 9k=\x0cN\n 279448 9k<4N\n'
lines = data.split('\n')
if lines[-1] == '':
    lines.pop()
print(lines)
for line in lines:
    print(line)

Output

['279304 9k=pN', ' 279340 9k=PN', ' 279376 9k<LN', ' 279412 9k=\x0cN', ' 279448 9k<4N']
279304 9k=pN
 279340 9k=PN
 279376 9k<LN
 279412 9k=N
 279448 9k<4N

Applied to the subprocess call from the question, splitting on '\n' only looks like this:

process = subprocess.run(['strings', '-o', main_exe], check=True,
                        stdout=subprocess.PIPE, universal_newlines=True)
lines = [line.strip() for line in process.stdout.split('\n') if line]

If you do want to keep the leading whitespace on each line, remove the call to strip().
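To see which characters are affected, the following sketch probes the line boundaries listed in the str.splitlines documentation and compares str with bytes. The candidate list and variable names are chosen for illustration only:

```python
# Characters the str.splitlines documentation lists as line boundaries
# (\r\n is left out because it is a two-character sequence).
candidates = ['\n', '\r', '\x0b', '\x0c', '\x1c', '\x1d', '\x1e',
              '\x85', '\u2028', '\u2029']

str_boundaries = []
bytes_boundaries = []
for ch in candidates:
    text = 'a' + ch + 'b'
    # A character is a boundary if it produces more than one line.
    if len(text.splitlines()) > 1:
        str_boundaries.append(ch)
    if len(text.encode('utf-8').splitlines()) > 1:
        bytes_boundaries.append(ch)

print('str  :', [repr(c) for c in str_boundaries])
print('bytes:', [repr(c) for c in bytes_boundaries])
```

On Python 3, every candidate splits a str, while only \n and \r split a bytes object.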

Your problem stems from using the splitlines method of a Unicode string, which produces different results than the splitlines method of a byte string.

There is a CPython issue about this, open since 2014: str.splitlines splitting on non-\r\n characters - Issue #66428 - python/cpython
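One way to use this difference on Python 3 is to keep the output of the process as bytes, split it, and decode each line afterwards. The sketch below assumes that subprocess.run() is called without universal_newlines=True, so that stdout is a bytes object. The helper name split_ascii_lines and the sample data are hypothetical stand-ins:

```python
def split_ascii_lines(raw, encoding='utf-8'):
    """Split bytes on \\n, \\r and \\r\\n only, then decode each line."""
    return [line.decode(encoding, errors='replace') for line in raw.splitlines()]

# Stand-in for process.stdout when subprocess.run() is called without
# universal_newlines=True (stdout is then a bytes object).
raw_output = b'279376 9k<LN\n279412 9k=\x0cN\n279448 9k<4N\n'

lines = split_ascii_lines(raw_output)
for line in lines:
    print(repr(line))
```

The form feed stays inside its line, and every line still starts with the offset.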

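Below I have added a portable splitlines function that uses the traditional ASCII line breaks for both Unicode and byte strings, and works under both Python 2 and Python 3. A poor man's version for efficiency enthusiasts is also provided.

  • In Python 2, type str is an 8-bit string and Unicode strings have type unicode.
  • In Python 3, type str is a Unicode string and 8-bit strings have type bytes.

Although there is no actual difference in line splitting between Python 2 and Python 3 for Unicode and 8-bit strings, vanilla code running under Python 3 is more likely to run into trouble with the extended universal newlines approach for Unicode strings.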

The following table shows which Python data type employs which splitting method.

Split Method   Python 2             Python 3
ASCII          str.splitlines       bytes.splitlines
Unicode        unicode.splitlines   str.splitlines

import re

str_is_unicode = len('a\fa'.splitlines()) > 1

def splitlines(string): # ||:fnc:||
    r"""Portable definitive ASCII splitlines function.

    In Python 2, type :class:`str` is an 8-bit string and Unicode strings
    have type :class:`unicode`.

    In Python 3, type :class:`str` is a Unicode string and 8-bit strings
    have type :class:`bytes`.

    Although there is no actual difference in line splitting between
    Python 2 and Python 3 Unicode and 8-bit strings, when running
    vanilla code under Python 3, it is more likely to run into trouble
    with the extended `universal newlines`_ approach for Unicode
    strings.

    The following table shows which Python data type employs which
    splitting method.

    +--------------+---------------------------+---------------------------+
    | Split Method | Python 2                  | Python 3                  |
    +==============+===========================+===========================+
    | ASCII        | `str.splitlines <ssl2_>`_ | `bytes.splitlines`_       |
    +--------------+---------------------------+---------------------------+
    | Unicode      | `unicode.splitlines`_     | `str.splitlines <ssl3_>`_ |
    +--------------+---------------------------+---------------------------+
    
    This function provides a portable and definitive method to apply
    ASCII `universal newlines`_ for line splitting. The reencoding is
    performed to take advantage of splitlines' `universal newlines`_
    approach for Unix, DOS and Macintosh line endings.

    While the poor man's version of simply splitting on \\n might seem
    more performant, it falls short, when a mixture of Unix, DOS and
    Macintosh line endings are encountered. Just for reference, a
    general implementation is presented, which avoids some common
    pitfalls.

    >>> test_strings = (
    ...     "##\ftrail\n##\n\ndone\n\n\n",
    ...     "##\ftrail\n##\n\ndone\n\n\nxx",
    ...     "##\ftrail\n##\n\ndone\n\nx\n",
    ...     "##\ftrail\r##\r\rdone\r\r\r",
    ...     "##\ftrail\r\n##\r\n\r\ndone\r\n\r\n\r\n")

    The global variable :data:`str_is_unicode` determines portably,
    whether a :class:`str` object is a Unicode string.

    .. code-block:: python

       str_is_unicode = len('a\fa'.splitlines()) > 1

    This allows defining some generic conversion functions:

    >>> if str_is_unicode:
    ...     make_native_str = lambda s, e=None: getattr(s, 'decode', lambda _e: s)(e or 'utf8')
    ...     make_uc_string = make_native_str
    ...     make_u8_string = lambda s, e=None: ((isinstance(s, str) and (s.encode(e or 'utf8'), 1)) or (s, 1))[0]
    ... else:
    ...     make_native_str = lambda s, e=None: ((isinstance(s, unicode) and (s.encode(e or 'utf8'), 1)) or (s, 1))[0]
    ...     make_u8_string =  make_native_str
    ...     make_uc_string = lambda s, e=None: ((not isinstance(s, unicode) and (s.decode(e or 'utf8'), 1)) or (s, 1))[0]

    for a portable doctest:

    >>> for test_string in test_strings:
    ...     print('--------------------')
    ...     print(repr(test_string))
    ...     print(repr([make_native_str(_l) for _l in splitlines(make_u8_string(test_string))]))
    ...     print(repr([make_native_str(_l) for _l in poor_mans_splitlines(make_u8_string(test_string))]))
    ...     print([make_native_str(_l) for _l in splitlines(make_uc_string(test_string))])
    ...     print([make_native_str(_l) for _l in poor_mans_splitlines(make_uc_string(test_string))])
    --------------------
    '##\x0ctrail\n##\n\ndone\n\n\n'
    ['##\x0ctrail', '##', '', 'done', '', '']
    ['##\x0ctrail', '##', '', 'done', '', '']
    ['##\x0ctrail', '##', '', 'done', '', '']
    ['##\x0ctrail', '##', '', 'done', '', '']
    --------------------
    '##\x0ctrail\n##\n\ndone\n\n\nxx'
    ['##\x0ctrail', '##', '', 'done', '', '', 'xx']
    ['##\x0ctrail', '##', '', 'done', '', '', 'xx']
    ['##\x0ctrail', '##', '', 'done', '', '', 'xx']
    ['##\x0ctrail', '##', '', 'done', '', '', 'xx']
    --------------------
    '##\x0ctrail\n##\n\ndone\n\nx\n'
    ['##\x0ctrail', '##', '', 'done', '', 'x']
    ['##\x0ctrail', '##', '', 'done', '', 'x']
    ['##\x0ctrail', '##', '', 'done', '', 'x']
    ['##\x0ctrail', '##', '', 'done', '', 'x']
    --------------------
    '##\x0ctrail\r##\r\rdone\r\r\r'
    ['##\x0ctrail', '##', '', 'done', '', '']
    ['##\x0ctrail', '##', '', 'done', '', '']
    ['##\x0ctrail', '##', '', 'done', '', '']
    ['##\x0ctrail', '##', '', 'done', '', '']
    --------------------
    '##\x0ctrail\r\n##\r\n\r\ndone\r\n\r\n\r\n'
    ['##\x0ctrail', '##', '', 'done', '', '']
    ['##\x0ctrail', '##', '', 'done', '', '']
    ['##\x0ctrail', '##', '', 'done', '', '']
    ['##\x0ctrail', '##', '', 'done', '', '']

    For further details see

    - Python 2: `5. Built-in Types - Python 2.7.18 documentation
      <https://docs.python.org/2.7/library/stdtypes.html>`_
    - Python 3: `Built-in Types - Python 3.10.4 documentation
      <https://docs.python.org/3/library/stdtypes.html>`_

    .. _`universal newlines`: https://docs.python.org/3/glossary.html
    .. _`ssl2`: https://docs.python.org/2.7/library/stdtypes.html#str.splitlines
    .. _`unicode.splitlines`: https://docs.python.org/2.7/library/stdtypes.html#unicode.splitlines
    .. _`ssl3`: https://docs.python.org/3/library/stdtypes.html#str.splitlines
    .. _`bytes.splitlines`: https://docs.python.org/3/library/stdtypes.html#bytes.splitlines
    """
    if ((str_is_unicode and isinstance(string, str))
        or (not str_is_unicode and not isinstance(string, str))):
        # unicode string
        u8 = string.encode('utf8')
        lines = u8.splitlines()
        return [l.decode('utf8') for l in lines]
    # byte string
    return string.splitlines()

def poor_mans_splitlines(string):
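    r"""Split on \r\n and \n, or on \r alone if neither occurs.

    Unicode strings are split with :mod:`re`, so form feed and other
    extended line boundaries are left intact. Byte strings are passed
    to their own splitlines method.
    """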
    if str_is_unicode:
        native_uc_type = str
    else:
        native_uc_type = unicode
    if ((str_is_unicode and isinstance(string, str))
        or (not str_is_unicode and isinstance(string, native_uc_type))):
        # unicode string
        sep = '\r\n|\n'
        if not re.search(sep, string):
            sep = '\r'
        else:
            # |:info:|
            # if there is a single newline at the end, `$` matches that newline
            # if there are multiple newlines at the end, `$` matches before the last newline
            string += '\n'
        sep_end = '(' + sep + ')$'
        # prevent additional blank line at end
        string = re.sub(sep_end, '', string)
        return re.split(sep, string)
    # byte string
    return string.splitlines()
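For Python 3 only code, a shorter variant is possible. The following is a sketch under the assumption that the input is a str and that only \r\n, \r and \n should count as line breaks. The function name split_ascii_newlines is hypothetical and not part of the answer above:

```python
import re

# \r\n must come first so that a DOS line ending is not split twice.
_ASCII_NEWLINE = re.compile(r'\r\n|\r|\n')

def split_ascii_newlines(text):
    """Split a str on \\r\\n, \\r and \\n only, like bytes.splitlines does."""
    parts = _ASCII_NEWLINE.split(text)
    # A trailing line break leaves one empty element at the end; drop it
    # to match the behaviour of splitlines().
    if parts and parts[-1] == '':
        parts.pop()
    return parts

result = split_ascii_newlines("##\ftrail\r\n##\r\n\r\ndone\r\n\r\n\r\n")
print(result)
```

Unlike the poor man's version, this also handles a mixture of Unix, DOS and Macintosh line endings within one string, at the cost of always going through the regular expression engine.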
