Search and replace specific strings with floating point representations in Python

Question

Problem: I'm trying to replace multiple specific sequences in a string with multiple specific sequences of floating point representations in Python.

I have an array of strings in a JSON file, which I load into a Python script via the json module. The array of strings:

{
  "LinesToReplace": [
    "_ __ ___ ____ _____ ______ _______      ",
    "_._ __._ ___._ ____._ _____._ ______._  ",
    "_._ _.__ _.___ _.____ _._____ _.______  ",
    "_._ __.__ ___.___ ____.____ _____._____ ",
    "_. __. ___. ____. _____. ______.        "
  ]
}

I load the JSON file via the json module:

with open("myFile.json") as jsonFile:
  data = json.load(jsonFile)

I'm trying to replace the sequences of _ with specific substrings of floating point representations.

Specification:

The character to find in a string must be a single _ or a sequence of multiple _ .
The length of the sequence of _ is unknown.
If a single _ or a sequence of multiple _ is followed by a . , which is again followed by a single _ or a sequence of multiple _ , the . is part of the _ -sequence.
The . is used to specify decimals.
If the . isn't followed by a single _ or a sequence of multiple _ , the . is not part of the _ -sequence.
The sequence of _ and . is to be replaced by floating point representations, i.e., %f1.0 .
The representations are dependent on the _ - and . -sequences.

Examples:

__ is to be replaced by %f2.0
_.___ is to be replaced by %f1.3
____.__ is to be replaced by %f4.2
___. is to be replaced by %f3.0. (the trailing . is not part of the sequence, so it stays in place)
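The mapping in these examples can be sketched as a small helper. This is only an illustration: the name format_spec is hypothetical, and it assumes the placeholder has already been isolated, so a trailing . that is not part of the sequence is not passed in.

```python
def format_spec(placeholder: str) -> str:
    """Map an isolated placeholder such as '____.__' to '%f4.2'.

    Hypothetical helper: assumes the input consists only of underscores,
    optionally followed by a dot and more underscores.
    """
    # partition splits at the first dot; without a dot, decimals is ''.
    whole, _, decimals = placeholder.partition(".")
    return f"%f{len(whole)}.{len(decimals)}"


for sample in ["__", "_.___", "____.__"]:
    print(sample, "->", format_spec(sample))
```

The number before the dot is the count of leading underscores, and the number after it is the count of underscores following the dot (0 if there is none).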

For the above JSON file, the result should be:

{
  "ReplacedLines": [
    "%f1.0 %f2.0 %f3.0 %f4.0 %f5.0 %f6.0 %f7.0      ",
    "%f1.1 %f2.1 %f3.1 %f4.1 %f5.1 %f6.1  ",
    "%f1.1 %f1.2 %f1.3 %f1.4 %f1.5 %f1.6  ",
    "%f1.1 %f2.2 %f3.3 %f4.4 %f5.5 ",
    "%f1.0. %f2.0. %f3.0. %f4.0. %f5.0. %f6.0.        "
  ]
}

Some code that tries to replace a single _ with %f1.0 (it doesn't work...):

with open("myFile.json") as jsonFile:
  data = json.load(jsonFile)
  strToFind = "_"
  
  for line in data["LinesToReplace"]:
    for idl, l in enumerate(line):
      if (line[idl] == strToFind and line[idl+1] != ".") and (line[idl+1] != strToFind and line[idl-1] != strToFind):
        l = l[:idl] + "%f1.0" + l[idl+1:] # replace string

Any ideas on how to do this? I have also thought about using regular expressions.

EDIT

The algorithm should be able to check whether a character is a "_", i.e. it should be able to format this:

{
  "LinesToReplace": [
    "Ex1:_ Ex2:_. Ex3:._ Ex4:_._ Ex5:_._.    ",
    "Ex6:._._ Ex7:._._. Ex8:__._ Ex9: _.__   ",
    "Ex10: _ Ex11: _. Ex12: ._ Ex13: _._     ",
    "Ex5:._._..Ex6:.._._.Ex7:.._._._._._._._."
  ]
}

Solution:

{
  "LinesToReplace": [
    "Ex1:%f1.0 Ex2:%f1.0. Ex3:.%f1.0 Ex4:%f1.1 Ex5:%f1.1.    ",
    "Ex6:.%f1.1 Ex7:.%f1.1. Ex8:%f2.1 Ex9: %f1.2   ",
    "Ex10: %f1.0 Ex11: %f1.0. Ex12: .%f1.0 Ex13: %f1.1     ",
    "Ex5:.%f1.1..Ex6:..%f1.1.Ex7:..%f1.1.%f1.1.%f1.1.%f1.0."
  ]
}

I have tried the following algorithm based on the above criteria, but I can't figure out how to implement it:

def replaceFunc3(lines: list[str]) -> list[str]:
    result = []
    charToFind = '_'
    charMatrix = []

    # Find indices of all "_" in lines
    for line in lines:
        charIndices = [idx for idx, c in enumerate(line) if c == charToFind]
        charMatrix.append(charIndices)

    for (line, char) in zip(lines, charMatrix):
        if not char: # No "_" in current line, append the whole line
            result.append(line)
        else:
            pass
            # result.append(Something)
            # TODO: Insert "%fx.x" on all the placeholders

    return result

Answer 1

Neat problem. Personally, here is how I would do it:

from pprint import pprint

d = {
    "LinesToReplace": [
        "_ __ ___ ____ _____ ______ _______      ",
        "_._ __._ ___._ ____._ _____._ ______._  ",
        "_._ _.__ _.___ _.____ _._____ _.______  ",
        "_._ __.__ ___.___ ____.____ _____._____ ",
        "_. __. ___. ____. _____. ______.        "
    ]
}


def get_replaced_lines(lines: list[str]) -> list[str]:
    result = []

    for line in lines:
        trimmed_line = line.rstrip()
        trailing_spaces = len(line) - len(trimmed_line)

        underscores = trimmed_line.split()
        repl_line = []

        for s in underscores:
            n = len(s)

            if '.' in s:
                if s.endswith('.'):
                    repl_line.append(f'%f{n - 1}.0.')
                else:
                    idx = s.index('.')
                    repl_line.append(f'%f{idx}.{n - idx - 1}')

            else:
                repl_line.append(f'%f{n}.0')

        result.append(' '.join(repl_line) + ' ' * trailing_spaces)

    return result


if __name__ == '__main__':
    pprint(get_replaced_lines(d['LinesToReplace']))

Output:

['%f1.0 %f2.0 %f3.0 %f4.0 %f5.0 %f6.0 %f7.0      ',
 '%f1.1 %f2.1 %f3.1 %f4.1 %f5.1 %f6.1  ',
 '%f1.1 %f1.2 %f1.3 %f1.4 %f1.5 %f1.6  ',
 '%f1.1 %f2.2 %f3.3 %f4.4 %f5.5 ',
 '%f1.0. %f2.0. %f3.0. %f4.0. %f5.0. %f6.0.        ']

If you're curious, I've also timed it against the alternative regex approach, and found this one to be about 40% faster overall. I mostly like this test because it shows that, at least for this input, regex is a little slower than just doing it by hand. That said, the regex approach is nice because it is certainly shorter :-)

Here is my test code:

import re
from timeit import timeit

d = {
    "LinesToReplace": [
        "_ __ ___ ____ _____ ______ _______      ",
        "_._ __._ ___._ ____._ _____._ ______._  ",
        "_._ _.__ _.___ _.____ _._____ _.______  ",
        "_._ __.__ ___.___ ____.____ _____._____ ",
        "_. __. ___. ____. _____. ______.        "
    ]
}


def get_replaced_lines(lines: list[str]) -> list[str]:
    result = []
    dot = '.'
    space = ' '

    for line in lines:
        trimmed_line = line.rstrip()
        trailing_spaces = len(line) - len(trimmed_line)

        underscores = trimmed_line.split()
        repl_line = []

        for s in underscores:
            n = len(s)

            if dot in s:
                if s[n - 1] == dot:  # if last character is a '.'
                    repl_line.append(f'%f{n - 1}.0.')
                else:
                    idx = s.index(dot)
                    repl_line.append(f'%f{idx}.{n - idx - 1}')

            else:
                repl_line.append(f'%f{n}.0')

        result.append(space.join(repl_line) + space * trailing_spaces)

    return result


def get_replaced_lines_regex(lines_to_replace):
    return [re.sub(
        '(_+)([.]_+)?',
        lambda m: f'%f{len(m.group(1))}.{len(m.group(2) or ".")-1}',
        line,
    ) for line in lines_to_replace]


if __name__ == '__main__':
    n = 100_000

    time_1 = timeit("get_replaced_lines(d['LinesToReplace'])", number=n, globals=globals())
    time_2 = timeit("get_replaced_lines_regex(d['LinesToReplace'])", number=n, globals=globals())

    print(f'get_replaced_lines:        {time_1:.3f}')
    print(f'get_replaced_lines_regex:  {time_2:.3f}')

    print(f'The first (non-regex) approach is faster by {(1 - time_1 / time_2) * 100:.2f}%')

    assert get_replaced_lines(d['LinesToReplace']) == get_replaced_lines_regex(d['LinesToReplace'])

Results on my M1 Mac:

get_replaced_lines:        0.813
get_replaced_lines_regex:  1.359
The first (non-regex) approach is faster by 40.14%

Answer 2

You can use re.sub together with a replacement function that performs the logic on the capture groups:

import re

def replace(line):
    return re.sub(
        '(_+)([.]_+)?',
        lambda m: f'%f{len(m.group(1))}.{len(m.group(2) or ".")-1}',
        line,
    )

lines = [replace(line) for line in lines_to_replace]
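As a usage sketch, the function above can be wired to the JSON data from the question. The snippet below makes a few assumptions that are not in the original: the JSON is parsed from an inline string instead of myFile.json so that it runs on its own, only three of the question's lines are included, and the pattern is precompiled under the hypothetical name PLACEHOLDER.

```python
import json
import re

# Same pattern as above, compiled once (PLACEHOLDER is an illustrative name).
PLACEHOLDER = re.compile(r"(_+)([.]_+)?")


def replace(line: str) -> str:
    # Group 1: leading underscores; group 2: optional dot plus underscores.
    return PLACEHOLDER.sub(
        lambda m: f"%f{len(m.group(1))}.{len(m.group(2) or '.') - 1}",
        line,
    )


# Inline stand-in for the JSON file from the question (assumption).
raw = """
{
  "LinesToReplace": [
    "_ __ ___ ____ _____ ______ _______",
    "_. __. ___. ____. _____. ______.",
    "Ex1:_ Ex2:_. Ex3:._ Ex4:_._ Ex5:_._."
  ]
}
"""

data = json.loads(raw)
result = {"ReplacedLines": [replace(line) for line in data["LinesToReplace"]]}
print(json.dumps(result, indent=2))
```

Because re.sub only touches the matched placeholders, surrounding text such as "Ex1:" and any dots that are not part of a sequence are left where they are, which is what the EDIT in the question asks for.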

Explanation of the regex:

(_+) matches one or more underscores; the () part makes them available as a capture group (the first such group, i.e. m.group(1) ).
([.]_+)? optionally matches a dot followed by one or more trailing underscores (made optional by the trailing ? ); the dot is part of a character class ( [] ) because otherwise it would have the special meaning "any character". The () make this part available as the second capture group ( m.group(2) ). When this optional part does not match, m.group(2) is None, so the replacement falls back to "." , whose length minus one gives 0 decimals.
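To see what the two capture groups hold, here is a small sketch that inspects the first match in a few sample strings. The helper name describe is hypothetical and only serves the illustration.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
pattern = re.compile(r"(_+)([.]_+)?")


def describe(sample: str):
    """Return (whole match, group 1, group 2) for the first placeholder."""
    m = pattern.search(sample)
    return m.group(0), m.group(1), m.group(2)


for sample in ["___", "__._", "_."]:
    print(sample, "->", describe(sample))
```

For "___" and "_." the second group is None, because no dot followed by underscores is present; in "_." the dot is therefore not consumed by the match and stays in the output. For "__._" the second group is "._", so its length minus one gives the single decimal.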

2 solutions

Solution 1
1 2022-05-12 19:32:39

Solution 2
1 Accepted 2022-05-12 19:45:53
