Advanced multiple search and replace

Question

Problem: I want to batch-replace patterns in a file in a way that standard search-and-replace tools cannot handle:

Let's assume there is a File 1:

B
B
A
  B
B
B
A
  B
B
A
  B

I want to replace B with something else, but only a B that comes after an A.

Here is File 2, which holds the "rules" for how to search and replace:

A;B;C1
A;B;C2
A;B;C3

The ";" is the divider; it could be anything else. The script should search for an A, then continue to search for a B and replace that B with C1. Afterwards it should continue to the next occurrence of A, search for the next B, and replace that B with C2, and so on. Once the script has replaced a B with C3, it should stop, because there are no further rules.

The final file should look like this:

B
B
A
  C1
B
B
A
  C2
B
A
  C3

I would like to use Python for this, but it is not mandatory if there is an easier way.

Answer 1

You could implement something like this using regular expressions. re.finditer yields match objects that give the start/end position of each match, and re.sub accepts a count parameter that limits how many substitutions are made. You can start from this:

import re

data = '''B
B
A
  B
B
B
A
  B
B
A
  B'''

rules = [
    (r'A.*?(B)', r'C1'),
    (r'A.*?(B)', r'C2'),
    (r'A.*?(B)', r'C3'),
]

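startpos = 0
while rules:
    pattern, replacement = rules.pop(0)
    # Only the first match after startpos matters, so take one item and break.
    for g in re.finditer(pattern, data[startpos:], flags=re.DOTALL):
        # g.start(1) is relative to the slice; add startpos for the absolute position of B.
        pos = startpos + g.start(1)
        # re.escape treats the matched text literally; count=1 replaces only this occurrence.
        data = data[:pos] + re.sub(re.escape(g.group(1)), replacement, data[pos:], count=1)
        startpos = pos
        break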

print(data)

Prints:

B
B
A
  C1
B
B
A
  C2
B
A
  C3
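
The rules above are hardcoded. As a rough sketch, they could instead be built from the lines of File 2. The names build_rules and apply_rules are made up for this illustration, and the sketch assumes every rule line has exactly three fields. It also splices the replacement in directly instead of calling re.sub, which is a swapped-in simplification rather than the code from this answer:

```python
import re


def build_rules(lines, sep=";"):
    """Turn 'start;end;replacement' lines into (compiled pattern, replacement) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        start, end, replacement = line.split(sep)
        # re.escape makes start/end match literally, even if they
        # contain regex metacharacters such as "." or "*".
        pattern = re.compile(
            re.escape(start) + r".*?(" + re.escape(end) + r")", re.DOTALL
        )
        rules.append((pattern, replacement))
    return rules


def apply_rules(data, rules):
    """Apply each rule once, in order, each search starting after the previous replacement."""
    pos = 0
    for pattern, replacement in rules:
        m = pattern.search(data, pos)
        if m is None:
            break  # no further start/end pair in the text
        # Replace only captured group 1 (the "end" text).
        data = data[:m.start(1)] + replacement + data[m.end(1):]
        pos = m.start(1) + len(replacement)
    return data


text = "B\nB\nA\n  B\nB\nB\nA\n  B\nB\nA\n  B"
print(apply_rules(text, build_rules(["A;B;C1", "A;B;C2", "A;B;C3"])))
```

With a real rules file, the list passed to build_rules could simply be the open file object, since iterating over it yields lines.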

Answer 2

I started writing a regex-based solution, but @Andrej got there first! So here is a more "naive" approach that does not use regex.

#!/usr/bin/env python3
import sys


def read_rules(fpath="/tmp/test.rules", sep=";"):
    # Each non-empty line holds one rule: start<sep>end<sep>replacement
    rules = []
    with open(fpath) as f:
        for line in f:
            line = line.strip()
            if line:
                rules.append(line.split(sep))

    return rules


def parse_data(rules, fpath="/tmp/test.data"):
    # With no rules at all, every line is copied through unchanged
    cur_rule = rules[0] if rules else None
    rule_idx = 0
    data = []
    state = None

    with open(fpath) as f:
        for line in f:
            line = line.strip('\n')
            if not cur_rule:
                data.append(line)
                continue

            # We match start
            if cur_rule[0] in line and not state:
                # End matches in the same line and start < end
                # This case is not in your data
                if (
                    cur_rule[1] in line
                    and line.index(cur_rule[0]) < line.index(cur_rule[1])
                ):
                    new_line = line.replace(cur_rule[1], cur_rule[2], 1)
                    data.append(new_line)
                    rule_idx += 1

                    # We reached the end of rules
                    if len(rules) == rule_idx:
                        cur_rule = None
                    else:
                        cur_rule = rules[rule_idx]
                else:
                    # Set state to looking for end
                    state = 1
                    data.append(line)

                continue

            # Now, if here we are looking for end...
            if state == 1:
                # Nope... not found... move on
                if cur_rule[1] not in line:
                    data.append(line)
                    continue

                # replace
                data.append(
                    line.replace(cur_rule[1], cur_rule[2], 1)
                )

                # Reset state
                state = None

                rule_idx += 1

                # We reached the end of rules
                if len(rules) == rule_idx:
                    cur_rule = None
                else:
                    cur_rule = rules[rule_idx]
                continue

            # Here, no line matched
            data.append(line)


    return data


def main():
    rules = read_rules()
    print(rules)
    data = parse_data(rules)
    print("\n".join(data))


if __name__ == "__main__":
    sys.exit(main())

Explanation:

This is a line-by-line algorithm, which keeps it efficient for large inputs
It is "state" based: we are looking either for "start" (the first field of the current rule) or for "end" (the second field)
If start is found:
- If end appears later in the same line, perform the replacement and advance to the next rule
- If end does not appear later in the same line, change state and move to the next line
If we are in state=1 (looking for "end") and we find it in the current line, perform the replacement and advance to the next rule
Whenever we advance, if we have reached the end of the rules, cur_rule is set to None. All lines past that point are simply copied from the input to the output without any processing

Pros:

This should be faster for huge inputs. The output could also be written "on the fly" instead of being stored in memory
Easier to follow, I think
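
The "on the fly" idea could look roughly like the generator below. This is only a sketch: replace_stream is a made-up name, rules are assumed to be (start, end, replacement) triples, and it keeps the same limitation of at most one replacement per line:

```python
def replace_stream(lines, rules):
    """Yield lines one at a time, applying each (start, end, replacement) rule once, in order."""
    pending = iter(rules)
    rule = next(pending, None)
    seeking_end = False  # True once the current rule's start has been seen
    for line in lines:
        if rule is not None:
            start, end, replacement = rule
            search_from = 0
            if not seeking_end and start in line:
                seeking_end = True
                # On the start line itself, only text after the start counts.
                search_from = line.index(start) + len(start)
            if seeking_end:
                idx = line.find(end, search_from)
                if idx != -1:
                    line = line[:idx] + replacement + line[idx + len(end):]
                    rule = next(pending, None)  # None once the rules run out
                    seeking_end = False
        yield line


rules = [("A", "B", "C1"), ("A", "B", "C2"), ("A", "B", "C3")]
lines = ["B", "B", "A", "  B", "B", "B", "A", "  B", "B", "A", "  B", "A", "  B"]
for out in replace_stream(lines, rules):
    print(out)
```

Because lines are yielded as soon as they are processed, an open input file can be passed as lines and the result handed straight to writelines() on an output file, so neither file has to fit in memory.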

Cons:

It does not handle all cases, which is why I called it "naive". One example is two matches in the same line; another is a line that contains both "end" and "start" in that order (end first). It can be adjusted for such cases if necessary, but that might get complex, at which point a regex solution becomes more attractive

Output (note that I added one extra match to the input to check that processing stops when the rules run out):
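
If the whole file fits in memory, one way around the same-line cases is to drop the line-by-line state and scan the text as a single string with str.find. This is a sketch of that alternative, not an adjustment of the code above; replace_in_text is a made-up name and rules are again assumed to be (start, end, replacement) triples:

```python
def replace_in_text(text, rules):
    """Apply each (start, end, replacement) rule once, scanning the text left to right."""
    pos = 0
    for start, end, replacement in rules:
        a = text.find(start, pos)
        if a == -1:
            break  # no further start in the text
        b = text.find(end, a + len(start))
        if b == -1:
            break  # start found, but no end after it
        text = text[:b] + replacement + text[b + len(end):]
        # Continue after the replacement so it is never matched again.
        pos = b + len(replacement)
    return text


# Two start/end pairs on one line, the first preceded by a stray "end".
print(replace_in_text("B A B A B", [("A", "B", "C1"), ("A", "B", "C2")]))
```

The trade-off is the one mentioned under Pros: the input is no longer processed as a stream.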

B
B
A
  C1
B
B
A
  C2
B
A
  C3
A
  B
