Bash: find and replace lines in a file using lines from another file

Question

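I have two files: masterlist.txt, which has hundreds of lines of URLs, and toupdate.txt, which contains a smaller number of updated versions of lines from masterlist.txt that need to be replaced.

I'd like to automate this with Bash, since these lists are already created and used in a bash script.

The server part of the URL is the part that changes, so we can match on the unique part: /whatever/whatever_user.xml. But how do I find and replace those lines in masterlist.txt? That is, how do I go through each line of toupdate.txt and, given that it ends in /f_SomeName/f_SomeName_user.xml, find the line in masterlist.txt with that ending and replace the whole line with the new one?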

So https://123456url.domain.com/26/path/f_SomeName/f_SomeName_user.xml becomes https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml for example.

The rest of masterlist.txt needs to stay intact, so we must only find and replace lines that have a different server but the same line ending (ID).

Structure

masterlist.txt looks like this:

https://123456url.domain.com/26/path/f_SomeName/f_SomeName_user.xml
https://456789url.domain.com/32/path/f_AnotherName/f_AnotherName_user.xml
https://101112url.domain.com/1/path/g_SomethingElse/g_SomethingElse_user.xml
https://222blah11.domain.com/19/path/e_BlahBlah/e_BlahBlah_user.xml
[...]

toupdate.txt looks like this:

https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml
https://foo-254.domain.com/8/path/g_SomethingElse/g_SomethingElse_user.xml

Desired result

Make masterlist.txt look like:

https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml
https://456789url.domain.com/32/path/f_AnotherName/f_AnotherName_user.xml
https://foo-254.domain.com/8/path/g_SomethingElse/g_SomethingElse_user.xml
https://222blah11.domain.com/19/path/e_BlahBlah/e_BlahBlah_user.xml
[...]

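Initial investigation

I've looked at sed, but I don't know how to do the find and replace using lines from the two files.

Here's what I have so far, which at least handles reading the file: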

#!/bin/bash

#...

while read -r line; do
    # there's a new link on each line
    link="${line}"
    # extract the unique part from the end of each line
    grabXML="${link##*/}"
    grabID="${grabXML%_user.xml}"
    # if we cannot grab the ID, then just set it to use the full link so we don't have an empty string
    if [ -n "${grabID}" ]; then
        identifier=${grabID}
    else
        identifier="${line}"
    fi
    
    ## the find and replace here? ##    

# we're done when we've reached the end of the file
done < "masterlist.txt"

Answer 1

Would you please try the following:

#!/bin/bash

declare -A map
while IFS= read -r line; do
    if [[ $line =~ (/[^/]+/[^/]*\.xml)$ ]]; then
        uniq_part="${BASH_REMATCH[1]}"
        map[$uniq_part]=$line
    fi
done < "toupdate.txt"

while IFS= read -r line; do
    if [[ $line =~ (/[^/]+/[^/]*\.xml)$ ]]; then
        uniq_part="${BASH_REMATCH[1]}"
        if [[ -n ${map[$uniq_part]} ]]; then
            line=${map[$uniq_part]}
        fi
    fi
    echo "$line"
done < "masterlist.txt" > "masterlist_tmp.txt"

# if the result of "masterlist_tmp.txt" is good enough, uncomment the line below
# mv -f -- "masterlist_tmp.txt" "masterlist.txt"

Result:

https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml
https://456789url.domain.com/32/path/f_AnotherName/f_AnotherName_user.xml
https://foo-254.domain.com/8/path/g_SomethingElse/g_SomethingElse_user.xml
https://222blah11.domain.com/19/path/e_BlahBlah/e_BlahBlah_user.xml

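[Explanation]

The associative array map maps the "unique part", such as /f_SomeName/f_SomeName_user.xml, to the "full path", such as https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml.
If the regex (/[^/]+/[^/]*\.xml)$ matches, the shell variable BASH_REMATCH[1] is set to the substring running from the second-rightmost slash to the ".xml" extension at the end of the string.
The first loop, over the file "toupdate.txt", stores each "unique part" and "full path" pair as a key-value pair in the associative array.
The second loop, over the file "masterlist.txt", tests whether the extracted "unique part" has an associated value. If so, the line is replaced with that value, which is the corresponding line from "toupdate.txt".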
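To see the capture and the lookup in isolation, here is a minimal sketch that runs the same regex against one old/new pair of sample URLs from the question. It assumes bash 4 or later (for declare -A); the variable names re, key and result are only illustrative, and keeping the pattern in a variable is a habit to sidestep quoting surprises, not something the answer requires.

```shell
#!/bin/bash
# Sketch only: one update line and one master line, both taken from the question.
re='(/[^/]+/[^/]*\.xml)$'   # same pattern as in the answer, kept in a variable
declare -A map

new='https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml'
old='https://123456url.domain.com/26/path/f_SomeName/f_SomeName_user.xml'

# Store the updated URL under its unique part.
if [[ $new =~ $re ]]; then
    key=${BASH_REMATCH[1]}
    map[$key]=$new
fi

# Look the old URL up by the same unique part; keep it unchanged if there is no entry.
result=$old
if [[ $old =~ $re ]]; then
    key=${BASH_REMATCH[1]}
    if [[ -n ${map[$key]:-} ]]; then
        result=${map[$key]}
    fi
fi

printf 'key:    %s\nresult: %s\n' "$key" "$result"
```

Here key comes out as /f_SomeName/f_SomeName_user.xml and result is the new-123 URL, because both lines share that ending.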

[Alternative]
If the text files are large, bash may not be fast enough. In that case, an awk script will work more efficiently:

awk 'NR==FNR {
    if (match($0, "/[^/]+/[^/]*\\.xml$")) {
        map[substr($0, RSTART, RLENGTH)] = $0
    }
    next
}
{
    if (match($0, "/[^/]+/[^/]*\\.xml$")) {
        full_path = map[substr($0, RSTART, RLENGTH)]
        if (full_path != "") {
            $0 = full_path
        }
    }
    print
}' "toupdate.txt" "masterlist.txt" > "masterlist_tmp.txt"

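[Explanation]

The NR==FNR { BLOCK1; next } { BLOCK2 } syntax is a common idiom for handling each file separately. Because the NR==FNR condition holds only for the first file in the argument list, and the next statement skips the following block, BLOCK1 processes only the file "toupdate.txt". Likewise, BLOCK2 processes only the file "masterlist.txt".
If the function match($0, pattern) succeeds, it sets the awk variable RSTART to the start position of the matched substring within $0, the current record read from the file, and sets the variable RLENGTH to the length of the matched substring. We can then use the substr() function to extract the matched substring, such as /f_SomeName/f_SomeName_user.xml.
We then assign to the array map so that the substring (the unique part) maps to the whole URL from "toupdate.txt".
The second block works much like the first. If the array map holds a value for the key, the record ($0) is replaced with the value stored under that key.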
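The following sketch traces which block handles which file. It writes a few sample lines from the question to scratch files in a temporary directory (the directory and the helper function key are assumptions made for the demonstration) and labels every record with the block that saw it and the key that match() extracts.

```shell
# Scratch copies of a few sample lines from the question.
tmpdir=$(mktemp -d)
printf '%s\n' \
    'https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml' \
    > "$tmpdir/toupdate.txt"
printf '%s\n' \
    'https://123456url.domain.com/26/path/f_SomeName/f_SomeName_user.xml' \
    'https://456789url.domain.com/32/path/f_AnotherName/f_AnotherName_user.xml' \
    > "$tmpdir/masterlist.txt"

trace=$(awk '
    # Return the unique part of a URL, or "(none)" if the pattern does not match.
    function key(s) {
        return match(s, "/[^/]+/[^/]*\\.xml$") ? substr(s, RSTART, RLENGTH) : "(none)"
    }
    NR==FNR { print "BLOCK1 " key($0); next }   # records of the first file only
            { print "BLOCK2 " key($0) }         # records of the second file
' "$tmpdir/toupdate.txt" "$tmpdir/masterlist.txt")

printf '%s\n' "$trace"
rm -rf "$tmpdir"
```

The trace shows one BLOCK1 line followed by two BLOCK2 lines. One caveat with this idiom: if the first file is empty, NR==FNR is also true while reading the second file, so it may be worth checking that "toupdate.txt" is non-empty before relying on it.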

Answer 2

Why not let sed write its own script to generate the desired output:

sed -e "$(sed -e 's<^\(http[s]*://[^/]*/[^/]*/\)\(.*\)<\\|\2\$| s|.*|\1\2|<' toupdate.txt)" masterlist.txt

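where

the inner sed command (the one inside $(...)) involves two s commands: the one it executes itself, and the one it emits as text for the outer sed
the executed s (s<...<...<) captures scheme://domain/N/ as \1 and the rest of the path (.*) as \2, and inserts them into a command for the outer sed script
the generated outer sed script (\|\2$| s|.*|\1\2|) looks in masterlist.txt for URLs that end with the rest of the path and replaces the whole line (the emitted s) with the new URL from toupdate.txt
to avoid a lot of backslash escaping, < and | are used as delimiters for the two s commands, and \|...| is used in place of /.../ for the address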
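It can help to run the inner sed on its own and read the script it generates before handing that script to the outer sed. The sketch below does so against a scratch copy of the two sample update lines; the temporary directory and the variable name generated are assumptions made for the demonstration.

```shell
# Scratch copy of the sample toupdate.txt from the question.
tmpdir=$(mktemp -d)
printf '%s\n' \
    'https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml' \
    'https://foo-254.domain.com/8/path/g_SomethingElse/g_SomethingElse_user.xml' \
    > "$tmpdir/toupdate.txt"

# Run only the inner sed: its output is the script the outer sed would execute.
generated=$(sed -e 's<^\(http[s]*://[^/]*/[^/]*/\)\(.*\)<\\|\2\$| s|.*|\1\2|<' "$tmpdir/toupdate.txt")

printf '%s\n' "$generated"
rm -rf "$tmpdir"
```

Each generated line has the form \|rest-of-path$| s|.*|new-url|, one per line of toupdate.txt. Since | is the delimiter and & is special in an s replacement, URLs containing either character would presumably need extra escaping before this approach is safe.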