Bash: Find and replace lines in a file using the lines of another file

I have two files: masterlist.txt, which has hundreds of lines of URLs, and toupdate.txt, which has a smaller number of updated versions of lines from masterlist.txt that need to be replaced.

I'd like to automate this process using Bash, since the creation and utilisation of these lists is already occurring in a Bash script.

The server part of the URL is the part that changes, so we could match using the unique part: /whatever/whatever_user.xml, but how do I find and replace those lines in masterlist.txt? That is, how do I go through each line of toupdate.txt and, as it ends in /f_SomeName/f_SomeName_user.xml, find that ending in masterlist.txt and replace that whole line with the new one?

So https://123456url.domain.com/26/path/f_SomeName/f_SomeName_user.xml becomes https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml, for example.

The rest of masterlist.txt needs to stay intact, so we must only find and replace lines that have different servers for the same line endings (IDs).

Structure

masterlist.txt looks like this:

https://123456url.domain.com/26/path/f_SomeName/f_SomeName_user.xml
https://456789url.domain.com/32/path/f_AnotherName/f_AnotherName_user.xml
https://101112url.domain.com/1/path/g_SomethingElse/g_SomethingElse_user.xml
https://222blah11.domain.com/19/path/e_BlahBlah/e_BlahBlah_user.xml
[...]

toupdate.txt looks like this:

https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml
https://foo-254.domain.com/8/path/g_SomethingElse/g_SomethingElse_user.xml

Desired Result

Make masterlist.txt look like:

https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml
https://456789url.domain.com/32/path/f_AnotherName/f_AnotherName_user.xml
https://foo-254.domain.com/8/path/g_SomethingElse/g_SomethingElse_user.xml
https://222blah11.domain.com/19/path/e_BlahBlah/e_BlahBlah_user.xml
[...]

Initial workup

I've looked at sed, but I don't know how to do the find and replace using lines from the two files.

Here's what I have so far, doing the file handling at least:

#!/bin/bash

#...

while read -r line; do
    # there's a new link on each line
    link="${line}"
    # extract the unique part from the end of each line
    grabXML="${link##*/}"
    grabID="${grabXML%_user.xml}"
    # if we cannot grab the ID, then just set it to use the full link so we don't have an empty string
    if [ -n "${grabID}" ]; then
        identifier=${grabID}
    else
        identifier="${line}"
    fi
    
    ## the find and replace here? ##    

# we're done when we've reached the end of the file
done < "masterlist.txt"

Would you please try the following:

#!/bin/bash

# map: unique URL ending (e.g. /f_SomeName/f_SomeName_user.xml) -> updated full URL
declare -A map
while IFS= read -r line; do
    if [[ $line =~ (/[^/]+/[^/]*\.xml)$ ]]; then
        uniq_part="${BASH_REMATCH[1]}"
        map[$uniq_part]=$line
    fi
done < "toupdate.txt"

# copy masterlist.txt line by line, swapping in the updated URL where the ending matches
while IFS= read -r line; do
    if [[ $line =~ (/[^/]+/[^/]*\.xml)$ ]]; then
        uniq_part="${BASH_REMATCH[1]}"
        if [[ -n ${map[$uniq_part]} ]]; then
            line=${map[$uniq_part]}
        fi
    fi
    # printf prints the line verbatim, whatever characters it contains
    printf '%s\n' "$line"
done < "masterlist.txt" > "masterlist_tmp.txt"

# if the result of "masterlist_tmp.txt" is good enough, uncomment the line below
# mv -f -- "masterlist_tmp.txt" "masterlist.txt"

Result:

https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml
https://456789url.domain.com/32/path/f_AnotherName/f_AnotherName_user.xml
https://foo-254.domain.com/8/path/g_SomethingElse/g_SomethingElse_user.xml
https://222blah11.domain.com/19/path/e_BlahBlah/e_BlahBlah_user.xml
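
Before uncommenting the mv line in the script, it may be worth comparing the temporary file with the original. The following is only a sketch of one way to do that, using diff on two small sample files built from the URLs in the question; the temporary directory, the changes.diff file and the variable names are illustrative assumptions, not part of the answer.

```bash
#!/usr/bin/env bash
# Build two tiny sample files in a scratch directory (hypothetical stand-ins
# for masterlist.txt and the generated masterlist_tmp.txt).
tmpdir=$(mktemp -d)
cat > "$tmpdir/masterlist.txt" <<'EOF'
https://123456url.domain.com/26/path/f_SomeName/f_SomeName_user.xml
https://456789url.domain.com/32/path/f_AnotherName/f_AnotherName_user.xml
EOF
cat > "$tmpdir/masterlist_tmp.txt" <<'EOF'
https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml
https://456789url.domain.com/32/path/f_AnotherName/f_AnotherName_user.xml
EOF

# diff exits with 0 when the files are identical and 1 when they differ.
if diff "$tmpdir/masterlist.txt" "$tmpdir/masterlist_tmp.txt" > "$tmpdir/changes.diff"; then
    status="unchanged"
else
    status="changed"
fi

# A line-for-line replacement should leave the number of lines untouched.
old_count=$(wc -l < "$tmpdir/masterlist.txt")
new_count=$(wc -l < "$tmpdir/masterlist_tmp.txt")

printf 'status: %s\n' "$status"
cat "$tmpdir/changes.diff"
rm -r "$tmpdir"
```

If the line counts differ, or the diff shows lines other than the expected server changes, the temporary file probably should not be moved over the original.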

[Explanations]

  • The associative array map maps the "unique part", such as /f_SomeName/f_SomeName_user.xml, to the "full path", such as https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml.
  • The regex (/[^/]+/[^/]*\.xml)$, if matched, assigns to the shell variable BASH_REMATCH[1] the substring from the second-to-last slash to the extension ".xml" at the end of the string.
  • The first loop, over the file "toupdate.txt", generates "unique part" and "full path" pairs as key-value pairs of the associative array.
  • The second loop, over the file "masterlist.txt", tests whether a value is associated with the extracted "unique part". If so, the line is replaced with the associated value, which is the corresponding line from "toupdate.txt".
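
To see what the regex captures, here is a minimal sketch that applies it to a single URL from the question. It assumes Bash 4 or later (for declare -A); storing the pattern in a variable named re is a choice made here for illustration, not something the answer does.

```bash
#!/usr/bin/env bash
# Sample URL taken from toupdate.txt in the question.
url='https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml'

# Same pattern as in the script, kept in a variable (an assumption of this
# sketch) so that it can be reused without quoting surprises.
re='(/[^/]+/[^/]*\.xml)$'

declare -A map
if [[ $url =~ $re ]]; then
    # BASH_REMATCH[1] holds the text matched by the first parenthesised group.
    uniq_part="${BASH_REMATCH[1]}"
    map[$uniq_part]=$url
fi

printf 'key:   %s\n' "$uniq_part"
printf 'value: %s\n' "${map[$uniq_part]}"
```

The key printed is /f_SomeName/f_SomeName_user.xml and the value is the whole URL, which is the pair the first loop stores for every line of toupdate.txt.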

[Alternative]
If the text files are large, bash may not be fast enough. In that case, an awk script will work more efficiently:

awk 'NR==FNR {
    if (match($0, "/[^/]+/[^/]*\\.xml$")) {
        map[substr($0, RSTART, RLENGTH)] = $0
    }
    next
}
{
    if (match($0, "/[^/]+/[^/]*\\.xml$")) {
        full_path = map[substr($0, RSTART, RLENGTH)]
        if (full_path != "") {
            $0 = full_path
        }
    }
    print
}' "toupdate.txt" "masterlist.txt" > "masterlist_tmp.txt"

[Explanations]

  • The NR==FNR { BLOCK1; next } { BLOCK2 } syntax is a common idiom for processing each file separately. Because the NR==FNR condition is true only for the first file in the argument list, and the next statement skips the following block, BLOCK1 processes only the file "toupdate.txt". Similarly, BLOCK2 processes only the file "masterlist.txt".
  • If the function match($0, pattern) succeeds, it sets the awk variable RSTART to the start position of the matched substring within $0, the current record read from the file, and sets the variable RLENGTH to the length of the matched substring. We can then extract the matched substring, such as /f_SomeName/f_SomeName_user.xml, by using the substr() function.
  • Then we assign to the array map so that the substring (the unique part) is mapped to the whole URL in "toupdate.txt".
  • The second block works much like the first. If a value corresponding to the key is found in the array map, the record ($0) is replaced with that value.
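
The two-block idiom and the match()/substr() extraction can be traced with a small sketch. It uses two one-line sample files in a scratch directory; the file names first.txt and second.txt and the BLOCK1/BLOCK2 labels in the output are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Two one-line sample files (hypothetical stand-ins for toupdate.txt and
# masterlist.txt), using URLs from the question.
tmpdir=$(mktemp -d)
printf '%s\n' 'https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml' > "$tmpdir/first.txt"
printf '%s\n' 'https://123456url.domain.com/26/path/f_SomeName/f_SomeName_user.xml' > "$tmpdir/second.txt"

# Print which block handled each record, together with the extracted key.
trace=$(awk '
NR==FNR {
    if (match($0, "/[^/]+/[^/]*\\.xml$")) {
        print "BLOCK1 key=" substr($0, RSTART, RLENGTH)
    }
    next
}
{
    if (match($0, "/[^/]+/[^/]*\\.xml$")) {
        print "BLOCK2 key=" substr($0, RSTART, RLENGTH)
    }
}' "$tmpdir/first.txt" "$tmpdir/second.txt")

printf '%s\n' "$trace"
rm -r "$tmpdir"
```

Both lines yield the same key even though their servers differ, which is what lets the second block look up the replacement. One caveat worth keeping in mind: if the first file is empty, NR==FNR is also true while the second file is read, so the idiom relies on the first file having at least one line.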

Why not have sed write its own script, producing the desired output:

sed -e "$(sed -e 's<^\(http[s]*://[^/]*/[^/]*/\)\(.*\)<\\|\2\$| s|.*|\1\2|<' toupdate.txt)" masterlist.txt

where

  • the inner sed command contains an outer and an inner s (substitute) command
  • the outer s ( s<...<...< ) captures scheme://domain/N/ as \1 and the rest of the path, \(.*\), as \2, and inserts them into a script for the outer sed command
  • the outer sed script ( \|\2$| s|.*|\1\2| ) finds URLs in masterlist.txt ending in the rest of the path and substitutes (inner s) the new URL from toupdate.txt
  • to avoid a lot of backslash-escaping, < and | are used as delimiters for the two s commands, and \|...| is used in place of /.../ for the address
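
Running the inner sed command on its own shows the script it generates for the outer sed. This is a sketch using the two toupdate.txt lines from the question, written to a scratch directory; the tmpdir and generated variable names are illustrative assumptions.

```bash
#!/usr/bin/env bash
# Recreate the sample toupdate.txt from the question in a scratch directory.
tmpdir=$(mktemp -d)
cat > "$tmpdir/toupdate.txt" <<'EOF'
https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml
https://foo-254.domain.com/8/path/g_SomethingElse/g_SomethingElse_user.xml
EOF

# The inner sed command only: each input line becomes one
# "address + s command" line of the generated script.
generated=$(sed -e 's<^\(http[s]*://[^/]*/[^/]*/\)\(.*\)<\\|\2\$| s|.*|\1\2|<' "$tmpdir/toupdate.txt")

printf '%s\n' "$generated"
rm -r "$tmpdir"
```

For the first line, the generated command is \|path/f_SomeName/f_SomeName_user.xml$| s|.*|https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml|. Because the URL text is pasted into a regex and a replacement unescaped, each dot in the path matches any character, and a URL containing | or & would need extra escaping; with the URLs shown in the question this does not cause problems.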
