
How to search and replace long HTML in multiple files on Linux

I need to recursively find all files with this HTML:

<html id="blx-5fb3c619e82a2863d6567c52-000000001" class="blx-5fb3c619e82a2863d6567c52"><head>
    <meta charset="utf-8">
    <meta name="google" value="notranslate">

And replace it with this HTML:

<html id="blx-5fb3c619e82a2863d6567c52-000000001" class="blx-5fb3c619e82a2863d6567c52"><head>
    <meta charset="utf-8">
    <meta name="google" value="notranslate">
    <meta name="format-detection" content="telephone=no">
    <meta name="format-detection" content="date=no">
    <meta name="format-detection" content="address=no">
    <meta name="format-detection" content="email=no">

This is my unsuccessful attempt, a grep command piped to sed:

grep --include="index.html" -PRwzl -e '<html id="blx-5fb3c619e82a2863d6567c52-000000001" class="blx-5fb3c619e82a2863d6567c52"><head>\n    <meta charset="utf-8">\n    <meta name="google" value="notranslate">\n' | xargs -i@ sed -i 's/<html id="blx-5fb3c619e82a2863d6567c52-000000001" class="blx-5fb3c619e82a2863d6567c52"><head>\n    <meta charset="utf-8">\n    <meta name="google" value="notranslate">\n/<html id="blx-5fb3c619e82a2863d6567c52-000000001" class="blx-5fb3c619e82a2863d6567c52"><head>\n    <meta charset="utf-8">\n    <meta name="google" value="notranslate">\n    <meta name="google" value="notranslate">\n    <meta name="format-detection" content="telephone=no">\n    <meta name="format-detection" content="date=no">\n    <meta name="format-detection" content="address=no">\n    <meta name="format-detection" content="email=no">\n/g' @

The grep command alone works perfectly.

For clarity, here is the same command split across several lines:

grep --include="index.html" \
    -PRwzl \
    -e '<html id="blx-5fb3c619e82a2863d6567c52-000000001" class="blx-5fb3c619e82a2863d6567c52"><head>
        \n    <meta charset="utf-8">
        \n    <meta name="google" value="notranslate">
        \n' \
    | xargs -i@ sed -i 's/<html id="blx-5fb3c619e82a2863d6567c52-000000001" class="blx-5fb3c619e82a2863d6567c52"><head>
                            \n    <meta charset="utf-8">
                            \n    <meta name="google" value="notranslate">
                            \n
                        /<html id="blx-5fb3c619e82a2863d6567c52-000000001" class="blx-5fb3c619e82a2863d6567c52"><head>
                            \n    <meta charset="utf-8">
                            \n    <meta name="google" value="notranslate">
                            \n    <meta name="google" value="notranslate">
                            \n    <meta name="format-detection" content="telephone=no">
                            \n    <meta name="format-detection" content="date=no">
                            \n    <meta name="format-detection" content="address=no">
                            \n    <meta name="format-detection" content="email=no">
                            \n
                        /g' @
                        

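Your command is more complex than it needs to be. It also fails for a specific reason: by default sed reads its input one line at a time, so a pattern that contains \n never matches. You can run sed directly on the file, without the grep and xargs in front of it. Typically an in-place edit of a file with sed looks like: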

sed -i 's/TO_FIND/REPLACE/' FILE.txt
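To illustrate why the multi-line pattern in the question never matches, here is a minimal sketch, assuming GNU sed (the -z option is a GNU extension). The sample file content and the variable names demo_dir, line_mode and whole_file_mode are made up for the demonstration.

```shell
#!/bin/bash
# Hypothetical sample file in a throwaway directory.
demo_dir=$(mktemp -d)
printf '<head>\n    <meta charset="utf-8">\n</head>\n' > "$demo_dir/index.html"

# Default mode: sed sees one line at a time, so the pattern space never
# contains a newline and the multi-line pattern cannot match.
line_mode=$(sed 's#<head>\n    <meta charset="utf-8">#REPLACED#' "$demo_dir/index.html")

# GNU sed -z: records are separated by NUL instead of newline, so the whole
# file is a single record and \n in the pattern can match.
whole_file_mode=$(sed -z 's#<head>\n    <meta charset="utf-8">#REPLACED#' "$demo_dir/index.html")

printf 'default mode:\n%s\n' "$line_mode"
printf 'with -z:\n%s\n' "$whole_file_mode"

rm -rf "$demo_dir"
```

In default mode the file comes back unchanged, while with -z the first two lines are replaced. Reading the whole file as one record is fine for small HTML pages but may not suit very large files.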

Also note that sed is not a great tool for editing HTML. See the question "RegEx match open tags except XHTML self-contained tags".

That being said, I propose this script to meet your requirement.

#!/bin/bash
#
# Add the format-detection meta tags before </head> in every matching HTML file.
# Requires GNU sed (-i and \n in the replacement).
find . -type f -name "*.html" -print0 | while IFS= read -r -d '' file
do
    # grep -q only tests for the marker; "$file" is quoted so paths with spaces work.
    if grep -q 'id="blx-5fb3c619e82a2863d6567c52-000000001" class="blx-5fb3c619e82a2863d6567c52"' "$file"
    then
        # Add the content...
        echo "Adding in file $file"
        sed -i 's#</head>#    <meta name="format-detection" content="telephone=no">\n    <meta name="format-detection" content="date=no">\n    <meta name="format-detection" content="address=no">\n    <meta name="format-detection" content="email=no">\n    </head>#' "$file"
    else
        echo "Nothing to do on $file"
    fi
done
  • Using find with while and read covers HTML files in sub-directories; -print0 together with read -d '' keeps file names with spaces intact.
  • The grep has been greatly simplified. The presence of the id and class values is enough to identify the files to edit.
  • In the sed, you only need to add the new lines. Your sed replaced the existing lines with those same lines.
  • I used # as the separator in sed instead of / to avoid clashing with the slashes in the HTML code.
  • This is based on a file I created myself, since you did not provide a sample. You should provide samples in your questions.
  • The order of tags within the <head> section is not relevant, so adding lines just before the closing </head> works.
  • Obviously the else section is optional.
  • <opinion> I find this type of script easier to understand and debug in the future than long one-liners. </opinion>
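One thing the script does not guard against is being run twice, which would append the four tags a second time. A possible guard is sketched below, assuming GNU sed. The function name add_format_detection and the sample file are hypothetical and not part of the original answer.

```shell
#!/bin/bash
# Hypothetical helper: patch one file only if it carries the marker and has
# not been patched already.
add_format_detection() {
    local file=$1
    # Same marker the script greps for.
    if ! grep -q 'id="blx-5fb3c619e82a2863d6567c52-000000001" class="blx-5fb3c619e82a2863d6567c52"' "$file"; then
        return 1
    fi
    # Skip files that already contain a format-detection tag.
    if grep -q 'name="format-detection"' "$file"; then
        return 0
    fi
    sed -i 's#</head>#    <meta name="format-detection" content="telephone=no">\n    <meta name="format-detection" content="date=no">\n    <meta name="format-detection" content="address=no">\n    <meta name="format-detection" content="email=no">\n    </head>#' "$file"
}

# Usage on a throwaway sample file (made-up content).
demo_dir=$(mktemp -d)
printf '<html id="blx-5fb3c619e82a2863d6567c52-000000001" class="blx-5fb3c619e82a2863d6567c52"><head>\n    <meta charset="utf-8">\n</head>\n</html>\n' > "$demo_dir/index.html"

add_format_detection "$demo_dir/index.html"
add_format_detection "$demo_dir/index.html"   # second call changes nothing

tag_count=$(grep -c 'name="format-detection"' "$demo_dir/index.html")
echo "format-detection tags: $tag_count"
rm -rf "$demo_dir"
```

Inside the find loop, the body of the if could then be reduced to a single call to this function.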

Assuming that index.html is:

<html id="blx-5fb3c619e82a2863d6567c52-000000001" class="blx-5fb3c619e82a2863d6567c52"><head>
    <meta charset="utf-8">
    <meta name="google" value="notranslate">
    <title>TITRE</title>
</head>
    <body>
        <p>PARAGRAPH</p>
    </body>
</html>

The result is:

<html id="blx-5fb3c619e82a2863d6567c52-000000001" class="blx-5fb3c619e82a2863d6567c52"><head>
    <meta charset="utf-8">
    <meta name="google" value="notranslate">
    <title>TITRE</title>
    <meta name="format-detection" content="telephone=no">
    <meta name="format-detection" content="date=no">
    <meta name="format-detection" content="address=no">
    <meta name="format-detection" content="email=no">
    </head>
    <body>
        <p>PARAGRAPH</p>
    </body>
</html>
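Before letting sed -i modify real files, it may be worth previewing the change. The sketch below runs the substitution without -i, so sed writes to standard output and the file itself stays untouched, then compares that output with the file using diff. It assumes GNU sed; the sample file, the shortened replacement and the variable names are made up for the demonstration.

```shell
#!/bin/bash
# Hypothetical sample file in a throwaway directory.
demo_dir=$(mktemp -d)
printf '<head>\n    <title>TITRE</title>\n</head>\n' > "$demo_dir/index.html"

# No -i here: sed prints the edited text, and diff compares it with the file.
# diff exits with status 1 when it finds differences, hence the "|| true".
preview=$(sed 's#</head>#    <meta name="format-detection" content="telephone=no">\n    </head>#' "$demo_dir/index.html" | diff "$demo_dir/index.html" - || true)
echo "$preview"

# The file on disk is still the original: it contains no format-detection tag.
tags_in_file=$(grep -c 'format-detection' "$demo_dir/index.html" || true)
echo "tags in file after preview: $tags_in_file"

rm -rf "$demo_dir"
```

Lines starting with > in the diff output are the ones the real run would add.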
