How to remove `<a href="file://a">`keep this text`</a>` using sed or perl?

Question

How can I remove every <a href="file://???"> keep this text </a> wrapper while keeping its text, but leave the other <a>...</a> elements untouched, using sed or perl?
Is:

    <p><a class="a" href="file://any" id="b">keep this text</a>, <a href="http://example.com/abc">example.com/abc</a>, more text</p>

Should be:

    <p>keep this text, <a href="http://example.com/abc">example.com/abc</a>, more text</p>

I have a regex like this, but it is too greedy and removes every </a>:

    gsed -E -i 's/<a*href="file:[^>]*>(.+?)<\/a>/\1>/g' file.xhtml

Answer 1

One way:

    sed -E 's,<a [^>]*href="file://[^>]*>([^<]*)</a>,\1,g'

<a [^>]*href="file://[^>]*> matches <a + a space + any number of non- > characters, followed by href="file:// + any number of non- > characters, followed by >. sed has no non-greedy quantifier (in GNU sed -E, .*? behaves like a greedy .*), so [^>]* is what keeps the match inside a single tag.
([^<]*) matches and captures any number of non- < characters
</a> matches the closing tag

Everything matched is replaced by the capture in \1, and the trailing g makes it perform the substitution on every occurrence on each line.

Examples:

    $ cat data
    <p><a class="a" href="file://any" id="b">keep this text</a>, <a id="file:ex" href="http://example.com/abc">example.com/abc</a>, more text</p>
    <p><a href="file://any" class="f">keep this text</a>, <a href="http://example.com/abc">example.com/abc</a>, more text</p>

    $ sed -E 's,<a [^>]*href="file://[^>]*>([^<]*)</a>,\1,g' < data
    <p>keep this text, <a id="file:ex" href="http://example.com/abc">example.com/abc</a>, more text</p>
    <p>keep this text, <a href="http://example.com/abc">example.com/abc</a>, more text</p>
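
Since sed's regex dialects have no lazy quantifier, a pattern of the form `<a.*?href=...` is still greedy. The following is a minimal sketch of what that means when two file:// links share a line. The test line is made up for illustration, and the greedy behaviour described in the comments is what I would expect from GNU sed; other sed implementations may reject `.*?` outright.

```shell
# Hypothetical test line: two file:// links on the same line.
line='<p><a href="file://x">A</a> and <a href="file://y">B</a></p>'

# ".*?" is not lazy in sed. With GNU sed it acts like a greedy ".*", so the
# match is expected to run from the first <a to the LAST href="file:// and
# swallow the first link's text along the way.
greedy=$(printf '%s\n' "$line" | sed -E 's,<a.*?href="file://[^>]*>([^<]*)</a>,\1,g')

# Bounding the attribute part with [^>]* keeps every match inside one tag,
# so each link is unwrapped separately.
bounded=$(printf '%s\n' "$line" | sed -E 's,<a [^>]*href="file://[^>]*>([^<]*)</a>,\1,g')

printf 'greedy:  %s\n' "$greedy"
printf 'bounded: %s\n' "$bounded"
```

The bounded form leaves `<p>A and B</p>`, which is the reason to prefer negated bracket expressions over `.*` whenever several links can appear on one line.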

Answer 2

Assumptions:

OP does not have access to an HTML-centric tool
remove the <a href="file:..."> ...some_text... </a> wrappers leaving just ...some_text...
only apply to file: entries

Sample data showing multiple file: entries interspersed with some other (nonsensical) entries:

    $ cat sample.html
    <p><a href="https:/google.com">some text</a><a href="file://any" >keep this text</a>, <a href="http://example.com/abc">example.com/abc</a>, more text</p><a href="file://anyother" >keep this text,too</a>, last test</p>

One sed idea to remove the wrappers for all file: entries:

    sed -E 's|<a[^<>]+file:[^>]+>([^<]+)</a>|\1|g' sample.html

NOTE: some of the [^..] bracket expressions may be overkill, but the key objective is to short-circuit sed's default greedy matching so that a match cannot run past the end of a tag or into the next one.

This leaves:

    <p><a href="https:/google.com">some text</a>keep this text, <a href="http://example.com/abc">example.com/abc</a>, more text</p>keep this text,too, last test</p>
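
One caveat worth checking: the pattern above looks for `file:` anywhere inside the opening tag, not specifically in `href`. The sketch below compares it with a variant anchored on `href="file:`. The input line is a made-up example (it reuses the `id="file:ex"` attribute from Answer 1's data), and the stricter variant is my own suggestion, not part of the original answer.

```shell
# Hypothetical line: an http link whose id happens to contain "file:",
# followed by a real file:// link.
line='<a id="file:ex" href="http://example.com/abc">example.com/abc</a>, <a href="file://any" >keep this text</a>'

# Pattern from the answer: "file:" may appear in any attribute.
loose=$(printf '%s\n' "$line" | sed -E 's|<a[^<>]+file:[^>]+>([^<]+)</a>|\1|g')

# Assumed stricter variant: require "file:" to start the href value.
strict=$(printf '%s\n' "$line" | sed -E 's|<a[^<>]+href="file:[^>]+>([^<]+)</a>|\1|g')

printf 'loose:  %s\n' "$loose"
printf 'strict: %s\n' "$strict"
```

The loose pattern unwraps both links, while the strict one keeps the http link intact. If no other attribute in the data can contain `file:`, the original pattern is sufficient.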

Answer 3

In case the <a> element spans multiple lines, how about a perl solution:

    perl -0777 -i -pe 's#<a\s[^>]*href="?file:[^>]*>(.+?)</a>#$1#gs' file.xhtml

The -0777 option tells perl to slurp the whole file.
The -i option enables in-place editing.
The s modifier at the end of the s operator makes a dot match any character, including a newline.
The regex .+? is the non-greedy version of .+ and gives the shortest match for the link text.
The [^>]* parts keep the opening-tag match inside a single tag; a non-greedy .+? in that position could still run past a preceding non-file link until it reaches href="file.

3 answers

solution1
0 2021-11-04 21:38:28

solution2
0 2021-11-04 22:20:31

solution3
0 2021-11-05 01:07:40
