
Linkify text file in Linux

I have extracted all lines containing URLs from a text file and appended a line break to each one. Now I want to make the links clickable in a new file.

How do I wrap only the URLs in <a href> tags, using standard Linux tools, preferably awk? It needs to be automatable with cron.

For example,

source file chaturls.txt :

    12:30 <user> check this: https://link.to/stuff.jpg</br>
    13:47 <user4> https://another.link.lol eyyyy</br>

desired output in new file, chatlinkified.html :

    12:30 <user> check this: <a href='https://link.to/stuff.jpg'>https://link.to/stuff.jpg</a></br>
    13:47 <user4> <a href='https://another.link.lol'>https://another.link.lol</a> eyyyy</br>

I tried awk '{printf "<a href=\"%s\">%s</a><br>", $0,$0}' chaturls.txt > chatlinkified.html , but this turns the whole line into an (invalid) clickable link.

    sed -E 's@(https?://[^[:space:]/$.?#].[^[:space:]<]*)@<a href="\1">\1</a>@g' chaturls.txt > chatlinkified.html

You can use sed and refer back to the matched group with \1 . NB: here I use @ as the delimiter instead of / (as in s/../../g). You are free to use any character, and this choice saves escaping the slashes in the URL pattern.

The regex for finding the URL does a small validation check on the first character after the https?:// (it must not be whitespace or one of / $ . ? #) and then continues matching until it reaches whitespace or the opening bracket of another tag.
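To make the effect of that first-character check concrete, here is a small sketch that wraps the same sed command in a shell function and feeds it two sample lines. The function name linkify_sed and the second sample line are made up for illustration.

```shell
# linkify_sed: the sed substitution from this answer, reading stdin.
linkify_sed() {
  sed -E 's@(https?://[^[:space:]/$.?#].[^[:space:]<]*)@<a href="\1">\1</a>@g'
}

# The first line holds a plausible URL and gets wrapped.
# The second starts its host part with a dot, so the check rejects it
# and the line passes through unchanged.
printf '%s\n' 'see https://link.to/stuff.jpg</br>' 'see https://.nope</br>' | linkify_sed
```

The first line should come out with the URL wrapped in an anchor tag and the trailing </br> left outside the link, because the match stops at the < character.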

If you prefer, you can use a simpler regex for the URL, such as the one given in one of the comments, (https?://[^ ]*) , which skips this small validation.
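Since the question prefers awk, roughly the same substitution can be sketched with gsub, where & in the replacement string stands for the matched text and plays the role of \1 in the sed version. This is only a sketch under assumptions: the function name linkify is made up, and the pattern is a simple variant that stops at a space, a tab, or the < of the next tag.

```shell
# linkify: wrap every http(s) URL read from stdin in an <a href> tag.
linkify() {
  awk '{
    # "&" in the replacement is the text matched by the regex.
    gsub(/https?:\/\/[^ \t<]+/, "<a href=\"&\">&</a>")
    print
  }'
}

# Demo on one of the sample lines from the question.
printf '%s\n' '13:47 <user4> https://another.link.lol eyyyy</br>' | linkify
```

For the real files this would be called as linkify < chaturls.txt > chatlinkified.html .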

You can find more thoroughly validated URL regexes here: https://mathiasbynens.be/demo/url-regex (but you have to convert them from PHP regex syntax to sed extended regex syntax).
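For the cron requirement, one possible arrangement is a small wrapper that writes to a temporary file and then renames it over the target, so a half-written page is never served. The function name linkify_file, the temporary file naming, the script path, and the five-minute schedule are all assumptions for illustration.

```shell
# linkify_file SRC DST: linkify SRC into a temporary file next to DST,
# then rename it over DST only if sed succeeded.
linkify_file() {
  src=$1
  dst=$2
  tmp="$dst.tmp.$$"
  sed -E 's@(https?://[^[:space:]/$.?#].[^[:space:]<]*)@<a href="\1">\1</a>@g' "$src" > "$tmp" &&
    mv "$tmp" "$dst"
}

# Hypothetical crontab entry for a script that calls linkify_file
# (the path and the schedule are placeholders):
# */5 * * * * /home/user/bin/linkify.sh
```

Inside such a script the call would be linkify_file chaturls.txt chatlinkified.html , ideally with absolute paths, since cron jobs do not start in your working directory.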
