How to grep for a URL in a file?

Question

For example, I have a huge HTML file that contains an img URL: http://ex.example.com/hIh39j+ud9wr4/Uusfh.jpeg

I want to get this URL, assuming it's the only url in the entire file.

cat file.html | grep -o 'http://ex[a-zA-Z.-]*/[a-zA-Z.-]*/[a-zA-Z.,-]*'

This works only if the URL doesn't have the plus signs.

How do I make it work for + signs as well?

Answer 1

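Your character classes are missing the digits 0-9 and the + sign (the cat is also unnecessary, since grep can read the file directly). Keep the hyphen last inside each bracket expression so that it is taken literally rather than as a range operator:

grep -o 'http://ex[a-zA-Z.-]*/[a-zA-Z0-9+-]*/[a-zA-Z0-9.,+-]*' file.html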

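A slight improvement: use -i for case insensitivity and match only .jpg or .jpeg images. This needs -E so that ? makes the e optional, and an escaped dot so that it matches a literal period:

grep -Eio 'http://ex[a-z.-]*/[a-z0-9+-]*/[a-z0-9.,+-]*\.jpe?g' file.html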

Or, more simply (note that .* is greedy, so this runs up to the last .jpg or .jpeg on the line):

grep -Eio 'http://ex\.example.*\.jpe?g' file.html
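To check the extended-regex form without touching a real file, here is a minimal sketch that pipes a one-line sample through grep. The sample HTML line and the variable names html and url are made up for illustration; only the URL comes from the question.

```shell
# Hypothetical one-line sample standing in for file.html.
html='<p>intro</p><img src="http://ex.example.com/hIh39j+ud9wr4/Uusfh.jpeg" alt="x">'

# -E: extended regex, so ? makes the preceding e optional.
# -i: case-insensitive, so a-z also covers A-Z.
# -o: print only the matched text, not the whole line.
# \. matches a literal dot; + and - sit at the end of each
# bracket expression so they are taken literally.
url=$(printf '%s\n' "$html" |
  grep -Eio 'http://ex[a-z.-]*/[a-z0-9+-]*/[a-z0-9.,+-]*\.jpe?g')

printf '%s\n' "$url"
```

Because the last bracket expression excludes the double quote, the match cannot run past the end of the src attribute, which the .* variant does not guarantee.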

Answer 2

The following fixes your regular expression for this specific case (including numbers and plus-signs):

http://ex[a-zA-Z.-]*/[a-zA-Z0-9.+-]*/[a-zA-Z0-9.+-]*

Demonstration:

echo "For example, I have a huge HTML file that contains img URL: http://ex.example.com/hIh39j+ud9wr4/Uusfh.jpeg" | grep -o 'http://ex[a-zA-Z.-]*/[a-zA-Z0-9.+-]*/[a-zA-Z0-9.+-]*'

output:

http://ex.example.com/hIh39j+ud9wr4/Uusfh.jpeg

Note that this pattern is tailored to this one URL and will not extract every valid URL. There are plenty of other answers on this site about URL matching.
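For a somewhat broader match, one common approach is a negated bracket expression that stops at the first quote, angle bracket, or whitespace. This is only a rough sketch, not a full URL parser: the sample line, the example.org link, and the variable names html and urls are assumptions for illustration.

```shell
# Hypothetical sample line with two URLs (the example.org one is made up).
html='<a href="https://example.org/a?b=1">x</a> <img src="http://ex.example.com/hIh39j+ud9wr4/Uusfh.jpeg">'

# https? matches both http and https (needs -E).
# [^"'<>[:space:]]+ consumes everything up to the first quote,
# angle bracket, or whitespace character.
urls=$(printf '%s\n' "$html" | grep -Eo "https?://[^\"'<>[:space:]]+")

# grep -o prints each match on its own line.
printf '%s\n' "$urls"
```

This still misses URLs that are unquoted and followed by punctuation, and it accepts strings that are not valid URLs, so treat it as a starting point.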