
How to grep for a URL in a file?

For example, I have a huge HTML file that contains an image URL: http://ex.example.com/hIh39j+ud9wr4/Uusfh.jpeg

I want to extract this URL, assuming it's the only URL in the entire file.

cat file.html | grep -o 'http://ex[a-zA-Z.-]*/[a-zA-Z.-]*/[a-zA-Z.,-]*'

This works only if the URL doesn't contain plus signs.

How do I make it work for + signs as well?

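You missed the character class 0-9 and the + itself (also, that is a useless use of cat). Keep the hyphen last inside each bracket expression so it is taken literally instead of starting a range:

grep -o 'http://ex[a-zA-Z.-]*/[a-zA-Z0-9+-]*/[a-zA-Z0-9.,+-]*' file.html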
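A quick way to compare the question's pattern with one that also accepts digits and plus signs is to run both against the sample URL. This is only a minimal sketch: the `<img>` line and the variable names `line`, `before`, and `after` are made up for illustration, and a `grep` that supports `-o` (GNU or BSD) is assumed.

```shell
# Hypothetical sample line built around the URL from the question.
line='<img src="http://ex.example.com/hIh39j+ud9wr4/Uusfh.jpeg">'

# Question's pattern: no digits and no plus sign in the path segments.
# grep exits non-zero when nothing matches, hence the "|| true".
before=$(printf '%s\n' "$line" |
  grep -o 'http://ex[a-zA-Z.-]*/[a-zA-Z.-]*/[a-zA-Z.,-]*' || true)

# Pattern with 0-9 and + added; the hyphen stays last in each class.
after=$(printf '%s\n' "$line" |
  grep -o 'http://ex[a-zA-Z.-]*/[a-zA-Z0-9+-]*/[a-zA-Z0-9.,+-]*' || true)

printf 'before: %s\nafter:  %s\n' "$before" "$after"
```

With this input, `before` comes out empty because the first path segment contains digits and a plus sign, while `after` holds the complete URL.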

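A slight improvement: use -i for case insensitivity and match only .jpg or .jpeg images. The extension has to be written as \.jpe?g with -E (extended regular expressions), because [.jpe?g] would match just one character from that set:

grep -Eio 'http://ex[a-z.-]*/[a-z0-9+-]*/[a-z0-9.,+-]*\.jpe?g' file.html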

Or how about just:

grep -Eio 'http://ex\.example[^"[:space:]]*\.jpe?g' file.html
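When restricting the match to .jpg or .jpeg, keep in mind that a bracket expression such as `[.jpe?g]` matches a single character from the set, not the text of the extension. Below is a sketch of spelling the extension out with extended regular expressions (`-E`), where `\.` is a literal dot and `e?` an optional `e`. The sample lines and the variable name `matches` are hypothetical.

```shell
# Hypothetical sample input: two image URLs and one non-image URL.
sample='http://ex.example.com/hIh39j+ud9wr4/Uusfh.jpeg
http://ex.example.com/abc123/Photo.JPG
http://ex.example.com/abc123/notes.txt'

# -E: extended regex, -i: ignore case, -o: print only the matched part.
# grep exits non-zero when nothing matches, hence the "|| true".
matches=$(printf '%s\n' "$sample" |
  grep -Eio 'http://ex[a-z.-]*/[a-z0-9+-]*/[a-z0-9.,+-]*\.jpe?g' || true)

printf '%s\n' "$matches"
```

Only the first two lines are printed; the `.txt` URL is skipped because it never reaches a `.jpg` or `.jpeg` ending.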

The following fixes your regular expression for this specific case (including numbers and plus-signs):

http://ex[a-zA-Z.-]*/[a-zA-Z0-9.+-]*/[a-zA-Z0-9.+-]*

Demonstration:

echo "For example, I have a huge HTML file that contains img URL: http://ex.example.com/hIh39j+ud9wr4/Uusfh.jpeg" | grep -o 'http://ex[a-zA-Z.-]*/[a-zA-Z0-9.+-]*/[a-zA-Z0-9.+-]*'

output:

http://ex.example.com/hIh39j+ud9wr4/Uusfh.jpeg

This does not extract all valid URLs. There are plenty of other answers on this site about URL matching.
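For input that may hold more than one URL, a looser pattern can serve as a starting point. The sketch below is a rough approximation rather than a full URL grammar: it assumes that a URL runs from `http://` or `https://` up to the next quote, angle bracket, or whitespace. The HTML fragment and the name `urls` are invented for illustration.

```shell
# Hypothetical HTML fragment containing two URLs.
html='<a href="https://example.org/page?id=7">link</a>
<img src="http://ex.example.com/hIh39j+ud9wr4/Uusfh.jpeg">'

# Match the scheme, then every character up to a quote, <, > or whitespace.
# grep exits non-zero when nothing matches, hence the "|| true".
urls=$(printf '%s\n' "$html" |
  grep -Eo "https?://[^\"'<>[:space:]]+" || true)

printf '%s\n' "$urls"
```

This prints each URL on its own line. URLs that legitimately contain one of the excluded characters would be cut short, so the character class may need adjusting for other inputs.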
