Looping through a text file containing domains using bash script

I have written a script that reads the href attributes on a webpage, extracts the links, and writes them to a text file. The text file now contains links such as these:

http://news.bbc.co.uk/2/hi/health/default.stm
http://news.bbc.co.uk/weather/
http://news.bbc.co.uk/weather/forecast/8?area=London
http://newsvote.bbc.co.uk/1/shared/fds/hi/business/market_data/overview/default.stm
http://purl.org/dc/terms/
http://static.bbci.co.uk/bbcdotcom/0.3.131/style/3pt_ads.css
http://static.bbci.co.uk/frameworks/barlesque/2.8.7/desktop/3.5/style/main.css
http://static.bbci.co.uk/frameworks/pulsesurvey/0.7.0/style/pulse.css
http://static.bbci.co.uk/wwhomepage-3.5/1.0.48/css/bundles/ie6.css
http://static.bbci.co.uk/wwhomepage-3.5/1.0.48/css/bundles/ie7.css
http://static.bbci.co.uk/wwhomepage-3.5/1.0.48/css/bundles/ie8.css
http://static.bbci.co.uk/wwhomepage-3.5/1.0.48/css/bundles/main.css
http://static.bbci.co.uk/wwhomepage-3.5/1.0.48/img/iphone.png
http://www.bbcamerica.com/
http://www.bbc.com/future
http://www.bbc.com/future/
http://www.bbc.com/future/story/20120719-how-to-land-on-mars
http://www.bbc.com/future/story/20120719-road-opens-for-connected-cars
http://www.bbc.com/future/story/20120724-in-search-of-aliens
http://www.bbc.com/news/

I would like to count them by domain, so that I get something like:

http://www.bbc.com : 6
http://static.bbci.co.uk: 15

The values on the right indicate the number of times the domain appears in the file. How can I achieve this in bash, given that I would have a loop going through the file? I am a newbie to bash shell scripting.

$ cut -d/ -f-3 urls.txt | sort | uniq -c                  
3 http://news.bbc.co.uk
1 http://newsvote.bbc.co.uk
1 http://purl.org
8 http://static.bbci.co.uk
1 http://www.bbcamerica.com
6 http://www.bbc.com
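
The question mentions looping over the file. The pipeline above avoids an explicit loop, but for comparison, here is a minimal sketch of how a pure-bash loop could do the same counting. It assumes bash 4 or newer (for associative arrays); the function name count_domains is made up for this example and is not part of any tool.

```bash
#!/usr/bin/env bash
# Sketch: count scheme://host prefixes with an explicit loop.
# Assumes bash 4+ because it relies on associative arrays.

# count_domains is a hypothetical helper name used only in this example.
count_domains() {
    local file=$1 url scheme rest domain
    local -A counts=()

    # "|| [[ -n $url ]]" also picks up a last line with no trailing newline.
    while IFS= read -r url || [[ -n $url ]]; do
        # Skip blank lines and anything that does not contain "://".
        [[ $url == *://* ]] || continue

        scheme=${url%%://*}              # text before the first "://"
        rest=${url#*://}                 # text after the first "://"
        domain="$scheme://${rest%%/*}"   # keep the host, drop the path

        counts[$domain]=$(( ${counts[$domain]:-0} + 1 ))
    done < "$file"

    # Print in the "domain : count" layout asked for in the question.
    for domain in "${!counts[@]}"; do
        printf '%s : %d\n' "$domain" "${counts[$domain]}"
    done | sort
}

# Run only when a file name is given on the command line.
if (( $# > 0 )); then
    count_domains "$1"
fi
```

Saved as a script (say count_domains.sh, a hypothetical name), it could be run as: bash count_domains.sh urls.txt. On large files the cut/sort/uniq pipeline is likely to be faster, since the loop runs entirely inside the shell.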

You can do it like this:

grep -Eo '^http://[^/]+' domain.txt | sort | uniq -c

Output of this on your example data:

3 http://news.bbc.co.uk
1 http://newsvote.bbc.co.uk
1 http://purl.org
8 http://static.bbci.co.uk
6 http://www.bbc.com
1 http://www.bbcamerica.com

This works even when a line is a bare URL with no trailing slash, so

http://www.bbc.com/news
http://www.bbc.com/
http://www.bbc.com

will all be in the same group.

If you want to allow https, then you can write:

grep -Eo '^https?://[^/]+' domain.txt | sort | uniq -c

If other protocols are possible, such as ftp, you can be very loose and accept anything before the ://. Note that schemes written without //, such as mailto:, will not match this pattern:

grep -Eo '^[^:]+://[^/]+' domain.txt | sort | uniq -c
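
The pipelines above print the count first, because that is how uniq -c lays out its output, while the question asked for a "domain : count" layout. One way to get that layout, sketched here under the assumption that the input holds one URL per line, is to sort by count and swap the two columns with awk. The function name domain_counts is invented for this example.

```bash
#!/bin/sh
# Sketch: print "scheme://host : count", most frequent first.
# domain_counts is a hypothetical helper name used only in this example.
domain_counts() {
    grep -Eo '^[^:]+://[^/]+' "$1" |   # keep scheme://host, drop the path
        sort |                          # group identical prefixes together
        uniq -c |                       # prefix each group with its count
        sort -rn |                      # highest count first
        awk '{ print $2 " : " $1 }'     # swap columns into "domain : count"
}

# Run only when a file name is given on the command line.
if [ "$#" -gt 0 ]; then
    domain_counts "$1"
fi
```

The awk step relies on the extracted prefixes containing no whitespace, which holds for well-formed URLs.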

