
Get URLs from a remote page and then download them to a txt file

I have tried lots of suggestions but I can't find a solution (I don't know if it's even possible). I am using the terminal on Ubuntu 15.04.

I need to download into a text file all of the internal and external links on mywebsite.com that start with links_, for example http://www.mywebsite.com/links_sony.aspx. I don't need any of the other links, e.g. mywebsite.com/index.aspx or conditions.asp. I currently use:

wget --spider --recursive --no-verbose --output-file="links.csv" http://www.mywebsite.com

Can you help me please? Thanks in advance

If you don't mind using a couple of other tools to coax wget, then you can try this bash script that employs awk, grep, wget and lynx:

#!/bin/bash
# Dump the page ($1), pull out the URLs, and keep only those matching the pattern ($2)
lynx --dump "$1" | awk '/http/{print $2}' | grep "$2" > /tmp/urls.txt
# Download each matching URL
while read -r url; do wget "$url"; done < /tmp/urls.txt

Save the above script as getlinks, make it executable (chmod +x getlinks), and then run it as

./getlinks 'http://www.mywebsite.com' 'links_'

The list of matching URLs is written to /tmp/urls.txt, and wget downloads each one into the current directory.

This approach does not pull in anything exotic; it simply reuses commonly available tools.

You may have to play with the quoting depending on which shell you are using. The above works in standard bash and does not depend on specific versions of these tools.

You could customize the

wget "$url"

line with appropriate switches to meet your specific needs, such as recursion, spidering, verbosity, etc. Insert those switches between wget and "$url".
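
For instance, here is a minimal sketch of that loop with the spider/verbosity switches from the question added, assuming you only want a crawl log appended to links.csv rather than the files themselves (the links.csv name is just carried over from the question):

# Sketch: same download loop, but only spidering each URL and appending wget's log to links.csv
while read -r url; do
    wget --spider --no-verbose --append-output="links.csv" "$url"
done < /tmp/urls.txt

With --spider nothing is actually saved to disk; drop that switch if you do want the files downloaded.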
