
Get URLs from a remote page and then download them to a txt file

I have tried lots of suggestions but I can't find a solution (I don't know if it's even possible). I am using the terminal on Ubuntu 15.04.

I need to download into a text file all of the internal and external links on mywebsite.com that start with links_, for example http://www.mywebsite.com/links_sony.aspx. I don't need any of the other links, e.g. mywebsite.com/index.aspx or conditions.asp. I currently use:

wget --spider --recursive --no-verbose --output-file="links.csv" http://www.mywebsite.com

Can you help me please? Thanks in advance

If you don't mind using a couple of other tools to coax wget, then you can try this bash script that employs awk, grep, wget and lynx:

#!/bin/bash
# Dump the page ($1), pull out the URLs, and keep only those matching the pattern ($2)
lynx --dump "$1" | awk '/http/{print $2}' | grep "$2" > /tmp/urls.txt
# Download each matching URL
while read -r url; do wget "$url"; done < /tmp/urls.txt

Save the above script as getlinks, make it executable (chmod +x getlinks), and then run it as

./getlinks 'http://www.mywebsite.com' 'links_'

The list of matching URLs is written to /tmp/urls.txt, and wget downloads each one into the current directory.

This approach does not pull in anything exotic; it simply reuses commonly available tools.

You may have to play with the quoting depending on which shell you are using. The above works in standard bash and does not depend on specific versions of these tools.

You could customize the

wget "$url"

line with appropriate switches to meet your specific needs, such as recursion, spidering, verbosity, etc. Insert those switches between wget and "$url".
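
For instance, here is a minimal sketch of that loop with the spider/verbosity switches from the question added, assuming you only want a crawl log appended to links.csv rather than the files themselves (the links.csv name is just carried over from the question):

# Sketch: same download loop, but only spidering each URL and appending wget's log to links.csv
while read -r url; do
    wget --spider --no-verbose --append-output="links.csv" "$url"
done < /tmp/urls.txt

With --spider nothing is actually saved to disk; drop that switch if you do want the files downloaded.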
