
Linux script to return domains on a web page

I was tasked with this question: Write a bash script that takes a URL as its first argument and prints out statistics of the number of links per host/domain in the HTML of the URL.

So, for instance, given a URL like www.bbc.co.uk, it might print something like:

www.bbc.co.uk: 45
bbc.com: 1
google.com: 2
Facebook.com: 4

That is, it should analyse the HTML of the page, pull out all the links, examine the href attribute, decide which links are to the same domain (figure that one out, of course) and which are foreign, then produce statistics for the local ones and for the remote ones.

Rules: You may use any set of standard Linux commands in your script. You may not use any higher-level programming languages such as C or Python or Perl. You may, however, use awk, sed, etc.

I came up with the following solution:

#!/bin/sh

echo "Enter a url eg www.bbc.com:"
read url
content=$(wget "$url" -q -O -)
echo "Enter file name to store URL output"
read file
echo $content > $file
echo "Enter file name to store filtered links:"
read links
found=$(cat $file | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | sort | uniq | awk '/http/' > $links)
output=$(egrep -o '^http://[^/]+/' $links | sort | uniq -c > out)
cat out

I was then told: "I must look at the data, and then check that your program deals satisfactorily with all the scenarios. This reports URLs but not the domains." Can someone help me or point me in the right direction so that I can achieve my goal? What am I missing, or what is the script not doing? I thought I had made it work as required.

The output of your script is:

      7 http://news.bbc.co.uk/
      1 http://newsvote.bbc.co.uk/
      1 http://purl.org/
      8 http://static.bbci.co.uk/
      1 http://www.bbcamerica.com/
     23 http://www.bbc.com/
    179 http://www.bbc.co.uk/
      1 http://www.bbcknowledge.com/
      1 http://www.browserchoice.eu/

I think they mean that it should look more like:

      7 news.bbc.co.uk
      1 newsvote.bbc.co.uk
      1 purl.org
      8 static.bbci.co.uk
      1 www.bbcamerica.com
     23 www.bbc.com
    179 www.bbc.co.uk
      1 www.bbcknowledge.com
      1 www.browserchoice.eu
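
One way to get there, as a rough sketch rather than a definitive fix: strip the scheme and everything after the host before counting, and take the URL from the first argument instead of prompting for it. The helper names count_hosts and summarise are made up for this example. It assumes that only absolute http/https links in double-quoted href attributes matter (relative links, which are local, are ignored), that each distinct link is counted once as in your script, and that "local" means an exact match with the host of the page.

```shell
#!/bin/sh
# Sketch only: count_hosts and summarise are hypothetical helper names.

# Read HTML on stdin and print "count host" lines for absolute links.
count_hosts() {
    grep -o -E 'href="[^"]+"' |                      # every double-quoted href attribute
        cut -d'"' -f2 |                              # keep only the attribute value
        sort -u |                                    # count each distinct link once
        sed -n -E 's#^https?://([^/:?]+).*#\1#p' |   # keep the host of absolute links only
        tr 'A-Z' 'a-z' |                             # host names are case-insensitive
        sort | uniq -c
}

# Read "count host" lines on stdin and total local vs remote links.
# Assumption: a link is local only if its host equals $1 exactly.
summarise() {
    awk -v page="$1" '
        { if ($2 == page) local_n += $1; else remote_n += $1 }
        END { printf "local: %d\nremote: %d\n", local_n, remote_n }'
}

# Run against the URL given as the first argument, if there is one.
if [ "$#" -ge 1 ]; then
    # Host of the page itself: drop any scheme, then anything after the host.
    page_host=$(printf '%s\n' "$1" | sed -E 's#^[A-Za-z]+://##; s#[/:?].*##')
    counts=$(wget -q -O - "$1" | count_hosts)
    printf '%s\n' "$counts"
    printf '%s\n' "$counts" | summarise "$page_host"
fi
```

Saved under a name such as linkstats.sh (hypothetical), it would be run as sh linkstats.sh http://www.bbc.co.uk/. To print the counts as host: count, like the example in the question, pipe the output of count_hosts through awk '{print $2 ": " $1}'.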
