
Bash script to return domains instead of URLs

I have this bash script that I wrote to analyse the HTML of any given web page. What it is actually supposed to do is return the domains on that page. Currently it is returning the number of URLs on that web page.

#!/bin/sh

echo "Enter a url eg www.bbc.com:"
read url
content=$(wget "$url" -q -O -)
echo "Enter file name to store URL output"
read file
echo $content > $file
echo "Enter file name to store filtered links:"
read links
found=$(cat $file | grep -o -E 'href="([^"#]+)"' | cut -d '"' -f2 | sort | uniq | awk   '/http/' > $links)
output=$(egrep -o '^http://[^/]+/' $links | sort | uniq -c > out)
cat out

How can I get it to return the domains instead of the URLs? From my programming knowledge I know it's supposed to parse from the right, but I am a newbie at bash scripting. Can someone please help me? This is as far as I have gotten.

I know there's a better way to do this in awk, but you can do this with sed by appending this after your awk '/http/':

| sed -e 's;https\?://;;' | sed -e 's;/.*$;;'

Then you want to move your sort and uniq to the end of that.

So that the whole line will look like:

found=$(cat $file | grep -o -E 'href="([^"#]+)"' | cut -d '"' -f2 | awk   '/http/' | sed -e 's;https\?://;;' | sed -e 's;/.*$;;' | sort | uniq -c > out)
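To sanity-check what the sed stage does on its own, you can feed it a couple of sample URLs (the URLs below are made up for illustration; the \? in the sed pattern is a GNU sed extension):

printf '%s\n' 'http://www.bbc.com/news/world' 'https://static.example.org/a.css' | sed -e 's;https\?://;;' | sed -e 's;/.*$;;'

gives

www.bbc.com
static.example.org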

You can get rid of this line:

output=$(egrep -o '^http://[^/]+/' $links | sort | uniq -c > out)

EDIT 2: Please note that you might want to adapt the search patterns in the sed expressions to your needs. This solution considers only the http(s):// protocol and www. servers...

EDIT:
If you want counts and domains:

lynx -dump -listonly http://zelleke.com | \
  sed -n '4,$ s@^.*http[s]*://\([^/]*\).*$@\1@p' | \
   sort | \
     uniq -c | \
       sed 's/www.//'

gives

2 wordpress.org
10 zelleke.com

Original Answer:

You might want to use lynx for extracting links from the URL:

lynx -dump -listonly http://zelleke.com

gives

# blank line at the top of the output
References

   1. http://www.zelleke.com/feed/
   2. http://www.zelleke.com/comments/feed/
   3. http://www.zelleke.com/
   4. http://www.zelleke.com/#content
   5. http://www.zelleke.com/#secondary
   6. http://www.zelleke.com/
   7. http://www.zelleke.com/wp-login.php
   8. http://www.zelleke.com/feed/
   9. http://www.zelleke.com/comments/feed/
  10. http://wordpress.org/
  11. http://www.zelleke.com/
  12. http://wordpress.org/

Based on this output you can achieve the desired result with:

lynx -dump -listonly http://zelleke.com | \
  sed -n '4,$ s@^.*http://\([^/]*\).*$@\1@p' | \
   sort -u | \
     sed 's/www.//'

gives

wordpress.org
zelleke.com

You can remove the path from a URL with sed:

sed 's@http://@@; s@/.*@@'
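For example, using one of the URLs from the listing above:

echo 'http://www.zelleke.com/wp-login.php' | sed 's@http://@@; s@/.*@@'

gives

www.zelleke.com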

I also want to point out that these two lines are wrong:

found=$(cat $file | grep -o -E 'href="([^"#]+)"' | cut -d '"' -f2 | sort | uniq | awk   '/http/' > $links)
output=$(egrep -o '^http://[^/]+/' $links | sort | uniq -c > out)

You must use either redirection (> out) or command substitution $(), but not both at the same time, because the variables will be empty in this case.
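A minimal sketch that shows the effect (the file name out is just an example):

found=$(echo hello > out)   # echo's output goes to the file, not to the substitution
echo "found='$found'"       # prints: found=''
cat out                     # prints: hello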

This part

content=$(wget "$url" -q -O -)
echo $content > $file

would also be better written this way:

wget "$url" -q -O - > $file

You may be interested in this:

http://tools.ietf.org/html/rfc3986#appendix-B

It explains how to parse a URI using a regex.

So you can parse a URI from the left this way, and extract the "authority" part, which contains the domain and subdomain names:

sed -r 's_^([^:/?#]+:)?(//([^/?#]*))?.*_\3_g' |
  grep -Eo '[^\.]+\.[^\.]+$'   # piped after the sed: keeps just the last two dot-separated labels
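For example, with one of the URLs from the lynx listing earlier on this page:

echo 'http://www.zelleke.com/feed/' | sed -r 's_^([^:/?#]+:)?(//([^/?#]*))?.*_\3_g' | grep -Eo '[^\.]+\.[^\.]+$'

gives

zelleke.com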

This is also interesting:

http://www.scribd.com/doc/78502575/124/Extracting-the-Host-from-a-URL

Assuming that the URL always begins this way:

https?://(www\.)?

is really hazardous.
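For example, a grep anchored on that assumption silently drops protocol-relative and non-http links (the URLs below are made up):

printf '%s\n' '//cdn.example.net/lib.js' 'ftp://ftp.example.org/file.txt' 'http://example.com/' | grep -E '^https?://(www\.)?'

gives

http://example.com/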
