Python web scraping - print only part of url
I have a Python web scraping program that gets all the links from a given site, and I've managed to print the domain name of each link and the path after it.
The code:
import urllib
import re
import mechanize
from bs4 import BeautifulSoup
import urlparse
import cookielib
url = "http://www.sparkbrowser.com"
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.set_handle_redirect(True)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
page = br.open(url, timeout=5)
htmlcontent = page.read()
soup = BeautifulSoup(htmlcontent)
for link in br.links(text_regex=re.compile('^((?!IMG).)*$')):
    # Resolve relative links against the page they were found on
    newurl = urlparse.urljoin(link.base_url, link.url)
    base = link.base_url
    print base," - ",newurl
It gives me results like this:
http://www.sparkbrowser.com - http://www.sparkbrowser.com
http://www.sparkbrowser.com - http://sparkbrowser.com
http://www.sparkbrowser.com - http://www.sparkbrowser.com/index.php
http://www.sparkbrowser.com - http://www.sparkbrowser.com/download.php
http://www.sparkbrowser.com - http://www.sparkbrowser.com/about.php
http://www.sparkbrowser.com - http://www.sparkbrowser.com/features.php
http://www.sparkbrowser.com - http://www.sparkbrowser.com/spark.php
etc....
I was wondering how to get only sparkbrowser.com or sparkbrowser from the given address?
I know how to separate the domain name, http://www.sparkbrowser.com, and the path, but I don't know whether it is possible to print only the parts of the URL I mentioned.
I've tried something with regex, but I was not successful.
Any help is welcome.
Use the urlparse.urlsplit() function to split the URL into its constituent parts:
>>> from urlparse import urlsplit
>>> urlsplit('http://www.sparkbrowser.com/index.php')
SplitResult(scheme='http', netloc='www.sparkbrowser.com', path='/index.php', query='', fragment='')
>>> _.netloc
'www.sparkbrowser.com'
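On Python 3 the same function lives in urllib.parse rather than urlparse. A minimal sketch, assuming Python 3 and using a made-up helper name (split_link), applied to a few of the links from the question:

```python
from urllib.parse import urlsplit


def split_link(url):
    """Return (host, path) for a URL; split_link is an illustrative name."""
    parts = urlsplit(url)
    # .hostname is the lower-cased host without any port or credentials
    return parts.hostname, parts.path


links = [
    "http://www.sparkbrowser.com",
    "http://sparkbrowser.com",
    "http://www.sparkbrowser.com/index.php",
]

for link in links:
    host, path = split_link(link)
    # Show "/" when the URL has no explicit path
    print(host, "-", path or "/")
```

Using .hostname instead of .netloc is a small design choice here: it drops any port number or user:password@ prefix, which would otherwise end up in the domain.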
You can then split the .netloc value further if desired:
>>> res = urlsplit('http://www.sparkbrowser.com/index.php')
>>> '.'.join(res.netloc.split('.')[-2:])
'sparkbrowser.com'
Or, better, use the publicsuffix library, which also handles multi-part suffixes such as co.uk, to extract the domain name together with its public suffix:
>>> from publicsuffix import PublicSuffixList
>>> psl = PublicSuffixList()
>>> psl.get_public_suffix(res.netloc)
'sparkbrowser.com'
>>> psl.get_public_suffix('www.example.domain.co.uk')
'domain.co.uk'
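To see why a suffix list matters, and how to get the bare sparkbrowser as well, here is a rough Python 3 sketch that needs no third-party package. The names registrable_domain, bare_name and ILLUSTRATIVE_SUFFIXES are made up for this example, and the suffix set is a tiny hand-picked sample, not the real Public Suffix List, so treat it as an illustration of the idea rather than a replacement for publicsuffix:

```python
from urllib.parse import urlsplit

# Tiny hand-picked sample for illustration only; the real list is far larger.
ILLUSTRATIVE_SUFFIXES = {"com", "org", "net", "uk", "co.uk"}


def registrable_domain(url, suffixes=ILLUSTRATIVE_SUFFIXES):
    """Return the longest known suffix plus one label to its left, or None."""
    host = urlsplit(url).hostname
    if not host:
        return None
    labels = host.split(".")
    # Try the longest candidate suffix first so "co.uk" wins over "uk".
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return ".".join(labels[i - 1:])
    return None


def bare_name(url):
    """Return only the first label of the registrable domain."""
    domain = registrable_domain(url)
    return domain.split(".")[0] if domain else None


print(registrable_domain("http://www.sparkbrowser.com/index.php"))
print(bare_name("http://www.sparkbrowser.com/index.php"))
print(registrable_domain("http://www.example.domain.co.uk"))
```

With these inputs the three calls yield sparkbrowser.com, sparkbrowser and domain.co.uk. Simply taking the last two labels would return co.uk for the third URL, which is the case a suffix list is meant to handle.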
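newurl.split('.com')[1] should do the trick. Note, however, that index [1] is the part after .com (the path, or an empty string), while index [0] gives http://www.sparkbrowser, and the approach only works for .com addresses.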