I'm trying to create a script that detects whether a given URL points to a Joomla website.
For now I have:
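    from urllib.request import urlopen

    def is_joomla(url):
        manifest = url.rstrip('/') + '/administrator/manifests/files/joomla.xml'
        try:
            # get XML
            with urlopen(manifest, timeout=10) as resp:
                xml = resp.read().decode('utf-8', errors='replace')
        except (OSError, ValueError):  # unreachable, HTTP error, or malformed URL
            return False
        return 'joomla' in xml.lower()  # simplified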
Another option is to check for the string "joomla" in the lowercased HTML, but that is not reliable. I know that some Joomla pages can't be detected, but most of them should be.
Are there other signs that would help me detect whether it's Joomla? I don't care about the version.
I would check several points to make sure the page you're crawling is a Joomla site.
<meta name="generator" content="Joomla! - Open Source Content Management" />
I think if options 1 and 3 return true, you can be very sure that the crawled page is a Joomla site. Option 2 is not safe, because some people might delete their robots.txt.
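The generator check on the meta tag shown above can be sketched with the standard library alone. This is only a minimal sketch: the names `GeneratorParser` and `has_joomla_generator` are made up for illustration, and the function assumes the HTML has already been downloaded as a string.

```python
from html.parser import HTMLParser


class GeneratorParser(HTMLParser):
    """Collects the content of every <meta name="generator"> tag."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "generator":
            self.generators.append(attr.get("content") or "")


def has_joomla_generator(html):
    """Return True if a generator meta tag mentions Joomla."""
    parser = GeneratorParser()
    parser.feed(html)
    return any("joomla" in g.lower() for g in parser.generators)


sample = '<meta name="generator" content="Joomla! - Open Source Content Management" />'
print(has_joomla_generator(sample))
```

Looking only at the generator tag, rather than searching the whole page for "joomla", avoids false positives from pages that merely mention Joomla in their text. Site owners can remove the tag, so a negative result is not proof either way.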
You can see that a website is running Joomla by simply checking the response headers (X-CF-Powered-By) of the site. Here is a sample:
Response: HTTP/1.1 200 OK
Date: Fri, 28 Jul 2017 09:02:35 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
X-CF-Powered-By: CF-Joomla 0.1.5
P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
Expires: Wed, 02 Aug 2017 09:02:35 GMT
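A header check along these lines might look like the sketch below. The function name `joomla_header_hint` is hypothetical, and matching any header whose name contains "powered-by" is an assumption made here so that the sample above is covered; it is not a documented Joomla convention.

```python
def joomla_header_hint(headers):
    """Return True if a 'powered by' style header mentions Joomla.

    `headers` is any mapping of header names to values.
    """
    for name, value in headers.items():
        if "powered-by" in name.lower() and "joomla" in value.lower():
            return True
    return False


# Headers taken from the sample response above
sample_headers = {
    "Content-Type": "text/html; charset=utf-8",
    "X-CF-Powered-By": "CF-Joomla 0.1.5",
}
print(joomla_header_hint(sample_headers))
```

With the standard library, the headers of a live site could be obtained from `urllib.request.urlopen(url).headers`, which also offers `items()`. A missing header does not rule Joomla out, since a site is free not to send it.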
Before going ahead, you need to open a connection and download the page, unless the HTML file is already on disk. After that, you can use BeautifulSoup to parse it and look up the tag you need.
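As a rough sketch of that workflow, assuming the third-party `beautifulsoup4` package is installed: the helper names `load_html` and `generator_of` and the `cache_path` parameter are invented for this example, and the generator meta tag is used as the tag to look up only because it appears earlier in the thread.

```python
import os
from urllib.request import urlopen

from bs4 import BeautifulSoup


def load_html(url, cache_path):
    """Read the page from disk if it was saved before, otherwise download it."""
    if not os.path.exists(cache_path):
        with urlopen(url, timeout=10) as resp:
            data = resp.read()
        with open(cache_path, "wb") as fh:
            fh.write(data)
    with open(cache_path, "rb") as fh:
        return fh.read()


def generator_of(html):
    """Return the content of the generator meta tag, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # `name` clashes with find()'s own first parameter, so pass it via attrs
    tag = soup.find("meta", attrs={"name": "generator"})
    return tag.get("content") if tag else None
```

A call such as `generator_of(load_html(url, "page.html"))` would then return the generator string for a page that declares one, and `None` otherwise.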