
mod_rewrite rule: How to block direct access to URL that contains specific word?

I need to handle the following scenario with a mod_rewrite rule.

If a visitor to my website follows this path (see below), then after visiting the first page he/she should also be able to visit the second (more formatted) URL:

http://www.example.com/page/
http://www.example.com/page/?jump2=24&autoplay=1#anchor

But if the visitor comes straight to a formatted URL, it should be blocked:

http://www.example.com/page/?jump2=24&autoplay=1#anchor

How do I do this in an .htaccess file? I have tons of URLs like these, and I need to stop search engines and bots from landing directly on those formatted pages. The traffic is literally killing my server.

You can use cookies to check whether the user has already visited the page.

Set a cookie when the index page is requested without a query string. Then check that the cookie is present when the user requests the page with a query string.

To keep search engines away from these URLs, use robots.txt.
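The cookie check described above could be sketched in .htaccess roughly as follows. This is only a sketch under assumptions: the cookie name `visited` and the 30-minute lifetime are made up for illustration, the rules are assumed to live in the document root's .htaccess, and `jump2` is taken from the example URLs in the question. Note that the `#anchor` part is never sent to the server, so only the query string can be tested.

```apache
RewriteEngine On

# Plain visit to /page/ (empty query string): set a marker cookie.
# CO=name:value:domain:lifetime-in-minutes:path
# "visited" and the 30-minute lifetime are hypothetical choices.
RewriteCond %{QUERY_STRING} ^$
RewriteRule ^page/?$ - [CO=visited:1:www.example.com:30:/]

# Request carrying the jump2 parameter but no marker cookie:
# answer with 403 Forbidden.
RewriteCond %{QUERY_STRING} (^|&)jump2= [NC]
RewriteCond %{HTTP_COOKIE} !(^|;\s*)visited=1(;|$)
RewriteRule ^page/?$ - [F]
```

Clients that do not send cookies back (which includes most crawlers) will always get the 403 on the formatted URLs, which seems to be what the question asks for, but it also affects real visitors who have cookies disabled.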

HTTP is stateless, so this is not a simple question. You're going to have to basically fudge it in some way, so there's no simple drop-in solution, and having a cookie as suggested in the other answer is a reasonable approach (a session cookie or something else). If you're ruling out cookies, then it reduces options a lot. But...

You could generate a token on the page and then check for that token in the URLs. The token can be based on the date, so it changes regularly, and you might accept only today's and yesterday's tokens. If a valid token isn't present in the parameters, the request gets rejected. You can use a RewriteMap to look up the current tokens from the rules in your .htaccess (the RewriteMap directive itself has to be declared in the server or virtual host configuration; it is not allowed in .htaccess).
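A rough sketch of the token check, assuming a plain-text map file that some external job (for example a daily cron script, not shown) rewrites with today's and yesterday's tokens. The map name `tokens`, the file path, the `token` parameter name, and the `token_ok` environment variable are all hypothetical.

The map is declared once in the server or virtual host configuration:

```apache
# Server / virtual host config. Each line of the (hypothetical) file
# is "<token> ok", e.g. "a1b2c3 ok".
RewriteMap tokens "txt:/etc/apache2/valid_tokens.txt"
```

The rules that use it can then live in .htaccess:

```apache
RewriteEngine On

# If the request carries a token that exists in the map, flag it as valid.
# %1 is the token captured by the second condition.
RewriteCond %{QUERY_STRING} (?:^|&)jump2= [NC]
RewriteCond %{QUERY_STRING} (?:^|&)token=([A-Za-z0-9]+)(?:&|$)
RewriteCond ${tokens:%1|invalid} !=invalid
RewriteRule ^page/?$ - [E=token_ok:1]

# Any formatted request that was not flagged above is rejected.
RewriteCond %{QUERY_STRING} (?:^|&)jump2= [NC]
RewriteCond %{ENV:token_ok} !=1
RewriteRule ^page/?$ - [F]
```

The page itself would have to append the current token to the links it generates, e.g. `?jump2=24&autoplay=1&token=...`; how the token is derived from the date is left open here.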

Another option worth mentioning is that bad bots can be blocked precisely because they ignore robots.txt. You can set up a bot trap script, linked from every page and hidden with CSS, and block the IP instantly from that script when it is visited (mine blocks at the firewall). The trap URL is disallowed in robots.txt, so well-behaved crawlers never request it.

Once that is in place and robots.txt abusers get banned instantly, you can put something like http://www.example.com/page/? in your robots.txt (robots.txt rules match the start of the URL, not the complete URL) and also set canonical URLs in your pages. Other search engine options become useful as well: you can disallow the robots you don't want but that do respect robots.txt, and set Google Search Console to ignore those display parameters.
