
Stop search engines from indexing specific parts of the page

I have a PHP page that renders a book of, let's say, 100 pages. Each page has a specific URL (e.g. /my-book/page-one, /my-book/page-two, etc.).

When the user flips pages, I change the URL with the History API, using url.js.

Since all the book content is rendered on the server side, the content is indexed by search engines (I'm referring especially to Google), but the URLs are wrong (e.g. it finds a snippet on page-two but the URL is page-one).

How can I stop search engines (at least Google) from indexing all the content on the page, so that they index only the visible book page?

Would it work if I rendered the content in a different way, for example <div data-page-number="1" data-content="Lorem ipsum..."></div>, and then converted it to the needed format on the JavaScript side? That would make the page slower, and in fact I'm not sure whether Google would still index the content changed by JavaScript.

The code looks like this:

<div data-page="1">Page 1</div>
<div data-page="2">Page 2</div>
<div data-page="3" class="current-page">Page 3</div>
<div data-page="4">Page 4</div>
<div data-page="5">Page 5</div>

The only visible div is the .current-page one. The same content is served on multiple URLs because that's needed so the user can flip between pages.

For example, /book/page/3 will render this piece of HTML, while /book/page/4 renders the same thing, the only difference being that the current-page class is added to the 4th element.

Google did index the different URLs, but it did so incorrectly: for example, the snippet Page 5 links to /book/page/2, which shows the user Page 2 (not Page 5).

How can I tell Google (and other search engines) that I only want the content in .current-page to be indexed?

As I understand it, the issue is that you have the same content at many URLs, such as:

www.my-awesome-domain.com/my-book/page/42

www.my-awesome-domain.com/my-book/page/7

And the visible content of the page is changed by JavaScript that runs when the user clicks certain elements on your site.

In this case you need to do two things:

  1. Mark your URLs as canonical pages in any of the ways described in this Google document: https://support.google.com/webmasters/answer/139066?hl=en
  2. Make sure each page loads into the same state after a full page refresh. For example, you can use a hash parameter when navigating, as described in the article here, or see here for an overview of the technique
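
The two steps above can be sketched roughly as follows. This is only an illustration, not the url.js API: the helper names (pageFromPath, canonicalTag, initialPage) are hypothetical, the /book/page/N URL shape is taken from the question, and the origin passed to canonicalTag is a placeholder. It uses the path instead of a hash parameter, since the question already has one URL per page.

```javascript
// Hypothetical helpers; the /book/page/N URL shape comes from the question.
const PAGE_PATH = /^\/book\/page\/(\d+)\/?$/;

// Return the page number encoded in a path, or null if it is not a book page.
function pageFromPath(pathname) {
  const match = PAGE_PATH.exec(pathname);
  return match ? Number(match[1]) : null;
}

// Build the canonical link tag for one page; `origin` is an assumed site origin.
function canonicalTag(origin, page) {
  return `<link rel="canonical" href="${origin}/book/page/${page}">`;
}

// On a full load (or F5), decide which page to show from the URL alone,
// so that a refresh always lands on the page the URL names.
function initialPage(pathname, fallback = 1) {
  const page = pageFromPath(pathname);
  return page === null ? fallback : page;
}
```

In a browser, initialPage(window.location.pathname) would pick the div that gets the current-page class, and the server would emit the matching canonicalTag in the head of each URL.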

Today Googlebot executes JavaScript, as announced on Google's official blog: https://webmasters.googleblog.com/2015/10/deprecating-our-ajax-crawling-scheme.html

So if you get the proper page behavior when hitting Refresh (F5) and specify the canonical URL for each page, the pages will be crawled correctly, and following a link from the search results will take you to the page it points to.

If you need more guidance on how to do this in url.js, please post another question (so it will be properly documented for others) and I will be glad to help.

The answer is really simple: you can't do it. There is no technical way to keep the same content under different URLs and ask search engines to index only part of it.

If you are OK with having only one page indexed, you can use canonical URLs, as suggested before. On every sub-page, you place a canonical URL that points to the main page.

You may find a "hack" that uses special tags used for Google Search Appliance: googleon and googleoff .

https://www.google.com/support/enterprise/static/gsa/docs/admin/70/gsa_doc_set/admin_crawl/preparing.html

The only issue is that this will most likely not work with Googlebot (at least, no one will guarantee that it will) or with any other search engine.

If you are specifically targeting Google, you can use the "googleoff" directive.

See Excluding Unwanted Text from the Index

Turns off all the attributes. Text between the tags is not indexed, is not associated with anchor text, or used for a snippet.

<!--googleoff: all--><div data-page="1">Page 1</div>
<div data-page="2">Page 2</div><!--googleon: all-->
<div data-page="3" class="current-page">Page 3</div>
<!--googleoff: all--><div data-page="4">Page 4</div>
<div data-page="5">Page 5</div><!--googleon: all-->
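
As a rough sketch, markup like the above could be generated by a small helper instead of being written by hand. The function name renderBook and its arguments are hypothetical, the page texts are assumed to be trusted HTML, and the googleoff/googleon comments are documented for the Google Search Appliance, so nothing here guarantees that a public crawler honors them.

```javascript
// Hypothetical helper: wraps every non-current page in googleoff/googleon comments.
// `pages` is an array of trusted HTML strings; `current` is the 1-based visible page.
function renderBook(pages, current) {
  return pages
    .map((text, index) => {
      const number = index + 1;
      if (number === current) {
        // The visible page is left outside any googleoff region.
        return `<div data-page="${number}" class="current-page">${text}</div>`;
      }
      return `<!--googleoff: all--><div data-page="${number}">${text}</div><!--googleon: all-->`;
    })
    .join("\n");
}
```

Unlike the hand-written example, this wraps each hidden page separately rather than grouping adjacent ones, which keeps the helper simple.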

If you want to hide the text from other search engines, you should use a JavaScript alternative, for instance loading the next or previous page into the DOM with an Ajax request when the user clicks a button.
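
A minimal sketch of that flow, under stated assumptions: flipTo is a hypothetical name, and the three callbacks (fetchPage, render, pushUrl) are injected so the logic can run outside a browser. How the page fragment is fetched is left to the caller, because the question does not describe such an endpoint.

```javascript
// Hypothetical sketch: load one page on demand, show it, then update the URL.
// `fetchPage(n)` must return a promise of the page's HTML,
// `render(html)` puts it in the DOM, `pushUrl(url)` updates the address bar.
async function flipTo(page, { fetchPage, render, pushUrl }) {
  const html = await fetchPage(page);
  render(html);
  // The /book/page/N URL shape is taken from the question.
  pushUrl(`/book/page/${page}`);
  return html;
}
```

In a browser, pushUrl could be (url) => history.pushState(null, "", url), and fetchPage could wrap fetch(...).then((r) => r.text()) against whatever endpoint the server exposes for a single page.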

I don't think you will be able to achieve what you are looking for.

I can't see how robots.txt would have any effect. Canonical tags don't work on divs.

Google has spoken about sites like these in the past and made some suggestions for indexing. Here are a couple of links that may help:

https://www.seroundtable.com/seo-single-page-12964.html

https://www.seroundtable.com/google-on-crawling-javascript-sites-progressive-web-apps-21737.html

Save the content in a JSON file that you do not render in the HTML. From the server, serve only the correct page: the content that is visible to the user.

When the user clicks the buttons (prev/next page links, etc.), use JavaScript to render the content you have in the JSON file and change the URL as you're already doing.

That way you know you always serve the right content from the server, and Googlebot will index the pages correctly.
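
The client side of this idea might look something like the sketch below. The JSON shape ({"pages": [...]}), the createBook name and the sample data are all assumptions made for illustration. In practice the JSON string would come from a separate request rather than a constant.

```javascript
// Hypothetical data: stands in for the contents of the separately loaded JSON file.
const bookJson = '{"pages": ["Page 1", "Page 2", "Page 3"]}';

// Build a small lookup object over the parsed book.
function createBook(json) {
  const { pages } = JSON.parse(json);
  return {
    pageCount: pages.length,
    // Return the markup and URL for a 1-based page number,
    // or null when the number is out of range.
    view(number) {
      if (!Number.isInteger(number) || number < 1 || number > pages.length) {
        return null;
      }
      return {
        url: `/book/page/${number}`,
        html: `<div data-page="${number}" class="current-page">${pages[number - 1]}</div>`,
      };
    },
  };
}
```

A prev/next click handler would call book.view(n), put the returned html into the page, and pass the returned url to the History API, while the server keeps rendering only the single page named by the requested URL.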

