 |
|
01-03-2008, 06:09 AM
|
#1 (permalink)
|
|
Contributing Member
Join Date: 12-23-06
Location: India
Posts: 83
|
How to stop Google from crawling https:// URLs
While checking the indexed pages of my site, I found that lots of
URLs with https:// are indexed by Google. This is creating duplicate
content, as I found two cached versions of the same page, one with http://
and another with https://. I looked for the reason and discovered
that some links on my site use https://. I can't
stop such URLs being posted, as I don't have any control over my
visitors.
So my questions are:
1) How do I stop google from crawling the https:// urls?
2) How do I remove the urls that are already indexed with https://?
Please help me with your valuable suggestions asap. I'm in need of it.
|
|
|
01-03-2008, 06:31 AM
|
#2 (permalink)
|
|
Contributing Member
Join Date: 01-22-06
Location: Exeter, East Devon, England, UK
Posts: 770
Latest Blog: None
|
This is quite a simple one. What you need to do is serve two different versions of robots.txt, depending on whether http or https has been used.
I can show you a live example from my website:-
Notice the following pages differ:
http://worldwidetrading.co.uk/robots.txt
https://worldwidetrading.co.uk/robots.txt
Why do they differ, you ask?
Because I wanted to block googlebot from crawling the same info via http and https.
How did I do this?
It's actually quite simple. You create a second robots.txt file (I called mine robots_ssl.txt) and add entries to it that block all content.
Then add the following lines to your .htaccess file (in the root of your webhosting).
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_ssl.txt [L]
If you don't have an .htaccess file, create a new one and be sure to put these 2 lines at the top of it:
Options +FollowSymLinks
RewriteEngine on
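As a quick sanity check of the idea, here is a minimal sketch that uses Python's standard urllib.robotparser to show how a crawler would read the two files. The file contents are assumptions (the post only says robots_ssl.txt blocks all content), and example.com and the helper name is_crawlable are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents of the normal robots.txt: an empty Disallow allows everything.
ROBOTS_HTTP = """\
User-agent: *
Disallow:
"""

# Assumed contents of robots_ssl.txt: block all content for every crawler.
ROBOTS_SSL = """\
User-agent: *
Disallow: /
"""


def is_crawlable(robots_text: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text lets the agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, url)


# The http copy of a page stays crawlable, the https copy does not.
print(is_crawlable(ROBOTS_HTTP, "http://example.com/page.html"))
print(is_crawlable(ROBOTS_SSL, "https://example.com/page.html"))
```

The same function can be pointed at whatever rules you actually put in each file, to confirm that only the https:// version is blocked.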
I hope this info is useful to you.
Cheers,
Gareth
Last edited by WorldwideTrading : 01-03-2008 at 06:33 AM.
Reason: Cant speel
|
|
|
01-03-2008, 09:01 AM
|
#3 (permalink)
|
|
Contributing Member
Join Date: 09-29-07
Location: Dublin, Ireland
Posts: 89
|
And add a noindex,nofollow meta tag to the head of your SSL pages as well.
|
|
|
01-06-2008, 10:29 PM
|
#4 (permalink)
|
|
Contributing Member
Join Date: 12-23-06
Location: India
Posts: 83
|
Thank you, WorldwideTrading. You made it so simple for me. The article at http://www.seoworkers.com/seo-articl...and-https.html was also a great help. I implemented the rewrite to robots_ssl.txt through .htaccess and it worked fine for me.
Last edited by mitra : 01-06-2008 at 10:32 PM.
Reason: added appreciation to WorldwideTrading
|
|
|
01-07-2008, 02:17 AM
|
#5 (permalink)
|
|
Contributing Member
Join Date: 12-23-06
Location: India
Posts: 83
|
How do I remove the URLs that are already indexed with https://? I heard Google can only remove pages that return a 404. I can't make those pages return a 404, as they are served from the same place as the normal http:// pages.
|
|
|
01-07-2008, 04:15 AM
|
#6 (permalink)
|
|
Moderator
Join Date: 11-14-05
Location: Manchester
Posts: 3,454
Latest Blog: None
|
Hi Mitra,
Quote:
Originally Posted by Eire-Web Design
And add a noindex,nofollow meta tag to the head of your SSL pages as well.
|
Eire said it. Put this in the head tags of pages you don't want indexed:
<meta name="robots" content="noindex, nofollow">
That'll do it.
__________________
Clean, Fast and Tight
|
|
|
01-07-2008, 11:01 AM
|
#7 (permalink)
|
|
Contributing Member
Join Date: 12-23-06
Location: India
Posts: 83
|
Rankenstein, I've already blocked crawling of the https:// version. The only remaining point is how to remove the existing https:// URLs from the Google index. I have tried with Google Webmaster Tools, but it seems we can only remove http:// versions there. What should I do about the https:// versions?
|
|
|
01-08-2008, 02:27 AM
|
#8 (permalink)
|
|
Moderator
Join Date: 11-14-05
Location: Manchester
Posts: 3,454
Latest Blog: None
|
It is just a question of time - they will be removed the next time Googlebot comes round and finds itself blocked from the pages.
__________________
Clean, Fast and Tight
|
|
|
01-08-2008, 04:15 AM
|
#9 (permalink)
|
|
Moderator
Join Date: 11-14-05
Location: Manchester
Posts: 3,454
Latest Blog: None
|
I'm surprised that Google Webmaster tools can't remove https though!
__________________
Clean, Fast and Tight
|
|
|
01-08-2008, 10:05 AM
|
#10 (permalink)
|
|
Contributing Member
Join Date: 03-23-07
Posts: 1,371
|
You can use Disallow in robots.txt OR meta tags.
Put them on the site that you want to block from crawling.
robots.txt Code:
User-agent: *
Disallow: /
Meta Code:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
|
|
|
01-08-2008, 10:39 AM
|
#11 (permalink)
|
|
Contributing Member
Join Date: 12-23-06
Location: India
Posts: 83
|
coolguy27, I've done that already. Thanks for your help. It was my mistake: Google Webmaster Tools can remove https:// URLs as well. I've already added the site with http:// and now I'm going to add it with https://. After adding https://, if I remove the entire site with the removal tool, only the https:// URLs will be deleted. It won't affect the normal URLs with http://. I'm a bit nervous.
|
|
|
01-08-2008, 12:28 PM
|
#12 (permalink)
|
|
Contributing Member
Join Date: 07-12-07
Posts: 433
Latest Blog: None
|
Quote:
Originally Posted by coolguy27
You can use Disallow in robots.txt OR meta tags.
Put them on the site that you want to block from crawling.
robots.txt Code:
User-agent: *
Disallow: /
Meta Code:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
|
Nice idea, Coolguy... but how about the pages that are already indexed? Is there a way to remove them?
|
|
|
01-09-2008, 08:37 AM
|
#13 (permalink)
|
|
Contributing Member
Join Date: 01-22-06
Location: Exeter, East Devon, England, UK
Posts: 770
Latest Blog: None
|
Quote:
Originally Posted by Rankenstein
Hi Mitra,
Eire said it. Put this in the head tags of pages you don't want indexed:
<meta name="robots" content="noindex, nofollow">
That'll do it.
|
Actually NO IT WON'T! The problem occurs when Google indexes the same page twice, once under http and once via https. Your suggestion would prevent Google indexing the page altogether, which is, uhmm, wrong.
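To make the point concrete: because the same page template answers both protocols, the tag would only help if it were emitted for https requests and left out for http. A rough sketch of that condition follows; the function names and the way the scheme reaches the template are hypothetical, not anything from the original site.

```python
from html import escape
from typing import Optional

NOINDEX_TAG = '<meta name="robots" content="noindex, nofollow">'


def robots_meta_for(scheme: str) -> Optional[str]:
    """Return the noindex tag for https requests and nothing for http."""
    return NOINDEX_TAG if scheme.lower() == "https" else None


def render_head(title: str, scheme: str) -> str:
    """Build a minimal <head>, adding the robots tag only on https."""
    parts = ["<title>" + escape(title) + "</title>"]
    tag = robots_meta_for(scheme)
    if tag:
        parts.append(tag)
    return "<head>\n" + "\n".join(parts) + "\n</head>"


print(render_head("Home", "http"))
print(render_head("Home", "https"))
```

Keep in mind that a crawler only sees this tag if it is allowed to fetch the page, so if robots.txt already blocks everything on https, the tag is never read.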
|
|
|
01-09-2008, 08:39 AM
|
#14 (permalink)
|
|
Contributing Member
Join Date: 01-22-06
Location: Exeter, East Devon, England, UK
Posts: 770
Latest Blog: None
|
Quote:
Originally Posted by coolguy27
You can use Disallow in robots.txt OR meta tags.
Put them on the site that you want to block from crawling.
robots.txt Code:
User-agent: *
Disallow: /
Meta Code:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
|
This also will not work. By default the same robots.txt file is served for a domain whether it is accessed via http or https, so a blanket Disallow would block both versions. Your suggestion will not accomplish what was asked! What you have to do is serve different robots.txt files depending on whether http or https has been used. The simplest way of doing this is to use an .htaccess rewrite as I suggested above.
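Once the rewrite is in place, one way to confirm that the two protocols really return different rules is to fetch both files and compare them. The sketch below uses only the Python standard library; the helper names are made up for illustration, and check_host needs network access to a host you control.

```python
from urllib.request import urlopen


def robots_url(scheme: str, host: str) -> str:
    """Build the robots.txt URL for one protocol of a host."""
    return scheme + "://" + host + "/robots.txt"


def normalize(text: str) -> list:
    """Keep only the rule lines: drop comments, blank lines and padding."""
    rules = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line:
            rules.append(line)
    return rules


def serves_distinct_rules(http_text: str, https_text: str) -> bool:
    """Return True if the two robots.txt bodies contain different rules."""
    return normalize(http_text) != normalize(https_text)


def fetch(url: str, timeout: float = 10.0) -> str:
    """Download a URL and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def check_host(host: str) -> bool:
    """Fetch both robots.txt variants for a host and report whether they differ."""
    http_text = fetch(robots_url("http", host))
    https_text = fetch(robots_url("https", host))
    return serves_distinct_rules(http_text, https_text)
```

Calling check_host with your own domain should return True if the https:// request is being rewritten to robots_ssl.txt, and False if both protocols still serve the same file.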
|
|
|
01-09-2008, 10:31 AM
|
#15 (permalink)
|
|
Contributing Member
Join Date: 12-23-06
Location: India
Posts: 83
|
I've submitted 63 https:// URLs in the Google Webmaster Tools URL removal tool. It has been 24 hours since I submitted them. How long will it take to remove those URLs? The status is showing as pending. The Analyze robots.txt section shows that robots.txt was last downloaded 17 hours ago.
|
|
|
01-10-2008, 03:38 AM
|
#16 (permalink)
|
|
Moderator
Join Date: 11-14-05
Location: Manchester
Posts: 3,454
Latest Blog: None
|
Quote:
Originally Posted by WorldwideTrading
Actually NO IT WON'T! The problem occurs when Google indexes the same page twice, once under http and once via https. Your suggestion would prevent Google indexing the page altogether, which is, uhmm, wrong.
|
Doh! Yeah, you're right.
Put a separate robots.txt on the secure port instead. That's Google's official advice, IIRC.
__________________
Clean, Fast and Tight
|
|
|
01-10-2008, 05:06 AM
|
#17 (permalink)
|
|
Contributing Member
Join Date: 07-07-07
Location: Phoenix, Arizona
Posts: 503
|
content="index, nofollow"
|
|
|
01-11-2008, 09:23 PM
|
#18 (permalink)
|
|
Moderator
Join Date: 11-14-05
Location: Manchester
Posts: 3,454
Latest Blog: None
|
The first answer by Worldwide Trading was spot on. Perfect example of me not reading the thread properly. Ignore anything after the second post. Thread solved and we're all too dumb to know it. 10/10 for Gareth.
__________________
Clean, Fast and Tight
|
|
|
01-16-2008, 08:20 AM
|
#19 (permalink)
|
|
Contributing Member
Join Date: 12-12-07
Posts: 100
Latest Blog: None
|
If I were you, I would serve a secondary robots.txt file that disallows crawling on https://, and for http:// I would serve a robots.txt file that allows it.
|
|
|
01-17-2008, 03:42 AM
|
#20 (permalink)
|
|
Junior Member
Join Date: 01-14-08
Location: Karachi Pakistan
Posts: 11
Latest Blog: None
|
Use robots.txt
|
|
| |