Webmaster Forum

#1 | 01-03-2008, 06:09 AM | mitra (Banned)
How to stop Google from crawling https:// URLs

While checking the indexed pages of my site, I found that Google has
indexed many URLs with https://. This creates duplicate content: I found
two cached versions of the same page, one with http:// and another with
https://. I looked for the cause and discovered that some links on my
site use https://. I can't stop such URLs from being posted, as I have
no control over what my visitors post.

So my questions are:

1) How do I stop Google from crawling the https:// URLs?
2) How do I remove the URLs that are already indexed with https://?

Please help me with your suggestions as soon as possible. I really need them.
 
#2 | 01-03-2008, 06:31 AM | WorldwideTrading (Senior Member)
This is quite a simple one. You need to serve two different versions of robots.txt, depending on whether http or https was used.

I can show you a live example from my website:

Notice the following pages differ:
http://worldwidetrading.co.uk/robots.txt
https://worldwidetrading.co.uk/robots.txt

Why do they differ, you ask?
Because I wanted to stop Googlebot from crawling the same content via both http and https.


How did I do this?
It's actually quite simple. Create a second robots.txt file (I called mine robots_ssl.txt) and add entries to it that block all content.

Then add the following lines to the .htaccess file in the root of your web hosting:
# Serve robots_ssl.txt instead of robots.txt for requests on the SSL port (443)
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_ssl.txt [L]

If you don't have an .htaccess file, create a new one, and be sure to put these two lines at the top of it:
Options +FollowSymLinks
RewriteEngine on
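
One way to sanity-check the blocking file before relying on it is sketched below in Python. The post only says that robots_ssl.txt contains "entries to block all content", so the file body used here is an assumption, and example.com and the helper name is_blocked are purely illustrative.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents of robots_ssl.txt; the post does not show the actual file.
ROBOTS_SSL = """\
User-agent: *
Disallow: /
"""


def is_blocked(robots_body: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the robots.txt text forbids `agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Hypothetical paths on a hypothetical host, checked against the SSL file.
    for path in ("/", "/index.html", "/forum/thread-1.html"):
        print(path, is_blocked(ROBOTS_SSL, "https://example.com" + path))
```

If every path is reported as blocked, the file does what the rewrite rule needs it to do when it is served on port 443.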

I hope this info is useful to you.

Cheers,


Gareth

Last edited by WorldwideTrading; 01-03-2008 at 06:33 AM. Reason: Cant speel
 
#3 | 01-03-2008, 09:01 AM | Eire-Web Design (Contributing Member)
Also add a noindex,nofollow meta tag to the head of your SSL pages.
 
#4 | 01-06-2008, 10:29 PM | mitra (Banned)
Thank you, WorldwideTrading. You made it so simple for me. The article at http://www.seoworkers.com/seo-articl...and-https.html was also a great help. I implemented the rewrite to robots_ssl.txt through .htaccess, and it worked fine.

Last edited by mitra; 01-06-2008 at 10:32 PM. Reason: added appreciation to WorldwideTrading
 
#5 | 01-07-2008, 02:17 AM | mitra (Banned)
How do I remove the URLs that are already indexed with https://? I heard that Google can only remove pages that return a 404. I can't make those pages return a 404, because they come from the same section of the site as the normal pages.
 
#6 | 01-07-2008, 04:15 AM | Rankenstein (v7n Mentor)
Hi Mitra,

Quote:
Originally posted by Eire-Web Design:
and add noindex,nofollow meta tag into the header of your ssl pages as well.

Eire said it. Put this in the head tags of pages you don't want indexed:

<meta name="robots" content="noindex, nofollow">

That'll do it.
__________________
Clean, Fast and Tight
 
#7 | 01-07-2008, 11:01 AM | mitra (Banned)
Rankenstein, I've already blocked crawling of the https:// version. The only remaining question is how to remove the existing https:// URLs from Google's index. I have tried Google Webmaster Tools, but it seems only http:// versions can be removed there. What should I do about the https:// versions?
 
#8 | 01-08-2008, 02:27 AM | Rankenstein (v7n Mentor)
It is just a question of time - they will be removed the next time Googlebot comes round and finds itself blocked from the pages.

#9 | 01-08-2008, 04:15 AM | Rankenstein (v7n Mentor)
I'm surprised that Google Webmaster tools can't remove https though!

#10 | 01-08-2008, 10:05 AM | coolguy27 (Contributing Member)
You can use Disallow in robots.txt or a robots meta tag on the site that you want to block from crawling.

robots.txt Code:

User-agent: *
Disallow: /


Meta Code:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
 
#11 | 01-08-2008, 10:39 AM | mitra (Banned)
coolguy27, I've done that already. Thanks for your help. It was my mistake: Google Webmaster Tools can remove https:// URLs as well. I've already added the site with http://, and now I'm going to add it with https://. After adding the https:// version, if I remove the entire site with the removal tool, only the https:// URLs will be deleted. It won't affect the normal http:// URLs. I'm a bit nervous, though.
 
#12 | 01-08-2008, 12:28 PM | poksa (Contributing Member)
Quote:
Originally posted by coolguy27:
You can use Disallow in robots.txt or a robots meta tag [...]

Nice idea, coolguy... but what about the pages that are already indexed? Is there a way to remove them?
__________________
[I dont participate here, I just come back and change my sig occasionally]
 
#13 | 01-09-2008, 08:37 AM | WorldwideTrading (Senior Member)
Quote:
Originally posted by Rankenstein:
Eire said it. Put this in the head tags of pages you don't want indexed:

<meta name="robots" content="noindex, nofollow">

That'll do it.

Actually, NO IT WON'T! The problem occurs when Google indexes the same page twice, once under http and once via https. Your suggestion would prevent Google from indexing the page altogether, which is, uhm, wrong.
 
#14 | 01-09-2008, 08:39 AM | WorldwideTrading (Senior Member)
Quote:
Originally posted by coolguy27:
You can use Disallow in robots.txt or a robots meta tag [...]

This also will not work. By default, the server serves the same robots.txt file for a domain whether it is accessed via http or https, so your suggestion will not accomplish what was asked! What you have to do is serve different robots.txt files depending on whether http or https was used. The simplest way of doing this is to use an .htaccess rewrite, as I suggested above.
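
To illustrate why serving a different file per scheme helps: crawlers fetch robots.txt separately for the http:// and https:// versions of a host and apply each file only to URLs on that scheme. The Python sketch below models this with two assumed robots.txt bodies; the file contents, the example.com host and the helper name may_crawl are hypothetical and not taken from the thread.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt bodies, one per scheme (not from the thread).
ROBOTS_BY_SCHEME = {
    "http": "User-agent: *\nDisallow:\n",     # empty Disallow: allow everything
    "https": "User-agent: *\nDisallow: /\n",  # block everything
}


def may_crawl(url: str, agent: str = "Googlebot") -> bool:
    """Check `url` against the robots.txt assumed to be served on its own scheme."""
    scheme = urlsplit(url).scheme
    parser = RobotFileParser()
    parser.parse(ROBOTS_BY_SCHEME[scheme].splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/page.html", "https://example.com/page.html"):
        print(url, "crawlable" if may_crawl(url) else "blocked")
```

With these two bodies the http:// copy of a page stays crawlable while its https:// twin is blocked, which is the outcome the rewrite approach aims for.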
 
#15 | 01-09-2008, 10:31 AM | mitra (Banned)
I've submitted 63 https:// URLs to the URL removal tool in Google Webmaster Tools. It has been 24 hours since I submitted them, and the status still shows as pending. How long will it take to remove them? The "Analyze robots.txt" section shows that robots.txt was last downloaded 17 hours ago.
 
#16 | 01-10-2008, 03:38 AM | Rankenstein (v7n Mentor)
Quote:
Originally posted by WorldwideTrading:
Actually, NO IT WON'T! The problem occurs when Google indexes the same page twice, once under http and once via https. Your suggestion would prevent Google from indexing the page altogether, which is, uhm, wrong.

Doh! Yeah, you're right.

Serve a separate robots.txt on the secure port instead. That's Google's official advice, IIRC.

#17 | 01-10-2008, 05:06 AM | sleepyhead (Contributing Member)
content="index, nofollow"
 
#18 | 01-11-2008, 09:23 PM | Rankenstein (v7n Mentor)
The first answer, by WorldwideTrading, was spot on. This is a perfect example of me not reading the thread properly. Ignore anything after the second post. The thread was solved and we were all too dumb to know it. 10/10 for Gareth.

#19 | 01-16-2008, 08:20 AM | Member
If I were you, I would serve a secondary robots.txt file that disallows crawling on https://, and for http:// I would serve a robots.txt file that allows crawling.
 
#20 | 01-17-2008, 03:42 AM | EnomSoft.com (Junior Member)
Use robots.txt.
 
All times are GMT -7.