Webmaster Forum


01-03-2008, 06:09 AM   #1
mitra
How to stop Google from crawling https:// URLs

While checking the indexed pages of my site, I found that Google has
indexed many URLs with https://. This creates duplicate content: I
found two cached versions of the same page, one with http:// and
another with https://. I looked for the reason and discovered that
some links on my site use https://. I can't stop such URLs from being
posted, as I have no control over my visitors.

So my questions are:

1) How do I stop Google from crawling the https:// URLs?
2) How do I remove the URLs that are already indexed with https://?

Please help me with your suggestions as soon as possible.

01-03-2008, 06:31 AM   #2
WorldwideTrading
This is quite a simple one. What you need to do is serve two different versions of robots.txt, depending on whether http or https was used.

I can show you a live example from my website. Notice that the following pages differ:
http://worldwidetrading.co.uk/robots.txt
https://worldwidetrading.co.uk/robots.txt

Why do they differ, you ask?
Because I wanted to stop Googlebot from crawling the same content via both http and https.


How did I do this?
It's actually quite simple. Create a second robots.txt file (I called mine robots_ssl.txt) and add entries to it that block all content.

Then add the following lines to your .htaccess file (in the root of your web hosting):
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_ssl.txt [L]

If you don't have an .htaccess file, create a new one and be sure to put these two lines at the top of it:
Options +FollowSymLinks
RewriteEngine on

I hope this info is useful to you.

Cheers,


Gareth
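
To make the effect of the rewrite rule concrete, here is a minimal Python sketch of the same decision. It is only an illustration under assumptions: the file contents in ROBOTS_FILES are hypothetical (the post only says robots_ssl.txt blocks all content), and robots_file_for_port and can_crawl are made-up helper names. It uses the standard library's urllib.robotparser to check what a crawler would be allowed to fetch on each port.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file contents. The thread only states that robots_ssl.txt
# blocks all content; the plain robots.txt is assumed to allow everything.
ROBOTS_FILES = {
    "robots.txt": "User-agent: *\nDisallow:\n",
    "robots_ssl.txt": "User-agent: *\nDisallow: /\n",
}


def robots_file_for_port(port: int) -> str:
    """Mirror the rewrite rule: requests on port 443 get the SSL variant."""
    return "robots_ssl.txt" if port == 443 else "robots.txt"


def can_crawl(port: int, path: str, agent: str = "Googlebot") -> bool:
    """Parse the robots file chosen for this port and ask if path is allowed."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_FILES[robots_file_for_port(port)].splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for port in (80, 443):
        print(port, robots_file_for_port(port), can_crawl(port, "/index.html"))
```

The point of the sketch is that the page content never changes; only the answer to "which robots file does this port see" does, which is exactly what the RewriteCond on SERVER_PORT decides.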

Last edited by WorldwideTrading : 01-03-2008 at 06:33 AM. Reason: Cant speel

01-03-2008, 09:01 AM   #3
Eire-Web Design
Also add a noindex,nofollow meta tag to the head of your SSL pages.

01-06-2008, 10:29 PM   #4
mitra
Thank you, WorldwideTrading. You made it so simple for me. This link was great for me: http://www.seoworkers.com/seo-articl...and-https.html I implemented the rewrite to robots_ssl.txt through .htaccess and it worked fine.

Last edited by mitra : 01-06-2008 at 10:32 PM. Reason: added appreciation to WorldwideTrading

01-07-2008, 02:17 AM   #5
mitra
How do I remove the URLs that are already indexed with https://? I heard Google can only remove pages that return a 404. I can't make those pages return a 404, as they come from the same section of the site as the normal pages.

01-07-2008, 04:15 AM   #6
Rankenstein
Hi Mitra,

Quote:
Originally Posted by Eire-Web Design
and add noindex,nofollow meta tag into the header of your ssl pages as well.

Eire said it. Put this in the head section of the pages you don't want indexed:

<meta name="robots" content="noindex, nofollow">

That'll do it.

01-07-2008, 11:01 AM   #7
mitra
Rankenstein, I've already blocked crawling of the https:// version. The only remaining question is how to remove the existing https:// URLs from Google's index. I have tried Google Webmaster Tools, but it seems we can remove only the http:// versions there. What should I do about the https:// versions?

01-08-2008, 02:27 AM   #8
Rankenstein
It is just a question of time - they will be removed the next time Googlebot comes round and finds itself blocked from the pages.

01-08-2008, 04:15 AM   #9
Rankenstein
I'm surprised that Google Webmaster tools can't remove https though!

01-08-2008, 10:05 AM   #10
coolguy27
You can use a Disallow rule in robots.txt or a robots meta tag.

Apply one of these to the part of the site that you want to block from crawling.

robots.txt Code:

User-agent: *
Disallow: /


Meta Code:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

01-08-2008, 10:39 AM   #11
mitra
coolguy27, I've done that already. Thanks for your help. It was my mistake: Google Webmaster Tools can remove https:// URLs as well. I had already added the site with http://, and now I'm going to add it with https://. After adding the https:// version, if I remove the entire site with the removal tool, only the https:// URLs should be deleted. It shouldn't affect the normal http:// URLs. I'm a bit nervous.

01-08-2008, 12:28 PM   #12
poksa
Quote:
Originally Posted by coolguy27
You can use disallow on robot.txt OR METAS

put your site that you want to block from crawling..

robots.txt Code:

User-agent: *
Disallow: /


Meta Code:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">


Nice idea, coolguy... but what about the pages that are already indexed? Is there a way to remove them?

01-09-2008, 08:37 AM   #13
WorldwideTrading
Quote:
Originally Posted by Rankenstein
Hi Mitra,

Eire said it. Put this in the head tags of pages you don't want indexed:

<meta name="robots" content="noindex, nofollow">

That'll do it.

Actually, no it won't! The problem is that Google indexes the same page twice, once under http and once under https. Because the same page is served over both protocols, your suggestion would prevent Google from indexing the page altogether, which is wrong.
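
One way to keep the meta tag idea while avoiding that side effect is to emit the tag only when the page is requested over https, which is what "on your ssl pages" would require in practice. The Python sketch below shows the idea under assumptions: robots_meta_tag and render_head are hypothetical helper names, and the scheme is assumed to be handed in by whatever server or framework renders the page. A crawler only sees this tag on pages it is allowed to fetch, so it has no effect on URLs that robots.txt already blocks.

```python
from html import escape

NOINDEX_TAG = '<meta name="robots" content="noindex, nofollow">'


def robots_meta_tag(scheme: str) -> str:
    """Return the noindex tag for https requests, an empty string otherwise."""
    return NOINDEX_TAG if scheme.lower() == "https" else ""


def render_head(scheme: str, title: str) -> str:
    """Build a minimal <head>; the robots tag is added only for https."""
    parts = ["<head>", f"<title>{escape(title)}</title>"]
    tag = robots_meta_tag(scheme)
    if tag:
        parts.append(tag)
    parts.append("</head>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(render_head("http", "Home"))
    print(render_head("https", "Home"))
```

With this arrangement the http copy of a page carries no robots tag and stays indexable, while the https copy of the same page asks not to be indexed.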

01-09-2008, 08:39 AM   #14
WorldwideTrading
Quote:
Originally Posted by coolguy27
You can use disallow on robot.txt OR METAS

put your site that you want to block from crawling..

robots.txt Code:

User-agent: *
Disallow: /


Meta Code:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

This will not work either. On a typical setup the same robots.txt file is served for the domain whether it is requested via http or https, so your suggestion would block both versions and not accomplish what was asked! What you have to do is serve different robots.txt files depending on whether http or https was used. The simplest way of doing this is the .htaccess rewrite I suggested above.
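
To check whether a site really serves a different robots.txt on each protocol, the two files can be fetched and compared. The following Python sketch is one way to do that with the standard library; fetch_robots, normalize and serves_distinct_robots are hypothetical names chosen for this illustration, and the comparison simply ignores blank lines and surrounding whitespace.

```python
import urllib.request
from typing import List


def fetch_robots(scheme: str, host: str, timeout: float = 10.0) -> str:
    """Download /robots.txt from the given host over http or https."""
    url = f"{scheme}://{host}/robots.txt"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def normalize(body: str) -> List[str]:
    """Drop blank lines and surrounding whitespace so only rules are compared."""
    return [line.strip() for line in body.splitlines() if line.strip()]


def serves_distinct_robots(http_body: str, https_body: str) -> bool:
    """True when the two protocols return different sets of robots rules."""
    return normalize(http_body) != normalize(https_body)
```

For a host of your own, serves_distinct_robots(fetch_robots("http", host), fetch_robots("https", host)) should be true once the rewrite is in place, and false if both protocols still return the same file.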

01-09-2008, 10:31 AM   #15
mitra
I've submitted 63 https:// URLs to the URL removal tool in Google Webmaster Tools. It has been 24 hours since I submitted them, and the status still shows as pending. How long will it take to remove them? The "Analyze robots.txt" section shows that robots.txt was last downloaded 17 hours ago.

01-10-2008, 03:38 AM   #16
Rankenstein
Quote:
Originally Posted by WorldwideTrading
Actually NO IT WONT! The problem occurs when google indexes the same page twice, once under http and once via https. Your sugestion would prevent google indexing the page altogether which is uhmm wrong.

Doh! Yeah, you're right.

Put the robots.txt file on the secure port instead. That's Google's official advice, IIRC.

01-10-2008, 05:06 AM   #17
sleepyhead
content="index, nofollow"

01-11-2008, 09:23 PM   #18
Rankenstein
The first answer by Worldwide Trading was spot on. Perfect example of me not reading the thread properly. Ignore anything after the second post. Thread solved and we're all too dumb to know it. 10/10 for Gareth.

01-16-2008, 08:20 AM   #19
VirtualHoney
If I were you, I would serve a second robots.txt file for https:// that disallows crawling, and a robots.txt file for http:// that allows it.

01-17-2008, 03:42 AM   #20
EnomSoft.com
Use robots.txt.