How To Get Your Site Out Of The Supplemental Index -Krishna De
I met Krishna De in Cork last month. She gave a fantastic presentation on marketing and leveraging the Internet to achieve your business goals. In fact, without prejudice to any of the other speakers, I found Krishna’s topic the most interesting of the day. Krishna also availed of my offer of a free site review. So without further ado…
KrishnaDe.com
I have to say I have always admired Krishna’s website. It has been well polished from the get-go, and the homepage speaks ‘professionalism’ to me:
If I were to find any fault it would be with the footer – I can’t easily discern between text and links. But that would just be nit-picking.
More than meets the eye
It was only when I sent in a spider that the true size of Krishna’s site became apparent. I knew that her blog has been on-line for a number of years and so expected the blog to be quite extensive. But I hadn’t expected this:
Crawler 1: 2,306 internal pages
Crawler 2: 2,604 pages (some external)
A look at Google’s index shows that Krishna’s site has a high number of pages in the supplemental index:
Pages Indexed: 1,330
Pages Supplemental: 964
That’s a particularly high ratio of supplemental to indexed pages, and to me this is the most pressing issue for Krishna.
A robots-eye view
Here’s Krishna’s robots.txt file:
User-agent: *
Disallow: /_mm/
Disallow: /_notes/
Disallow: /_baks/
Disallow: /MMWIP/
Disallow: /audio-for/
Disallow: /private/
Disallow: /onlinebrand/

User-agent: googlebot
Disallow: *.csi
When I look at some of the files that have made their way into the supplemental index I can see immediately that many should not be indexed in the first place.
HOLD PRESS – I’ve just noticed that Krishna’s site has been hacked:
Those links at the top of the page shouldn’t be there. That’s taken from Google’s cached version of the page. Here’s the original page. This type of hacking is normally carried out by altering the .htaccess file to cloak your pages for GoogleBot. Normal users are shown the second page, while Google sees the page with the links.
I’ve seen this hack a lot recently. The best medicine is to make sure that your software is up-to-date. There have been issues with WordPress, and that’s why the WordPress guys are very much on the ball with updates. You have to carefully check your server to see what else has been left around. The first file I would check is .htaccess, although in this case I have a feeling there may be a bit more going on.
I can’t tell for sure whether Krishna has fixed this. This hack might be a bit more elaborate than normal user-agent sniffing. When I access the page as GoogleBot I get the clean version, so the hack has either been cleaned up, or is using IP delivery or a reverse DNS lookup to cloak only for the real GoogleBot. I sent Krishna an email as soon as I found this, so hopefully she already knows about it and has it patched.
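A rough way to check for simple user-agent cloaking yourself is to fetch the same page twice, once with a browser User-Agent and once identifying as Googlebot, then diff the links found in each version. Here is a minimal sketch in Python; the sample HTML and spam URL are made up for illustration, and as noted above it will not catch cloaks that key off IP addresses rather than the User-Agent header:

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Googlebot's published User-Agent string (naive cloaks key off this)
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag seen in a page."""
    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)

def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links

def fetch(url, user_agent):
    """Fetch a page while presenting the given User-Agent string."""
    req = Request(url, headers={"User-Agent": user_agent})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")

def injected_links(normal_html, bot_html):
    """Links that appear only in the version served to 'Googlebot'."""
    return extract_links(bot_html) - extract_links(normal_html)

# Demo with inline HTML; a live check would call fetch(url, ...) twice,
# once with a browser UA and once with GOOGLEBOT_UA.
normal = '<p><a href="/blog/">Blog</a></p>'
cloaked = '<p><a href="http://spam.example/pills">pills</a> <a href="/blog/">Blog</a></p>'
print(injected_links(normal, cloaked))
```

If the two versions differ in links you never placed there, the site is cloaking something for the bot.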
Back to work…
There’s not a lot I can do while I wait to hear back from Krishna. So I’m going to go ahead with what I think Krishna should do to fix the supplemental issues.
The crawler found 2,306 resources on Krishna’s site. It also found about 100 cases of duplicate content covering about 250 pages (the homepage alone was accessible via 4 URLs). Most of the duplicate content came from the trailing-slash problem. Krishna can solve most of this by installing a small WordPress plugin called Permalink Redirect.
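Duplicates like these are easy to spot in a crawl report: normalise each URL to a canonical key and group URLs that share a key. A small sketch, with made-up example URLs mirroring the homepage-reachable-four-ways situation above:

```python
from collections import defaultdict
from urllib.parse import urlsplit

def canonical(url):
    """Collapse common duplicate-URL variants into one canonical key:
    lower-case the host, drop /index.html, drop any trailing slash."""
    parts = urlsplit(url)
    path = parts.path
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    path = path.rstrip("/") or "/"
    return (parts.netloc.lower(), path, parts.query)

def duplicate_groups(urls):
    """Return lists of crawled URLs that resolve to the same canonical key."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonical(url)].append(url)
    return [g for g in groups.values() if len(g) > 1]

# Illustrative crawl output: four spellings of the homepage,
# plus a trailing-slash duplicate of one post.
crawled = [
    "http://www.example.com/",
    "http://www.example.com/index.html",
    "http://www.example.com",
    "http://WWW.example.com/",
    "http://www.example.com/blog/post",
    "http://www.example.com/blog/post/",
]
for group in duplicate_groups(crawled):
    print(group)
```

The fix itself is a 301 redirect from each variant to the one canonical URL, which is exactly what the Permalink Redirect plugin does for WordPress.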
Next step, Krishna needs to update that robots.txt file. I would add in the following to stop Google crawling certain areas of the site:
User-agent: *
Disallow: /_mm/
Disallow: /_notes/
Disallow: /_baks/
Disallow: /MMWIP/
Disallow: /audio-for/
Disallow: /private/
Disallow: /onlinebrand/
Disallow: /learningzone/
Disallow: /blog/wp-content/plugins/
User-agent: googlebot
Disallow: *.csi
Somewhere in Krishna’s blog she has linked to her plug-in directory. The result is that Google has indexed a tonne of files from her WordPress plug-in directory. This has two effects:
- it increases the site size, and therefore the PageRank needed to carry each page;
- it decreases the PageRank passed to each page, as there are more internal links than needed.
So not only should Krishna remove the links to those pages, she should also make sure that the bots no longer crawl resources that shouldn’t be in the index. The two most obvious offenders I could see for low-value filler content were Learning Zone (/learningzone/) and the plug-in directory (/blog/wp-content/plugins/). So I’ve disallowed the bots from those areas.
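Before deploying a new robots.txt it is worth sanity-checking that it blocks what you intend and nothing else. Python’s standard-library parser can do this against the plain path-prefix rules (it does not understand wildcards, so the `*.csi` rule is left out of this trimmed copy):

```python
from urllib.robotparser import RobotFileParser

# Trimmed copy of the prefix rules proposed above; the *.csi wildcard
# rule is omitted because RobotFileParser only matches plain prefixes.
ROBOTS_TXT = """\
User-agent: *
Disallow: /_mm/
Disallow: /private/
Disallow: /learningzone/
Disallow: /blog/wp-content/plugins/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Blog posts should stay crawlable; the Learning Zone and the
# plug-in directory should now be off-limits.
for path in ("/blog/a-post/",
             "/learningzone/calendar/",
             "/blog/wp-content/plugins/foo.php"):
    print(path, rp.can_fetch("SomeCrawler", "http://www.example.com" + path))
```

A check like this takes a minute and avoids the classic disaster of a stray rule blocking the whole blog.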
Calendars can drive bots batty
I’ve found that dynamic calendars are very often the worst culprits for driving search engine bots around the twist. And Krishna’s site hasn’t let me down: within the Learning Zone there is a dynamic calendar. This is just one more reason to keep the bots out of there.
Permalinks
I notice that the crawler came back with a large chunk of default WordPress page URLs. These are the URLs that look like www.mysite.com/?p=1234. Krishna must have changed over to the more search-engine-friendly permalink structure, but not updated all her internal links.
Although there could be quite a bit of work involved, I think it would be worthwhile to fix this issue. I saw some duplicate content arising from the use of both the default and permalink structures. If you are interested in the duplicate content URLs, here’s the full report:
Other thoughts
My eyes are getting a bit weary now, but I have just a couple of other thoughts on Krishna’s blog.
Internal linking can be a great way to help your pages rank well. For a start you can control the anchor text used, and anchors are what give relevancy to the linked material. Google loves anchors, so don’t use ‘click here’ or ‘look at this’ where you could use great descriptive anchors for your links.
I looked through some of Krishna’s posts and the thing that struck me was the lack of links. A great way to keep posts out of the supplemental index AND boost your internal traffic is to cross link in your posts. If you discussed something previously which is related to your current post then link to it. And use good descriptive anchor text in your links. It’s amazing how just one or two good internal links can see pages jump out of the supplemental index.
I hope Krishna has fixed this up
It’s such a pain in the rear when hackers get into your site. And it goes to show that you can never be too careful with the security of your website. Hopefully Krishna either has this sorted or soon will.
And if you want to see a great example of a blog that shows you what on-line marketing is I would strongly advise that you head over to Krishna De’s website.
Great review! Catching hacked sites like that is very hard to do, especially when/if they are fixed and the owner “forgets” to tell you about it. It can have a giant impact on indexing and ranking.
Just something tiny – you might want to change
Disallow: *.csi
to
Disallow: *.csi$
Which would make sure that the “.csi” is at the end of the URL (which would however also still allow dynamic URLs, if you have any, eg “/page.csi?param=value”, I’m not sure if this is the case here). Could allowing other bots to crawl the .csi-pages cause problems in the future? Perhaps it would make more sense to rewrite those URLs so that they can be disallowed on a folder-level.
Comment by JohnMu — April 10, 2007 @ 8:47 am
Richard, a problem I noticed is there is no link to the homepage from the blog. (none that I could find anyway)
The blog calendar is a dreadful idea alright. It generates links in a different format, something like month/week/day so this can result in duplicated content as the content is accessible via an alternative link too.
Comment by Cormac — April 10, 2007 @ 9:53 am
@John – TBH I’m not familiar with that extension and I didn’t have the time to go looking for the files in question (if they even exist on the server). Guess whose application came in handy for some of this?
@Cormac – I hadn’t noticed that the homepage had no internal links pointing at it. Good spot. Calendars are dangerous for SEs.
Comment by Richard Hearne — April 10, 2007 @ 5:54 pm
Richard, John and Cormac – thank you all for your comments.
@Richard – thank you for the feedback on the event – we are working on making the content available as an on-demand webinar as I recorded the audio so we are linking them with the slides which I hope will be of interest to people who could not attend.
@Cormac – thank you for pointing out that there is no link to the home page – the Krishna De site is separate to my corporate website and is supporting me in building my personal brand online. The blog has been a terrific way to enhance my brand online.
Do you have any recommended tools other than the events tool I use on my blog? I have this plugin there as I want an easy way for visitors to see the forthcoming seminars and events I am leading.
@Richard – yes I did change the permalinks to be more SEO-friendly so will need to go back to change them. Any suggestions on how best to do this, or do I need to work through post by post and page by page?
Thank you once again for the review and I have plenty to be working with here to support enhancing search engine ranking so I build an even stronger personal brand online!
Comment by Krishna De — April 12, 2007 @ 8:03 pm
Hi Krishna
I think Cormac was referring to a lack of a link from http://www.krishnade.com/blog to http://www.krishnade.com.
The calendar that I saw causing issues wasn’t on the blog but in the Learning Zone area. I imagine that WP calendar plugins probably use NOFOLLOW (yours seems not to, though). I haven’t used any calendar plugins myself, so I can’t oblige.
I imagine you would have to go back and change those links by hand. At least any links that are within posts.
Rgds
Richard
Comment by Richard Hearne — April 12, 2007 @ 10:26 pm
Richard – yes, I understood the link Cormac mentioned. Thanks for clarifying that it is only the Learning Zone calendar that is the problem.
Comment by Krishna De — April 12, 2007 @ 11:41 pm
Krishna, just to re-clarify – I *think* that the only issue was the learningzone calendar. I’m not so sure about the WP calendar. It may or may not be causing issues. But the priority is to block access to learningzone in your robots.txt.
Rgds
Richard
Comment by Richard Hearne — April 13, 2007 @ 6:50 am