PageRank Flow, Comment Feeds in Supplementals, NoFollow & robots.txt – bifsniff.com
Well there’s no time like the present to try and clear the last of the IBA free site analyses. As they say – better late than never!
Please Note: While the following analysis relates to bifsniff.com many of the suggestions made are highly relevant to any website.
Bifsniff.com – formerly the cartoon guys
I had the fortune to meet Frank a few months back when he came along to the inaugural ShareIT event in Cork. Frank has also been kind enough to let me know every time WordPress has decided to rewrite my .htaccess file (which is more times than I care to mention – damn WP).
Page Navigation
- First port of call – the Canonical URL
- Comment Feeds in the Supplemental Bin
- Why pages go supplemental
- Controlling search engine access to your site
- Don’t channel PageRank to useless pages
- Having unique page titles and descriptions
First port of call – the Canonical URL
The very first thing I check when I go to a site is whether or not it resolves via a canonical URL. This all sounds very technical, but in fact it’s a very simple test:
- In your browser address bar type your website address WITHOUT the www.
- Now do the same thing again, except this time with the www.
*IF* your site appeared at both addresses *AND* the address bar didn’t change from one form to the other (i.e. you typed www.[mysite].[myTLD] and the address did NOT automatically change to [mysite].[myTLD] without the www., or vice-versa) then you are effectively publishing the same site twice. This is known as the “Canonical URL” issue.
You see, if Google can access your site via both the www. and non-www. addresses it sees these as two different sites. Google does a pretty good job of filtering out one or the other from its results, but where this can hurt you is your backlink profile. Say lots of people have linked to your site and those links point at both the non-www. and www. addresses more or less evenly. In that situation you are effectively diluting the link love by splitting it between two sites. Now if you set up a really simple redirect from non-www. to www. (or vice-versa) you’ll effectively double the link love in this example. This could have an effect on how your site ranks overall.
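For reference, here’s roughly what that ‘really simple redirect’ might look like in an Apache .htaccess file (this assumes Apache with mod_rewrite enabled, and mysite.com just stands in for your own domain):

RewriteEngine On
# Send non-www. requests to the www. address with a permanent (301) redirect
RewriteCond %{HTTP_HOST} ^mysite\.com$ [NC]
RewriteRule ^(.*)$ http://www.mysite.com/$1 [R=301,L]

The 301 (permanent) status is the important bit – it tells Google to pass the link love through to the www. version rather than treating the two addresses as separate sites.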
Now if Frank is reading this he’s probably saying ‘tell me something I don’t know’. That’s because Frank mastered this a long time ago. Try typing www.bifsniff.com into your browser. Now take a look at the address bar – no www. there now, is there?
Comment Feeds in the Supplemental Bin
Frank mentioned that he had a lot of pages in the supplemental index. In particular the comment feeds seemed to get supped. This is quite common in fact. The comment feed is generally only linked to from within a post itself, and rarely will you have external links pointing at your comment feed URL.
Curiously this issue has been the focus of quite a bit of discussion (overview here, some more here) in the SEO field recently.
Before we go any further let’s take a step back and look at the problem faced by bifsniff.com. First we need to take a snapshot of the pages indexed in both Google’s main and supplemental indices. The following advanced operator commands will help us:
- Total indexed pages:
site:bifsniff.com
- Pages in supplemental index:
*** -RedCardinalIE site:bifsniff.com
Query 1 gives us the total number of pages indexed (2,060), query 2 the number of pages in the supplemental index (1,180), and the difference between the two (880) is the number of pages in the primary index. The comment feeds are of low value and deservedly end up in the supplemental index. This is quite normal, and generally won’t hurt your site. An argument can be made, however, for trying to reduce the number of pages indexed in order to ensure that Google gets your most important pages into the primary index.
Another Step Back – Why Pages go Supplemental
To better understand why you want to control which pages get indexed you need to know why pages go supplemental. There have been many rumours and myths about this topic. However, Google has recently said on a number of occasions that there is only one reason why a page will end up in the supplemental index – lack of PageRank:
Get more quality backlinks. This is a key way that our algorithms will view your pages as more valuable to retain in our main index.
Source: Adam Lasnik comment here
…the main determinant of whether a url is in our main web index or in the supplemental index is PageRank.
If a page doesn’t have enough PageRank to be included in our main web index, the supplemental results represent an additional chance for users to find that page, as opposed to Google not indexing the page.
Source: Matt Cutts Google Hell post
PageRank is passed from one resource (page) to another via links. The collection of pages that forms any website will therefore have a calculable amount of PageRank to share between those pages. Let’s take a very simple example to show this.
So let’s say that your site has ‘6’ PageRank units to share amongst its pages. All external links point at the homepage only (rarely the case, although the homepage regularly has the highest PageRank). Here’s how the site might look:
The homepage (‘Home’) links to 3 sub-category pages (‘SubCat1’, ‘SubCat2’, ‘SubCat3’). So each of these sub-category pages receives 2 PageRank units (i.e. 6/3) from the homepage. In turn each sub-category page links to the 2 other sub-category pages, 2 inner pages, and back to the homepage (this is a classic ‘silo’ architecture).
- the homepage funnels PageRank to 3 sub-category pages
- each sub-category page funnels PageRank to 2 other sub-category pages, two inner pages, and back to the homepage
- each inner page funnels PageRank to 1 other inner page and back to its parent sub-category page
There are many reciprocal relationships within this very small example, and calculating the actual PageRank in and out of any page can become very complex as the number of pages and links on each page increases (I’m not even going to try).
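For the curious, though, here’s how you might try anyway – a rough sketch in plain PHP of the classic PageRank power iteration, run over the 10-page example above (the page names are made up for the example, and 0.85 is just the textbook damping factor):

&lt;?php
// The 10-page silo from the example: each page lists the pages it links to.
$links = array(
    'Home'    => array('SubCat1', 'SubCat2', 'SubCat3'),
    'SubCat1' => array('SubCat2', 'SubCat3', 'Inner1a', 'Inner1b', 'Home'),
    'SubCat2' => array('SubCat1', 'SubCat3', 'Inner2a', 'Inner2b', 'Home'),
    'SubCat3' => array('SubCat1', 'SubCat2', 'Inner3a', 'Inner3b', 'Home'),
    'Inner1a' => array('Inner1b', 'SubCat1'),
    'Inner1b' => array('Inner1a', 'SubCat1'),
    'Inner2a' => array('Inner2b', 'SubCat2'),
    'Inner2b' => array('Inner2a', 'SubCat2'),
    'Inner3a' => array('Inner3b', 'SubCat3'),
    'Inner3b' => array('Inner3a', 'SubCat3'),
);

$damping = 0.85;                             // textbook damping factor
$pages   = array_keys($links);
$n       = count($pages);
$pr      = array_fill_keys($pages, 1 / $n);  // start every page equal

// Power iteration: each page shares its PageRank equally among its outlinks.
for ($i = 0; $i < 50; $i++) {
    $next = array_fill_keys($pages, (1 - $damping) / $n);
    foreach ($links as $page => $outlinks) {
        $share = $damping * $pr[$page] / count($outlinks);
        foreach ($outlinks as $target) {
            $next[$target] += $share;
        }
    }
    $pr = $next;
}

arsort($pr);
foreach ($pr as $page => $score) {
    printf("%-8s %.4f\n", $page, $score);
}
?&gt;

Run that and you’ll see the sub-category pages end up with the most PageRank (they receive the most internal links), with the inner pages last – and every extra low-value page you add takes its slice from the same fixed budget.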
What should be obvious though is that reducing the number of pages sharing the initial PageRank increases the PageRank available to each page in the smaller set. That in turn may result in some additional pages coming out of the supplemental index and into the primary index.
In bifsniff’s case they have too many pages, and either not enough PageRank to support all those pages or PageRank that is not being channelled optimally to support each page.
Controlling search engine access to your site
One trick here is to specifically exclude pages that you don’t want indexed. In the case of WordPress feeds you can use this useful plugin written by Joost DeValk. The plugin adds a NoIndex tag to your feeds so they will be followed but won’t get indexed.
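If it’s only the comment feeds you want kept out (leaving your main feed indexable), something along these lines might also do it – a hypothetical sketch using WordPress’s conditional tags, not the plugin’s actual code:

&lt;?php
// Hypothetical sketch (e.g. in your theme's functions.php): send a
// noindex header for comment feeds only, leaving the main feed alone.
function rc_noindex_comment_feeds() {
    if (is_feed() && is_comment_feed()) {
        header('X-Robots-Tag: noindex, follow');
    }
}
add_action('template_redirect', 'rc_noindex_comment_feeds');
?&gt;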
In terms of the comment feeds that end up in the supps, it would be as well to just add a NOFOLLOW to the links. This will take a bit of digging into your code, as the link text is generated within /wp-includes/feed.php:
92. function comments_rss_link($link_text = 'Comments RSS', $commentsrssfilename = '') {
93. $url = comments_rss($commentsrssfilename);
94. echo "<a href='$url'>$link_text</a>";
95. }
Line 94 needs to be changed to:
echo "<a href='$url' rel='nofollow'>$link_text</a>";
That should place the required “NOFOLLOW” value into the rel attribute. Those feeds should then no longer be indexed in Google. By changing this code (rather than using Joost’s plugin) you get the benefit of retaining your main feed’s indexation while keeping those pesky comment feeds out of the index.
Don’t channel PageRank to useless pages
When I performed the site: operator command in Google I got the following results:
Here’s the ‘Secret Page’:
That page has a PageRank of 4 and no external links according to Yahoo!. The same goes for the Authors Login page.
I would NOFOLLOW those links (this might require some hard coding hackery) and exclude those pages within my robots.txt file:
User-agent: *
Disallow: /secret/
Disallow: /private-authors-area/
Any other pages that are not adding to the user experience could also be removed in similar fashion.
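Where robots.txt isn’t convenient (or you want the links on a page still followed), a per-page robots META tag can do the same job. A minimal sketch for a WordPress theme’s header.php – the conditions shown are just examples, pick whatever suits your site:

&lt;?php
// Illustrative only: mark low-value pages as noindex while still
// letting the bots follow the links on them.
if (is_paged() || is_date()) {
    echo '&lt;meta name="robots" content="noindex,follow" /&gt;' . "\n";
}
?&gt;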
Having unique page titles and descriptions
I noticed that bifsniff uses an identical META description throughout the site. It is always better to make the META data as unique as possible for each page. There are two benefits here:
- search engines generally use the META description as the snippet, so you should view your META as a call-to-action;
- unique META data *may* assist you when a page is at the margin of being duplicate content.
If you use WordPress there are a number of plugins available that allow you to add unique META descriptions and keywords to each page and post.
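If you’d rather not use a plugin, here’s a rough sketch of the idea for a theme’s header.php – ‘meta_description’ is just a hypothetical custom field name, not a WordPress built-in:

&lt;?php
// Hypothetical sketch: emit a unique per-post META description
// stored in a custom field called 'meta_description'.
if (is_single() || is_page()) {
    $desc = get_post_meta(get_the_ID(), 'meta_description', true);
    if ($desc != '') {
        echo '&lt;meta name="description" content="'
            . htmlspecialchars($desc, ENT_QUOTES) . '" /&gt;' . "\n";
    }
}
?&gt;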
Conclusion
Well hopefully there’s quite a bit for Frank to go on there. There were a few other items that I came across (linking to archives) but I think this post is quite long enough.
Hopefully this post will explain to people how search engines see each site in aggregate and how Google in particular decides which pages to include in the primary and supplemental indices.
If you have any questions, or if any of the above technical issues need de-mystifying, please do leave me a comment below and I’ll try to explain further. (Glances at Aide from www.simplythebest.ie)
Another great article Richard!
Regards,
Martina
Comment by Martina Skelly — June 7, 2007 @ 2:04 pm
Thanks Richard,
that’s a very good point about the Secret Page, and also the Authors page, both password protected. Now, I must find out how to add no-follow to specific pages within WordPress
It might be an interesting experiment to exclude the comment feeds from being indexed and see what impact it might have.
Thanks again for taking this time to go through the site, much appreciated. It’s always good to have fresh eyes on something!
Particularly appreciated as I realise how busy you’ve been of late!
Comment by frankp — June 7, 2007 @ 3:18 pm
ps I never really got to the bottom of why WP rewrites the .htaccess file, but I did discover a plugin that looks after the canonical issue, it resolves to a no www:
http://photomatt.net/2006/06/13/wordpress-no-www/
I think this would probably resolve any issues, but I don’t think I have tried it on a site experiencing the issue yet – I really wanted to figure out what was causing WP to rewrite the file when it really shouldn’t be, but after a certain amount of investigation I got confounded by it!
On BifSniff the htaccess file is simply not writable so it never had that issue, I update the htaccess file by hand. But as you know I have had the issue of WP rewriting the htaccess file on other sites!
Comment by frankp — June 7, 2007 @ 3:22 pm
@Martina – hi and thanks.
@Frank – would you believe that I only re-uploaded my .htaccess file yesterday and the bloody thing has been overwritten today. I’m gonna find what it is that is rewriting the file if it’s the last thing I do [today :)]. Hope there is something of value to you above.
Rgds to both
Richard
Comment by Richard Hearne — June 7, 2007 @ 3:30 pm
Richard
If you change the permissions on the .htaccess so that Apache can’t write to it, your problems will go away
Michele
PS: Nice article
Comment by Michele — June 8, 2007 @ 7:39 pm
Hi Michele
I actually dug into the WordPress Codex and have a fix for the .htaccess problem – extremely simple.
I have a post queued that explains it.
Rgds
Richard
Comment by Richard Hearne — June 9, 2007 @ 8:10 am
Hats off Richard,
Look forward to reading that post!
Comment by frankp — June 9, 2007 @ 3:45 pm
You should nofollow your trackback links too. I hadn’t thought of this and I had used robots.txt to disallow access to those URLs.
On the other hand, read through http://www.seorefugee.com/seoblog/2007/06/04/john-chow-creates-seo-experts/ to cast some doubt on the benefits of excluding these pages.
Comment by Donncha O Caoimh — June 12, 2007 @ 2:50 pm
I almost forgot, instead of modifying a core WP file, change your theme to call get_post_comments_feed_link() instead of comments_rss_link().
Next time you upgrade WordPress you won’t have to worry about losing your valuable custom changes!
Comment by Donncha O Caoimh — June 12, 2007 @ 2:54 pm
Hey Donncha
What does get_post_comments_feed_link() do? And does this function need to be updated to include a rel=”nofollow” attribute?
TBH you could actually give these links a rel=”noindex” so that the bots would still crawl your RSS feeds, just never actually index them. It’s a bit of a toss-up though.
Cheers for contributing.
Rgds
Richard
Comment by Richard Hearne — June 12, 2007 @ 3:03 pm
Edit your comments.php or single.php and change the
to
The text ‘feed’ can be whatever you desire of course.
Hint – if there’s a WP template tag that prints something, then chances are there’s a get_* function that gets that value first
Comment by Donncha O Caoimh — June 12, 2007 @ 3:07 pm
Donncha
I don’t like the idea of disallowing trackbacks and adding nofollow. It does away with the entire “interconnected web” concept in my mind, which is why I am so happy with the way MT handles it.
Michele – who is trying to migrate away from WordPress
Comment by Michele — June 12, 2007 @ 3:16 pm
Michele – I mean to nofollow the /trackback/ link.
Comment by Donncha O Caoimh — June 12, 2007 @ 3:17 pm
Donncha
Ah.
That makes more sense
Michele
Comment by Michele — June 12, 2007 @ 3:20 pm
@Donncha – WP bastardised your html – can you repost pls? (can you <pre> it? [Edit - that doesn't work either])
I read that John Chow story you linked to – the only thing there was that he had gone a little strong with his robots.txt and disallowed most of his site. I think there’s definitely value in removing the comment feeds from the index though.
Rgds
Richard
Comment by Richard Hearne — June 12, 2007 @ 4:05 pm
Richard
The feed maybe, but the comments themselves are golddust
Michele
Comment by Michele — June 12, 2007 @ 4:14 pm
Michele
Yep, just the comment feeds. You’d want to be mad to try to remove the comments (even if you could, which you can’t without removing the post).
All I know is that if you take a look in the supplemental index you’ll regularly find the comment feeds in there. Makes sense, as they have very little linkage to them and largely duplicate the page they come from.
Rgds
Richard
Comment by Richard Hearne — June 12, 2007 @ 4:24 pm
@Donncha – Fatal error: Unknown function: get_post_comments_feed_link() in /home/cardinal/public_html/wp-content/themes/classic/comments.php on line 26
Seems this might be a new function? I’m on 2.1.3 but will probably update to 2.2 now.
Comment by Richard Hearne — June 15, 2007 @ 11:37 am