I’ve seen many people asking about this, and even more who are mystified by the way Google works. Most people don’t understand how and when Google crawls, and assume it’s a secret.
It’s not really that big a secret, but it is a bit tricky to predict when Google will come. Of course, if you deal with this stuff as often as I do, you get used to the schedules. What is a bit confusing, though, is how the crawl schedule changes like hell depending on a million factors Google finds important.
We’ll start off with a bit of information from the Google Webmaster Center. As always, they give us the “follow the guidelines” and “it depends on many things” crap, but a few factors come out as obvious in the process:
- PageRank
- links to a page
- crawling constraints (such as the number of parameters in a URL)
OK, so we know what helps us. PageRank (PR) is the most important, then links, then how easily your site can be crawled (the number of parameters refers to the fact that Google doesn’t like URLs with many query-string parameters, the kind PHP scripts often produce; use mod_rewrite to clean them up). So we have a starting point. But as always Google is cryptic and doesn’t really help… So we move on.
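If you want a rough idea of which of your URLs might run into that parameter constraint, a small sketch like the one below counts the query-string parameters in each URL. The function names and the cutoff of two parameters are my own assumptions for illustration; Google does not publish an exact limit.

```python
from urllib.parse import urlsplit, parse_qsl


def count_query_params(url):
    """Return the number of query-string parameters in a URL."""
    query = urlsplit(url).query
    # keep_blank_values=True so that "?id=&cat=7" still counts as two parameters
    return len(parse_qsl(query, keep_blank_values=True))


def flag_parameter_heavy(urls, max_params=2):
    """Return the URLs with more than max_params parameters.

    max_params=2 is an assumed cutoff, not an official Google figure.
    """
    return [u for u in urls if count_query_params(u) > max_params]


if __name__ == "__main__":
    sample = [
        "http://example.com/index.php?id=3&cat=7&sort=asc",
        "http://example.com/posts/google-crawl/",
    ]
    for url in flag_parameter_heavy(sample):
        print("candidate for a mod_rewrite rule:", url)
```

Anything this flags is only a candidate for a cleaner, rewritten URL; whether it really hurts crawling is something you would have to check in your own logs.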
As early as 2002 people were asking about the Google crawl schedule, and some were guessing at it. However, results were strange, and back then high-PR sites seem to have been crawled a lot more. Many saw Google full crawls at around 1st June, while another person saw one start in May and continue into June. An interesting piece of info was that for large sites Googlebot came in about every three minutes, indexing about 2-10 pages a second, which I feel was a bit of a slurp but was meant to keep some of the strain off the web server. Their discussion goes off-topic from then on, but for the purists, go read…
Our next source is a “For Dummies” book excerpt, which gives us a bunch of terms related to the crawl. In doing research for this I was really surprised to see how little info there is to be found. Then again, it’s not such a hot topic for SEO, though it is somewhat important. They say the deep crawl occurs about every month and that fresh crawls occur randomly. Also, they consider the index static between deep crawls, with the odd in-between updates given by fresh crawls going by the name “everflux”. My opinion comes later.
There’s not much else on the web, except a mention of the Google Dance. I find all these names so amusing, since they don’t really explain the phenomenon and there’s no dancing involved. I guess they got bored of using “crawl” in everything. It’s basically the deep crawl, and we get the info that it usually begins at the end of the month, lasts 3-5 days, and usually updates PR. Also, for the people out there who know how to monitor server logs, the deep crawl uses an IP range of 216.239.46.x, whereas the fresh crawl uses the 64.68.82.x range. At that link above you can also find a so-called Google Dance Tool, which could be useful to see what pages Google finds important and crawls, but you could just use Webmaster Tools for that.
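For those who do watch their server logs, here is a minimal sketch of how the two IP ranges mentioned above could be told apart. It assumes an access log where the client IP is the first field on each line (as in the common Apache log format), and the function names are made up for illustration. The ranges are the ones reported above, so treat them as historical observations rather than anything Google guarantees.

```python
import ipaddress

# Ranges as reported in the post; historical observations, not guaranteed by Google.
DEEP_CRAWL_NET = ipaddress.ip_network("216.239.46.0/24")
FRESH_CRAWL_NET = ipaddress.ip_network("64.68.82.0/24")


def classify_crawl(ip_text):
    """Return 'deep', 'fresh' or 'other' for a client IP given as text."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return "other"  # not a valid IP address
    if ip in DEEP_CRAWL_NET:
        return "deep"
    if ip in FRESH_CRAWL_NET:
        return "fresh"
    return "other"


def classify_log_line(line):
    """Classify one access-log line, assuming the client IP is the first field."""
    fields = line.split()
    if not fields:
        return "other"
    return classify_crawl(fields[0])


if __name__ == "__main__":
    line = '216.239.46.20 - - [12/Apr/2008:06:56:00 +0000] "GET / HTTP/1.1" 200 512'
    print(classify_log_line(line))
```

Matching on IP alone says nothing about whether the visitor really is Googlebot, so this is only a quick way to sort log lines, not a verification.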
Now for my take on the whole thing. I feel that there are not two, but three kinds of crawls.
Firstly, there’s an almost immediate crawl, triggered by pings and links and basically whichever spider Google uses for Google Alerts. That happens at once, and crawls the title and the post, but does not index it. It only notices it’s there. Then, in a few days to a week, the post becomes fully indexed and starts showing up in Google results (in quite a high position at first, then gradually lower if no further activity on that post, or no search activity for that keyword, is detected).
The next kind of crawl is a longer-term crawl, which usually includes the homepage, and is done every week, or two weeks, or even every month for less active sites. This updates the cache on your active pages, but doesn’t touch the others.
And the last kind of crawl happens about three or four times a year, and reindexes everything. This usually happens in February or March, June, and November, or in some cases any other month. Google tends to vary this stuff, presumably due to factors on and off the site. So be prepared for a couple of crawls this year, at the beginning of June and in mid-November or so, and see if it happens as I’ve predicted.
One more thing: an important factor in crawling is the kind of server you are hosted on. Use GoDaddy or any other established host rather than hosting on your old machine, so Google can download the data properly. The crawl intensity depends a lot on that. Also, Google does not have the same schedule as, for example, Yahoo. Yahoo performed a deep crawl of my site a few days ago, whereas Google didn’t. So if you’re interested, here’s a pretty graph to ogle at (not much data yet, but still representative):
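If you’d rather check a prediction like this against your own access log than take my word for it, something along these lines would count crawler hits per day, so a full crawl should show up as a spike. This is only a sketch: it assumes a combined-format log with a date like [12/Apr/2008:...] and the user agent somewhere on the line, and it matches the user-agent tokens “Googlebot”, “Slurp” (Yahoo) and “msnbot” (MSN) as plain substrings. The function name is hypothetical.

```python
import re
from collections import Counter

# User-agent tokens mapped to the engine they belong to.
CRAWLERS = {"googlebot": "Google", "slurp": "Yahoo", "msnbot": "MSN"}

# Picks the date part out of a timestamp such as [12/Apr/2008:06:56:00 +0000]
DATE_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}):")


def daily_crawler_hits(lines):
    """Count hits per (date, engine) pair across access-log lines."""
    hits = Counter()
    for line in lines:
        match = DATE_RE.search(line)
        if not match:
            continue  # no timestamp, skip the line
        lowered = line.lower()
        for token, engine in CRAWLERS.items():
            if token in lowered:
                hits[(match.group(1), engine)] += 1
                break  # count each line once
    return hits


if __name__ == "__main__":
    sample = [
        '66.249.0.1 - - [01/Jun/2008:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
        '66.249.0.1 - - [01/Jun/2008:10:00:05 +0000] "GET /about HTTP/1.1" 200 300 "-" "Googlebot/2.1"',
        '72.30.0.1 - - [01/Jun/2008:11:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Yahoo! Slurp"',
    ]
    for (day, engine), count in sorted(daily_crawler_hits(sample).items()):
        print(day, engine, count)
```

User agents can be faked, so the counts are a rough guide at best, but a few months of them would be enough to see whether the big crawls land where I expect.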

Green is Yahoo, blue Google, and that other thing MSN. And with this I must end this post. Enjoy
Also, for more info about Google crawling, check out this older post called Google secrets: How to speed up Google Crawl Rate.


April 12th, 2008 at 6:56 pm
I think this is a good piece of information to have, as it helps the SEO consultant plan for results as they work toward getting higher rankings for a site. Given the time factor, one could definitely plan out marketing. So, a good article with many links. One other thing: it’s really hard to tell where Googlebot will go next, but where it’s been is of course a good indicator.
April 12th, 2008 at 7:34 pm
@ Anonymous:
This is most useful when updates to a website are slow, or when you’re trying to sync up an SEO campaign to a date and don’t want Google to miss out on your efforts. However, in most cases the fresh crawls take care of most content, and the deep crawls are only for seldom-modified stuff (like contact info perhaps). Having a sitemap could solve that as well.
April 13th, 2008 at 5:01 pm
Thanks for the read. I know that a lot of this is observed behavior (as you stated) so each can draw their own conclusions.
I do see a sympathetic trend in activity in your chart with the three different engines. But, as you also stated, there is not a long enough history to draw accurate conclusions.
April 13th, 2008 at 6:30 pm
Yes, it is rather empirical, I would need at least one year of recorded data to put out an actual theory, memory tends to deceive you after a while…
what I have seen to link up perfectly with the crawl is traffic. Google seems to know when you get more traffic and syncs up crawling with that. For me it’s weekends when fewer people come (and therefore I don’t post anything), and it matches the reduced search crawl activity.
glad you liked it and thanks for the comment
April 14th, 2008 at 12:09 pm
To add a bit of extra information: our primary website received its first PR update from Google on 1st February ‘07 (it was less than 1 year old at the time) and then had its second PR update on 1st Feb ‘08.
April 14th, 2008 at 8:06 pm
thanks a lot for the extra input Dan, I always like your comments
it’s interesting about 1st february, I personally run a couple of sites and they had a full crawl in mid January… Nothing regarding PR though, since they’re pretty new (one is really old but has had almost zero traffic).
April 23rd, 2008 at 7:30 am
The stats you posted look pretty interesting. Which software did you use to render the graph?
April 23rd, 2008 at 7:40 am
@ Jon Downes:
i used a wordpress plugin which shows you crawls. i think it’s something along the lines of wp crawl stats.
April 29th, 2008 at 1:58 pm
Is there any tracking software that will detect when you have been crawled? (not for blogs, but for static sites)
Was told that the bot usually enters your site through the same page, and that frequent updates of that page will make the bot come more often?
And also, if I have an .xml sitemap, do I need to have a html sitemap as well? (For indexing purposes)
Thanks,
Eva
April 29th, 2008 at 5:40 pm
well, there probably are, google for them
the bot usually enters the site through the homepage to gather links, but otherwise it enters through whichever page it’s crawling at the time. and yes, if the homepage is updated often your crawl rate is more likely to grow.
technically it’s a good practice to have both kinds of sitemaps (so you reduce the number of links on the homepage yet still have a way for google to naturally navigate your site - and let’s not forget Google is not the only search engine out there…)
May 7th, 2008 at 3:19 pm
This is great, glad to know these tips. Keep up the good work!