Crawler-Based Search Engines
Crawler-based search engines, such as Google, create their listings automatically. They “crawl” or “spider” the web, then people search through what they have found.
If you change your web pages, crawler-based search engines eventually find these changes, and that can affect how you are listed. Page titles, body copy and other elements all play a role.
A human-powered directory, such as the Open Directory, depends on humans for its listings. You submit a short description to the directory for your entire site, or editors write one for sites they review. A search looks for matches only in the description submitted.
Changing your web pages has no effect on your listing. Things that are useful for improving a listing with a search engine have nothing to do with improving a listing in a directory. The only exception is that a good site, with good content, might be more likely to get reviewed for free than a poor site.
The Parts of a Crawler-Based Search Engine
Crawler-based search engines have three major elements. First is the spider, also called the crawler. The spider visits a web page, reads it, and then follows links to other pages within the site. This is what it means when someone refers to a site being “spidered” or “crawled”. The spider returns to the site on a regular basis, such as every month or two, to look for changes.
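The visit, read, and follow-links cycle can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the SITE dictionary, its page paths, and the crawl function are hypothetical stand-ins for real pages fetched over the network.

```python
from collections import deque

# Hypothetical in-memory "site": each path maps to (page text, links on that page).
SITE = {
    "/": ("Welcome to our stamp site", ["/history", "/collecting"]),
    "/history": ("The history of stamps", ["/", "/collecting"]),
    "/collecting": ("Stamp collecting tips", ["/"]),
    "/orphan": ("No page links here", []),
}

def crawl(start):
    """Visit a page, read it, then follow its links to other pages in the site."""
    seen = {start}
    queue = deque([start])
    pages = {}
    while queue:
        path = queue.popleft()
        text, links = SITE[path]
        pages[path] = text              # "read" the page
        for link in links:              # follow links that were not visited yet
            if link in SITE and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

found = crawl("/")
print(sorted(found))
```

Note that "/orphan" is never reached, because no other page links to it. A spider that only follows links can only find pages that something points to.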
Everything the spider finds goes into the second part of the search engine, the index. The index, sometimes called the catalog, is like a giant book containing a copy of every web page that the spider finds. If a web page changes, then this book is updated with new information.
Sometimes it can take a while for new pages or changes that the spider finds to be added to the index. Thus a web page may have been “spidered” but not yet “indexed”. Until it is indexed – added to the index – it is not available to those searching with the search engine.
Search engine software is the third part of a search engine. This is the program that sifts through the millions of pages recorded in the index to find matches to a search and rank them in order of what it believes is most relevant.
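To make the difference between “spidered” and “indexed” concrete, here is a minimal sketch of the second and third parts working together. All names (index, pending, add_spidered, flush, search) are made up for this example, and the ranking is a plain word count chosen only to keep the sketch short.

```python
index = {}      # pages available to searchers: path -> text
pending = {}    # pages that were spidered but are not yet indexed

def add_spidered(path, text):
    """Record a page the spider found; it is not searchable yet."""
    pending[path] = text

def flush():
    """Move spidered pages into the index, replacing any older copies."""
    index.update(pending)
    pending.clear()

def search(query):
    """Return indexed pages containing every query word, best match first."""
    words = query.lower().split()
    results = []
    for path, text in index.items():
        tokens = text.lower().split()
        if all(w in tokens for w in words):
            score = sum(tokens.count(w) for w in words)
            results.append((score, path))
    # Highest score first; ties are broken by path so the order is stable.
    results.sort(key=lambda r: (-r[0], r[1]))
    return [path for _, path in results]

add_spidered("/a", "stamp collecting for beginners")
print(search("stamp"))   # empty: the page is spidered but not yet indexed
flush()
add_spidered("/b", "stamp stamp prices")
flush()
print(search("stamp"))   # both pages are now searchable
```

Until flush runs, a spidered page sits in pending and no search can find it, which mirrors the delay described above.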
Major Search Engines: The same, but different
All crawler-based search engines have the basic parts described above, but there are differences in how these parts are tuned. That is why the same search on different search engines often produces different results.
Now let’s look more closely at how crawler-based search engines rank the listings that they gather.
How Search Engines Rank Web Pages
Search for anything using your favorite crawler-based search engine. Nearly instantly, the search engine will sort through the millions of pages it knows about and present you with ones that match your topic. The matches will even be ranked, so that the most relevant ones come first.
Of course, the search engines don’t always get it right. Non-relevant pages make it through, and sometimes it may take a little more digging to find what you are looking for. But, by and large, search engines do an amazing job.
As WebCrawler founder Brian Pinkerton puts it, “Imagine walking up to a librarian and saying, ‘travel.’ They are going to look at you with a blank face.”
OK, a librarian is not really going to stare at you with a vacant expression. Instead, they are going to ask you questions to better understand what you are looking for.
Unfortunately, search engines don’t have the ability to ask a few questions to focus a search, as librarians can. They also can’t rely on judgment and past experience to rank web pages, in the way humans can.
So, how do crawler-based search engines go about determining relevancy, when confronted with hundreds of millions of web pages to sort through? They follow a set of rules, known as an algorithm. Exactly how a particular search engine’s algorithm works is a closely kept trade secret. However, all major search engines follow the general rules below.
Location, Location, Location… and Frequency
One of the main rules in a ranking algorithm involves the location and frequency of keywords on a web page. Call it the location/frequency method, for short.
Remember the librarian mentioned above? They need to find books to match your request of “travel”, so it makes sense that they first look at books with travel in the title. Search engines operate the same way. Pages with the search terms appearing in the HTML title tag are often assumed to be more relevant than others to the topic.
Search engines will also check to see if the search keywords appear near the top of a web page, such as in the headline or in the first few paragraphs of text. They assume that any page relevant to the topic will mention those words right from the beginning.
Frequency is the other major factor in how search engines determine relevancy. A search engine will analyze how often keywords appear in relation to other words in a web page. Pages with a higher frequency are often deemed more relevant than other web pages.
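As a rough illustration of the location/frequency method, the sketch below scores a page for a query. It is a toy, not any search engine’s real formula: the function name, the weights (3 for a title match, 2 for a match near the top, 10 times the keyword’s share of the body) and the 50-word “top of page” window are all assumptions picked for the example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def location_frequency_score(query, title, body, top_words=50):
    """Toy relevance score built from keyword location and frequency."""
    terms = tokenize(query)
    title_tokens = tokenize(title)
    body_tokens = tokenize(body)
    top = body_tokens[:top_words]
    score = 0.0
    for term in terms:
        if term in title_tokens:
            score += 3.0        # location: keyword appears in the title
        if term in top:
            score += 2.0        # location: keyword appears near the top
        if body_tokens:
            # frequency: how often the keyword appears relative to all words
            score += 10.0 * body_tokens.count(term) / len(body_tokens)
    return score

page_a = location_frequency_score(
    "travel", "Travel guide", "Travel tips for budget travel in Europe")
page_b = location_frequency_score(
    "travel", "My blog", "Notes on cooking and sometimes travel")
print(page_a > page_b)
```

The first page scores higher because “travel” is in its title and makes up a larger share of its text.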
Spice in the Recipe
Now it’s time to qualify the location/frequency method described above. All the major search engines follow it to some degree, in the same way cooks may follow a standard chili recipe. But cooks like to add their own secret ingredients. In the same way, search engines add spice to the location/frequency method. Nobody does it exactly the same, which is one reason why the same search on different search engines produces different results.
To begin with, some search engines index more web pages than others. Some search engines also index web pages more often than others. The result is that no two search engines have exactly the same collection of web pages to search through. That naturally produces differences when comparing their results.
Search engines may also penalize pages or exclude them from the index if they detect search engine “spamming”. An example is when a word is repeated hundreds of times on a page to increase the frequency and propel the page higher in the listings. Search engines watch for common spamming methods in a variety of ways, including following up on complaints from their users.
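One simple way to picture such a check is to flag a page when a single word takes up an unusually large share of its text. The function name and both thresholds below are assumptions made for this sketch; real search engines do not publish how they detect spam.

```python
import re
from collections import Counter

def looks_like_keyword_stuffing(text, max_share=0.3, min_words=20):
    """Flag a page when one word makes up a suspiciously large share of it."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < min_words:
        return False            # too little text to judge
    _, count = Counter(words).most_common(1)[0]
    return count / len(words) > max_share

stuffed = "stamps " * 200 + "buy here"
normal = ("The spider visits a web page, reads it, and then follows links "
          "to other pages within the site before returning later to look "
          "for changes")
print(looks_like_keyword_stuffing(stuffed))
print(looks_like_keyword_stuffing(normal))
```

The stuffed page is flagged and the ordinary sentence is not. A check this crude would need tuning before use, since very common words can dominate short pages.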
Off the page factors
Crawler-based search engines have plenty of experience now with webmasters who constantly rewrite their web pages in an attempt to gain better rankings. Some sophisticated webmasters may even go to great lengths to “reverse engineer” the location/frequency systems used by a particular search engine. Because of this, all major search engines now also make use of “off the page” ranking criteria.
Off the page factors are those that webmasters cannot easily influence. Chief among these is link analysis. By analyzing how pages link to each other, a search engine can both determine what a page is about and whether that page is deemed to be “important” and thus deserving of a ranking boost. In addition, sophisticated techniques are used to screen out attempts by webmasters to build “artificial” links designed to boost their rankings.
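The idea that links signal importance can be sketched with a simplified, PageRank-style calculation, in which each page repeatedly passes a share of its score to the pages it links to. This is one well-known approach to link analysis, used here only as an illustration; the text above does not say which method any given engine uses, and the function name, damping value and iteration count are assumptions.

```python
def link_scores(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to. Returns page -> score."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    scores = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small base score.
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score among the pages it links to.
                share = damping * scores[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                for p in pages:
                    new[p] += damping * scores[page] / n
        scores = new
    return scores

scores = link_scores({"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]})
print(max(scores, key=scores.get))
```

Page “c” comes out on top because three pages link to it, and “a” benefits in turn because the important page “c” links to it.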
Another off the page factor is click-through measurement. In short, this means that a search engine may watch which result someone selects for a particular search, then eventually drop high-ranking pages that aren’t attracting clicks, while promoting lower-ranking pages that do pull in visitors. As with link analysis, systems are used to compensate for artificial clicks generated by eager webmasters.
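A minimal sketch of click-through reranking follows. Pages that have been shown often enough are reordered by the fraction of views that led to a click, while pages with too little data keep their current positions. The function name, the stats layout and the 100-view threshold are assumptions, and the sketch makes no attempt to filter out artificial activity.

```python
def rerank_by_clicks(ranked, stats, min_views=100):
    """ranked: pages in current order. stats: page -> (views, clicks)."""
    def click_rate(page):
        views, clicks = stats[page]
        return clicks / views

    # Only pages shown at least min_views times have a reliable click rate.
    measured = [p for p in ranked if stats.get(p, (0, 0))[0] >= min_views]
    by_rate = iter(sorted(measured, key=click_rate, reverse=True))
    # Measured pages trade places among themselves; other pages stay put.
    return [next(by_rate) if p in measured else p for p in ranked]

ranked = ["a", "b", "c", "d"]
stats = {"a": (1000, 10), "b": (1000, 200), "c": (5, 5), "d": (500, 150)}
print(rerank_by_clicks(ranked, stats))
```

Here “a” drops from first place because few searchers click it, “d” rises, and “c” stays where it is because five views are not enough to judge.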
Search Engine Ranking Tips
A query on a crawler-based search engine often turns up thousands or even millions of matching web pages. In many cases, only the 10 most “relevant” matches are displayed on the first page.
Naturally, anyone who runs a website wants to be in the “top ten” results. This is because most users will find a result they like in the top ten. Being listed 11 or beyond means that many people may miss your web site.
The tips below will help you come closer to this goal, both for the keywords you think are important and for phrases you may not even be anticipating.
For example, say you have a page devoted to stamp collecting. Anytime someone types “stamp collecting”, you want your page to be in the top ten results. Those words are then your target keywords for that page.
Each page in your web site will have different target keywords that reflect the page’s content. For example, say you have another page about the history of stamps. Then “stamp history” might be your keywords for that page.
Your target keywords should always be at least two words long. Usually, too many sites will be relevant for a single word, such as “stamps”. This competition means your odds of success are lower. Don’t waste your time fighting the odds. Pick phrases of two or more words, and you will have a better shot at success.