

This article criticizes search engines and especially web robots. It exposes some problems with search engines and the inadequate ways those problems are treated. While not technical, it presupposes some knowledge of search engines.
Very often you hear people say: "The Internet is the global network of resources, the ultimate source of information". Well, that might be true, but anyone who has tried to search for a particular topic using a search engine will know that we sometimes spend up to half of our time simply trying to find what we need.

Search Engines: Critique

Introduction

I hate search engines. I didn't hate them at first, but now I definitely hate them. I hate search engines and people who created search engines and people who designed search engine sites.

Web search engines should have made our lives easier. You go to the Infoseek website, enter your search string and voila - the list of all the pages on the topic "magically" appears before your eyes. But in practice you search for "web design links" and the best a search engine will come up with is 50 web design companies, 150 Amazon books, 30 web promoters, 1000 spammers and 3 design-links pages that were closed a year ago.

All right, I may be too harsh; if you are an experienced "surfer", you will find what you need after 3-4 tries. But even though you might find what you need, the site you've found is not necessarily the best the Internet has to offer. You've found it simply because it was first in the result list.

The first article in the "Promotion" section talks extensively about search engines and META tags and gives some advice. But is this advice useful in practice? The rest of this article will try to explain how search engines work, why they do not work and how submissions are ranked. It should be read after the "Web Promotion" article in this section, or by people who already have some experience with search engines. If you need a good introduction to search engines, a very good information source is http://www.searchenginewatch.com.

Part 1: How search engines work

All right, you should already know that there are two types of search engines: web robots and directory listings. Web professionals do not like calling directory listings search engines, but common people do, and so will I. Yeah, it feels so good to be in control.

A web robot is a program that accesses your page and "traverses" (well, the closest alternative to this word is "analyses") it. There are two steps in the analysis: first, some words are entered into a special index; second, any links you might have on your page are also traversed. So if your site consists of 100 pages, all linked through a main page, in theory a web robot will create an index for every page on your site. You only have to "ask" a robot to check your main page; the other pages will be indexed automatically.
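
To make this concrete, here is a minimal sketch of that two-step traversal in Python. It is only an illustration of the idea described above: the page limit, the same-site check and the fact that it stores raw HTML instead of a proper word index are my own simplifying assumptions, not how any real robot works.

    # A minimal, illustrative crawl loop: start from the main page, index it,
    # and follow every link found on it. Only a sketch of the idea above,
    # not how any real search engine robot works.
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collects the href of every <a> tag on a page."""

        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, limit=100):
        """Traverse a site page by page, starting from its main page."""
        site = urlparse(start_url).netloc
        queue, seen, index = [start_url], set(), {}
        while queue and len(seen) < limit:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            html = urlopen(url).read().decode("utf-8", errors="ignore")
            index[url] = html  # step 1: a real robot stores words, not raw HTML
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:  # step 2: traverse the links as well
                absolute = urljoin(url, link)
                if urlparse(absolute).netloc == site:  # stay on the same site
                    queue.append(absolute)
        return index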

What is "indexed"? Well, some people don't realise it, but web robots create an index of EACH page they traverse. For example, you have a page on old vinyl. A robot will visit your page, read it all and store some of the words from your page as an index. Of course, an index should not be too large, no more than 2 KB or so. It makes sense: if the index were any larger, the search engines would effectively have to keep the entire Internet on their disks.
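
To make the size limit concrete, here is a toy "indexer" that boils a page down to a short list of its most frequent words and stops at roughly 2 KB. The stop-word list, the word-frequency approach and the exact cap are guesses of mine for illustration; no engine publishes how it really chooses the words.

    # A toy "indexer": keep only the most frequent words of a page and stop
    # once the stored index reaches roughly 2 KB. Purely illustrative.
    import re
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

    def build_index(page_text, max_bytes=2048):
        words = re.findall(r"[a-z]+", page_text.lower())
        counts = Counter(w for w in words if w not in STOP_WORDS)
        index, size = [], 0
        for word, _ in counts.most_common():
            if size + len(word) + 1 > max_bytes:
                break  # the index has to stay small
            index.append(word)
            size += len(word) + 1
        return " ".join(index)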

Note. Some search engines claim that they store the entire text of all the web pages they index. This might be true (why should they lie?), but the engines that claim this usually have some kind of "discrimination" function, storing only main pages, or only reviewed pages, etc.

So a search engine is a collection of these small indexes for millions of pages. But how on earth can a robot decide which words to include in an index and which to throw away? Well, this is the problem with robots. Isaac Asimov led us to believe that robots are intelligent beings, capable of running life for humans. But web robots are stupid. A web robot is simply a small program, not capable of any intelligence. It is a disgrace to all future robots to have parents like that!

And this is where all the problems start. We know that web robots are dumb. Programmers know that their programs are dumb. And dumb creatures can be fooled. And indeed they are. For example, suppose we have two identical pages on web promotion, mine exactly like yours, only I've added 1000 sentences with the words "web promotion" at the end. Which page has the better chance of being displayed first when a person searches for "web promotion"?

Can I tell you exactly how web robots index pages? Well, no. Programmers do not want to give us the exact algorithms, because this would lead to further abuse by people willing to spend their time studying the code. We end up in a Catch-22 situation: we have to know how web robots work to be able to provide better matches, but as soon as we know how the robots work, they get abused and the best matches are again a mile off the topic you want.

There are, of course, some rules that are well known. Smaller pages are given preference over longer pages with the same content. Pages with the search string in the title are given a higher "relevance rating" (brought closer to the top of the search result list). Some search engines increase the relevance rating of popular pages (pages that are requested more often), so new pages fall into an "anonymity trap": new pages need more exposure, but they get less, because they are not yet popular.
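
If you want to picture how such rules might combine, here is a toy relevance score in the same spirit. The weights and the formula are pure guesswork on my part; the real algorithms are kept secret, which is rather the point of this article.

    # A toy relevance rating reflecting the well-known rules above: a hit in
    # the title counts for a lot, occurrences in the text count for a little,
    # smaller pages score higher, and a popularity bonus keeps established
    # pages on top (the "anonymity trap"). All weights are invented.
    def relevance(query, title, body, hits_last_month):
        q = query.lower()
        score = 0.0
        if q in title.lower():
            score += 10.0                           # search string in the title
        score += 2.0 * body.lower().count(q)        # search string in the text
        score += 1000.0 / (len(body) + 1000.0)      # shorter pages are preferred
        score += min(hits_last_month / 100.0, 5.0)  # popularity bonus
        return score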

Because search engines cannot intelligently create a search index, there is a way to create it yourself: META tags. You should already know what META tags are; if you don't, read the "Web Promotion" article. All right, so you create a search index. It is an improvement, but it makes search engines even more open to abuse. Before META tags, in order to rank high in a search for the keyword "market economies", you had to use the phrase repeatedly throughout the text. Now you simply provide it as a keyword. And spammers (nasty people, who would kill their mother to increase their search engine rating) know this only too well.
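
To see why the keyword tag invites abuse, here is a toy example of a robot taking the author's word for it. The page and the crude regular expression are invented for illustration; real robots parse pages properly, but the principle is the same: whatever you put in the tag goes straight into the index, no repetition in the text required.

    # A toy illustration of why META keywords invite abuse: the robot simply
    # takes the author's word for it. The page below is invented.
    import re

    page = """<html><head>
    <title>Market economies explained</title>
    <meta name="keywords" content="market economies, economics, stock market">
    </head><body>An essay that never repeats its own keywords.</body></html>"""

    match = re.search(r'<meta name="keywords" content="([^"]*)"', page, re.IGNORECASE)
    keywords = [k.strip() for k in match.group(1).split(",")] if match else []
    print(keywords)  # ['market economies', 'economics', 'stock market']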

Of course, a human-made search index is better. It might be that in my article on web design I never even say "web design", because it should be obvious to everyone. But if I relied on a computer to create the index, I would have to use this "dumb" style of writing, e.g. "This is a web design guide. I can also say it is an HTML tutorial. If you're interested in Hypertext Mark-up Language you should read this". Now you know why some web pages are written in a very peculiar style.

Part 2: Where search engines fail

You might feel very happy, now that you know how search engines work. Of course, you don't know how they really work, but you know how to create an index, so you might think that this will be the end of your problems.

Alas, this is not so. I've spent, in total, more than 40 hours trying to understand how search engines work and I still haven't the faintest idea. Sometimes you're just lucky: you make a page, create an index, submit the page and... you get 1000 hits a day, because yours is the first page displayed when someone types "Spice Girls".

But usually the results are not that great. To give a practical example, I submitted the webIGN site to the major search engines by hand. I wrote an extensive keyword list of about 1000 characters. After the submission, I was unable to find my page using the keywords I had provided.

So what did I do wrong? Well, web design is quite a popular area, so there is healthy competition. Suddenly, a very bright idea landed within reach of my brain: I have to see how the competitors do it. So I went and searched for "web design guide". The first two entries were dead links (!), but the third was a working site. This site had been around for only a couple of months, a bit older than webIGN, so I took a look at the document source and noted that the title of the site was "A web design guide" and there was a keyword "a web design guide". So, fear no more, I thought. Just mimic the style of this page on my own page, and it will be given the #3 or even the #1 position.

Well, I submitted my newly updated page to the search engine (HotBot), waited a couple of days and finally received an e-mail saying that my page had been stored in the HotBot index. At that point I could smell victory. Finally, millions would see me! To my surprise, when I searched for "a web design guide", I found that the result list had not changed a bit. In fact, the only way I could find my page at all was by searching for "webign". That will teach me to call the site by a proper name.

So even if you take a look at competitors (and I suggest you do, because you will be surprised how much looking at similar sites helps you improve your own), you are not guaranteed the top place in the list of sites.

The only understanding I have gained is that it is very difficult to get "recognition" in well-established and popular topics. If your page deals with "Russian pre-revolution stamps", you will have a lot more exposure. But as soon as a topic becomes popular, it gets so hot in there that the search engines do not let any new entries through to the top.

Part 3: Tell me why, tell me why

Well, an impressive end to this guide would be an answer to why some pages are treated more favourably than others. One search engine (I think it was Alta Vista) displayed some rubbish links when I searched for "web design guide", even though my page had "web design guide" in the title, in the text and in the keywords. I guess programmers aren't telling us the whole truth, then. It might be that established sites are treated preferentially; or, through bad luck, your page might not have been indexed the way it should have been; or search engines might have a degree of randomness in them, displaying sites at random. The truth is, I don't have any idea. When reason fails, people start thinking magic.

I wrote this article a couple of weeks ago and have since spent some more time trying to figure out what could be going wrong. Well, I have amassed several ideas. I will list the problems and workarounds below.

First, to see how your web site is scoring with the search engines, you can use position analysers. There are several such analysers; most are listed on www.searchenginewatch.com. There is also standalone software that you can use from home; it is quite slow, but you can buy it or download a trial version from www.webposition.com.

Second, when you decide to "mimic" the high-flyers, you have to keep in mind that the index of a web page stored in a search engine's memory might be outdated, i.e. the web page in question has changed since it was indexed. So even if you copy the whole web page and submit yours, you may not score high, because the original has since changed. Most search engines claim that their index is updated bi-weekly, i.e. all the web pages in the index are re-indexed every two weeks. This should keep the index up to date and should get rid of "dead" links. However, these claims are usually not true: the index is updated less often.

Third, you might be a victim of URL discrimination. Some URLs are blacklisted and are either not accepted or accepted with a low-priority flag. E.g., my tripod.com domain makes me a rogue trader in the eyes of most search engines, and hence I will seldom get any listing from a search engine at all.

Fourth, remember the power of complaint. If something goes wrong and you are unsure why, just write to an administrator. You won't believe how willing people are to help you solve your problems, give advice, or even give your application preferential treatment. But don't abuse this. Administrators are people too, and hell knows how many letters they get. Try to be professional and state your problem clearly... oh well, all right, this is not a letter-writing tutorial.

Ok, I am off to find the meaning of life and I will opt to use Yahoo for this one.


©1997 by Repfect Drug design studio. All Rights Reserved.