Is Google=Internet?

Searching the Web is easy. The difficult part is finding what you are looking for. While the search engines, and mainly Google, have done miracles in making millions of pages accessible, finding what you need when you need it is not always an easy task. And the truth is that the ease with which Google lets us retrieve most of the information we need has made us too lazy to search in depth when the stuff we are looking for is not right at the top of the first ten search results.

Also, as my experience shows, the majority of users rely heavily on Google alone: if something cannot be found via Google, then it simply does not exist, which is certainly not so. When Google cannot immediately retrieve a piece of information, most people just conclude that only what is retrieved exists, and give up. I have often been in a similar situation: no matter how sophisticated my queries are, I cannot find what I need, yet I am sure that it exists. Over the years I have developed a habit of keeping an archive of important URLs and files that I stumble upon by chance and might need some time in the future, but I am clever enough to know that even if I could make my own personal mirror of the Net, it could hardly be more useful than Google.

Also, I am aware of the way search engines work, and despite their revolutionary achievements I do not expect them to be perfect. I know that even the most powerful search engines cannot index every single page on the Web and include it in their databases. And since search results display only information that has been indexed and stored in the search engine’s database, if my stuff is not indexed, I stand no chance of finding it at all. I have already learned that search engines might be the easiest way to search the Web, but they are certainly not the only one. Besides, sometimes (for very specific searches) the major search engines just waste my time and drown me in so much irrelevant information that I regret having used them and resort to alternative means to make my way through the Invisible Web.

The Invisible Web

The Invisible Web is a term I like very much because it describes precisely the situation when I know some piece of information is on the Web but I cannot see it. It is that vast part of the Net that search engines do not reach (for various reasons, which I explain below) but that can still be accessed in other ways.

Maybe it is necessary to explain that not all pages that cannot be retrieved through the search engines belong to the Invisible Web. For instance, the Opaque Web and the Dark Web are two other places that are hidden from the world. The Opaque Web consists of files that are not linked from other resources, so they cannot be reached by following links. The Dark Web is deliberately invisible: corporate networks, sites with special membership, and other similar places that do not welcome strangers. To get to Opaque and Dark Web sites, you need to know their URL in advance (for instance from a friend of yours) and, if necessary, to have a user name and a password.

The search ideas I am going to give you in the next sections apply to the Invisible Web only and are unlikely to give results for the intentionally hidden parts of the Web. But even the Invisible Web alone is a pretty vast space. It is estimated that it is up to 500 (yes, five hundred) times the size of the Surface Web (the part that is indexed by search engines) and the tendency is that the Invisible Web will grow both as a percentage and in absolute figures. And what is more, really valuable stuff is hidden in its debris.

What Is Hidden in the Debris of the Invisible Web?

The short answer is – many essential items are hidden in the debris of the Invisible Web. It is true that the information there might not be interesting for everybody but if you are looking for a very special piece of information, no matter what topic or area, it is quite probable that it is buried on some other site together with many other topics of interest to you. Most often the stuff that cannot be found via the general search engines (but is accessible by other search means) is like the following:

  • Dynamic, database-driven sites that are publicly accessible but whose content search engines often skip, for technical reasons, when indexing the Web.
  • Archives of articles in online journals and magazines
  • Specialized databases that are not of interest to the general public – medical, scientific, legal, etc.
  • Different catalogs – of products, of libraries, etc.
  • News and newsgroup postings – when I search the Net I very often encounter newsgroup postings from five or more years ago, but when I search for recent ones, the “deliverables” of the search engines are far from satisfactory.
  • Legal and administrative information (court records, patents and trademarks information) that is available on request
  • Classifieds and advertisements, Yellow and White Pages
  • Stuff that search engines deliberately exclude – for instance files with particular extensions, data that is regarded as private, or content that the owners of the site have explicitly asked to be removed from the search engine’s index.

Alternatives to Google for Searching and Being Found

It is a fact that Google has been doing so much to make information accessible but monopoly has never been good. So, the first thing you can do, if you cannot find the stuff you need via Google, is to try a different search engine. Since the databases of the search engines differ, it is likely that if you cannot find it with Google, you might be able to find it with Yahoo, MSN, Altavista, or another search engine. Even if this does not lead to the desired result, do not give up – there are many other tools to use when hunting for information.

Well, the alternatives that exist might not be as easy as a Google search and they might remind you of the pre-Google times but the results that you can retrieve through them can be very rewarding. The alternatives (both for users and for site owners) include:

  • Specialized search engines
  • Specialized search directories
  • Meta search engines
  • Invisible Web databases
  • Specialized portals
  • Reference libraries

The list is by no means complete but it gives you an idea of where you can go to search for the stuff you need and where to submit your site, if you are a site owner. For both categories, users and site owners, relying solely on Google is not a viable idea. For users it means that vast amounts of information are practically inaccessible. For site owners, relying only on Google to generate traffic (unless you manage to top the search results for many keywords, but that is a different topic) is pretty risky: you never know when Google will change its algorithm for calculating relevancy and you might drop from the top. So in both cases you need to know about the alternatives you have.

Why Google ≠ Internet

As I have already said, it is a common delusion that Google=Internet. No, it is not and it will never be. There are probably hundreds of reasons why this is so but I hope that even some of them are convincing enough:

  1. Search engines and their indexing algorithms might be very powerful but it will never be possible to include in their databases every single page that exists on the Web, if only because new pages appear every instant while search engine crawlers do not visit sites that often. Sometimes a site is revisited one or two months after the previous visit, and until then none of the pages that appeared after the last visit will be indexed. You see why search engines do not deliver real-time weather, stock, or news information?
  2. The site, on which the page you want resides, requires registration and/or a fee and there is nothing on Earth that the search engine can do to index such a site. If the site requires registration, after you register, you will be able to find the stuff you need but there is no other way to know if the stuff is there besides visiting the site and registering. It is a similar situation, when the site of interest to you is locked inside a database and search engines cannot access it because of that but when you go there, the site is searchable by humans and you can get what you want.
  3. Generally, search engines prefer static to dynamic sites and are reluctant to index dynamic ones (these are database sites where pages are generated dynamically on a user’s request). While dynamic sites are more powerful from a technical point of view, they are not the favorites of search engines and often are either not indexed at all, or only part of their content is crawled. If the URL of a page has question marks and other special symbols (like non-ASCII characters), then it is a good bet that the page will not be indexed by search engines, or at least by the major ones.
  4. Due to a violation of the fair play rules, search engines have excluded (temporarily or permanently) particular sites from their listings. This situation is much worse for site owners than for ordinary users but it is another reason for you, the ordinary user, not to be able to find the stuff you want. It is little consolation that after some time the site may be indexed again. Even if this happens, the site will have spent some time in the shadows of the Invisible Web.
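
To make point 3 a bit more tangible, here is a minimal Python sketch of the rule of thumb described there: a URL with a query string (the part after the question mark) or with non-ASCII characters is a candidate for being skipped. The function name and the example.com addresses are my own inventions for illustration, and real crawlers certainly apply their own, more elaborate rules.

```python
from urllib.parse import urlsplit


def looks_hard_to_index(url):
    """Rule of thumb from point 3, not any search engine's actual logic.

    Returns True when the URL carries a query string or contains
    non-ASCII characters.
    """
    has_query = bool(urlsplit(url).query)  # text after the "?"
    has_non_ascii = not url.isascii()      # e.g. accented letters
    return has_query or has_non_ascii


# Hypothetical addresses, used only to exercise the check.
for candidate in (
    "http://example.com/articles/index.html",
    "http://example.com/catalog.php?id=42&page=3",
    "http://example.com/café",
):
    print(candidate, looks_hard_to_index(candidate))
```

If you own a site, running a check like this over your own URLs gives a quick first impression of which pages might end up in the Invisible Web.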

Search Engines and Search Directories

There are hundreds of search engines and search directories that contain Invisible Web content. What is more, there are even ones that claim to be an exhaustive collection of such documents. Three of the most popular search engines for the Invisible Web are Direct Search (http://www.freepint.com/gary/direct.htm), The Invisible Web (http://www.invisibleweb.com), and CompletePlanet (http://www.completeplanet.com). They index mainly, but not only, Invisible Web content.

If you are searching for content that is limited to a particular topic only (e.g. programming), topical search engines are a great time-saver because they return only results related to the selected area. There are search engines for almost every topic you can think of, from gardening to nuclear weapons. A nice list of topical search engines can be found at http://www.searchengineguide.com/searchengines.html.

Metasearch engines return combined results, harvested from a syndicate of search engines, so if Google does not have a particular page in its index but the page is indexed by Yahoo or another search engine, it will still show up in the results. Metasearch engines were especially popular before the advent of Google but now they are regaining popularity. Examples of popular metasearch engines are Copernic (http://www.copernic.com), Beaucoup (http://beaucoup.com), Metacrawler (http://www.metacrawler.com), Dogpile (http://www.dogpile.com), and SurfWax (http://www.surfwax.com).
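
The combining step can be sketched in a few lines of Python. This is only an illustration of the idea, not how any of the engines listed above actually works: the engine names and result lists are made up, nothing is fetched from the Net, and the merge simply interleaves the lists rank by rank and drops duplicates.

```python
from itertools import zip_longest


def merge_results(results_by_engine):
    """Interleave ranked result lists and keep each URL only once.

    `results_by_engine` maps an engine name to its ranked list of URLs.
    """
    merged = []
    seen = set()
    # Take the first hit of every engine, then the second, and so on.
    for same_rank in zip_longest(*results_by_engine.values()):
        for url in same_rank:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Hypothetical results from two imaginary engines.
results = {
    "engine_a": ["http://example.com/a", "http://example.com/b"],
    "engine_b": [
        "http://example.com/b",
        "http://example.com/c",
        "http://example.com/d",
    ],
}
print(merge_results(results))
```

Pages "c" and "d" are known only to the second engine, yet they still end up in the merged list, which is exactly the benefit described above.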

Some of the search engines offer a directory service as well and there you can browse by topic. Besides the directory services of search engines, there are specialized search directories. Search directories are an important place to submit your site to.

Once upon a time the distinction between a search directory and a search engine was clear but today, when search directories provide search tools and search engines offer lists of topical links, there is no sharp boundary between the two services. Basically, the difference is that search engines crawl the Web to find pages, while people submit their pages to search directories. Search directories are collections of links that are organized hierarchically by topic – for instance the top-level topics are entertainment, business, education, technology, etc. These topics are further divided into subtopics, which in turn have subtopics of their own, etc. In this example, subtopics of technology could be computers, biotechnology, personal tech, etc. One of the advantages of search directories is that their content is reviewed by humans and irrelevant pages are excluded from the listings.
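
The hierarchy is easy to picture as a nested structure. Below is a small Python sketch that uses the topic names from the paragraph above; the nested dictionary and the find_path helper are my own illustration, not the data model of any real directory.

```python
# Top-level topics and the subtopics of "technology" mentioned above.
# Empty dictionaries stand for topics whose subtopics are not listed here.
directory = {
    "entertainment": {},
    "business": {},
    "education": {},
    "technology": {
        "computers": {},
        "biotechnology": {},
        "personal tech": {},
    },
}


def find_path(tree, topic, path=()):
    """Return the chain of topics leading to `topic`, or None if absent."""
    for name, subtopics in tree.items():
        current = path + (name,)
        if name == topic:
            return list(current)
        found = find_path(subtopics, topic, current)
        if found is not None:
            return found
    return None


print(find_path(directory, "biotechnology"))
```

Browsing a directory by hand is the same walk: you start at a top-level topic and descend through subtopics until you reach the one you need.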

Some of the most popular search directories are the Open Directory Project (DMOZ – http://www.dmoz.org), The Invisible Web Directory (http://www.invisible-web.net), Librarians’ Index to the Internet (http://lii.org), About.com (www.about.com), Infomine (infomine.ucr.edu), Yahoo! (www.yahoo.com – the Directory Service, not the search engine), Google (http://www.google.com/dirhp), etc.

Databases, Specialized Portals and Reference Sources

Invisible Web databases, specialized portals and reference sources can be an extremely valuable resource, especially for very specific stuff. While search engines and search directories list only links to pages (and it happens that these links are broken), specialized databases and portals generally contain the pages themselves, so you are less likely to encounter a broken link or a missing document. There are specialized databases and portals for many topics, and if you search Google for “medical database”, for example, it will display a long list of medical databases. Then you can go to the URL of a database and search from there. Another example of a specialized database is FindArticles – http://www.findarticles.com, which contains over 5 million articles, most of which are not indexed by the major search engines. A valuable source for virtual reference is The Internet Public Library (www.ipl.org). Of course, there are many other resources that could be mentioned but I leave the fun of discovering them to you.
