The invisible web, also known as the deep web and the hidden web, refers to web content that is not included in (is “not visible” to) most search engines.
This page integrates a number of tools designed specifically for searching the invisible web. However, many of the search tools across the many sections of Fagan Finder are themselves part of the invisible web, or include results that are part of the invisible web.
For example, INFOMINE, LII, and RDN are three academic directories included on the Advanced Reference page, which include tons of invisible web resources.
There are several reasons why web content may not be included in search engines:
If a web page is very far down in a directory structure (such as http://www.example.com/dir/sub/001/jan/mon/10.html), then a search engine may decide not to go that deep.
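As a rough illustration of such a depth limit, the sketch below counts the path segments in a URL and compares the count to a cutoff. The function names and the cutoff of 3 are assumptions made for this example; search engines do not publish a rule like this.

```python
from urllib.parse import urlsplit

MAX_DEPTH = 3  # hypothetical cutoff, chosen only for illustration


def path_depth(url: str) -> int:
    """Return the number of non-empty path segments in a URL."""
    return len([segment for segment in urlsplit(url).path.split("/") if segment])


def within_depth(url: str, max_depth: int = MAX_DEPTH) -> bool:
    """True if a crawler with this cutoff would still visit the URL."""
    return path_depth(url) <= max_depth


deep_url = "http://www.example.com/dir/sub/001/jan/mon/10.html"
print(path_depth(deep_url))    # six segments: dir, sub, 001, jan, mon, 10.html
print(within_depth(deep_url))  # False with the assumed cutoff of 3
```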
Many search engines only include the first 100 kb or so of a web page. If it is longer, the rest of the page may not be included. Google goes 101 kb deep; Yahoo! goes to 500 kb.
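To see what a size limit means in practice, the sketch below keeps only the first part of a page and checks whether a phrase survives the cut. The helper names are hypothetical, and treating 1 kb as 1024 bytes is an assumption.

```python
def indexed_portion(page: bytes, limit_kb: int) -> bytes:
    """Return the part of the page an engine with this size limit would keep."""
    return page[: limit_kb * 1024]  # assumes 1 kb = 1024 bytes


def phrase_is_indexed(page: bytes, phrase: bytes, limit_kb: int) -> bool:
    """True if the phrase appears inside the kept portion of the page."""
    return phrase in indexed_portion(page, limit_kb)


# A 150 kb page of filler with one phrase at the very end.
page = b"x" * (150 * 1024) + b"needle"
print(phrase_is_indexed(page, b"needle", 101))  # False: the phrase lies past 101 kb
print(phrase_is_indexed(page, b"needle", 500))  # True: the whole page fits in 500 kb
```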
Search engines recrawl web pages at different rates: some once a month (and some less often), and some every few minutes. Even if a certain page is included in a search engine, it may not have the most recent content added to that page. If you’re looking for really recent information, try searching through Weblogs and News. Some tools, especially some of the blog search engines, may include content that was published online just minutes ago.
Most search engines find web pages by following links; if there are no links to a page, it may never be found.
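The sketch below shows why an unlinked page stays hidden. It walks a small, made-up link graph from the home page and collects every page it can reach. The graph and the function name are assumptions for illustration, not how any particular engine works.

```python
from collections import deque


def discover(links: dict, start: str) -> set:
    """Return every page reachable from start by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical site: "/orphan" links out, but nothing links to it.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b"],
    "/b": [],
    "/orphan": ["/"],
}

found = discover(SITE, "/")
print(sorted(found))       # ['/', '/a', '/b']
print("/orphan" in found)  # False: no link leads to it
```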
Webmasters can specify, in a file called robots.txt or by using meta tags, that they do not want certain pages to be indexed. Responsible robots (the software that search engines use to download web pages) do not index those pages.
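A minimal sketch of how a well-behaved robot might honor robots.txt, using Python's standard urllib.robotparser module. The robots.txt contents, the robot name "ExampleBot", and the URLs are made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks all robots to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url: str, agent: str = "ExampleBot") -> bool:
    """True if the rules above permit this robot to fetch the URL."""
    return parser.can_fetch(agent, url)


print(allowed("http://www.example.com/index.html"))         # True
print(allowed("http://www.example.com/private/page.html"))  # False
```

A real robot would download the file from the site (RobotFileParser also offers set_url and read for that) before requesting any other page.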
Protected & Fee
Some pages require a username and password to view (and often a fee as well); search engines can’t get past this.
Once upon a time, search engines only included web content that was in HTML format. Today, the major search engines do include some content in other formats. These formats include Microsoft Office and Adobe Portable Document Format documents.
For more information about what engines include what type of content (and to search through it), see Fagan Finder’s Search by File Format.
Dynamic, database-driven pages (such as those that include characters like ? in the URL) scare many search engines, which fear getting caught in an endless loop of pages on the same website. Some search engines, such as Google, do include some dynamic content.
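One simple way a crawler could spot such pages is to check the URL for a query string (the part after the ?). The sketch below does just that; the function name is hypothetical, and real engines may use other signals.

```python
from urllib.parse import urlsplit


def looks_dynamic(url: str) -> bool:
    """True if the URL carries a non-empty query string after a '?'."""
    return bool(urlsplit(url).query)


print(looks_dynamic("http://www.example.com/search?q=cats"))  # True
print(looks_dynamic("http://www.example.com/about.html"))     # False
```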
The first two reasons (deep and big) are easy to change from the perspective of search engines. Many today go deeper into the directory structure and farther down the page than they used to.
Pages without links are often of poor quality (hence no links), and so this is not a significant problem. They can still be included in most search engines by submitting the page.
The next two reasons (robots excluded and protected) are not “problems”; if people don’t want others on their website, then search engines shouldn’t be going there.
The last two reasons (non-HTML and dynamic) constitute what most people consider to be the invisible web; more recently, the term has come to mean only the last of these.
There are many pages on the web which are not in HTML. See Search by File Format for more on that.
Google includes more non-HTML content than any other engine. There are millions of dynamic pages, many of which are exactly what you may be looking for, while many others belong to endless series of near-duplicate pages.
Indexing dynamic pages has the potential to stall a search engine’s crawler, which is why most don’t do this. It is believed that Google indexes dynamic pages that it finds by following links from static ones, but does not follow links from dynamic pages themselves.
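The policy described above can be sketched as a crawl that indexes every page it reaches but only follows links out of static pages. This illustrates the belief as stated, not Google's actual crawler; the link graph, function names, and the query-string test for "dynamic" are all assumptions.

```python
from collections import deque
from urllib.parse import urlsplit


def is_dynamic(url: str) -> bool:
    """Treat any URL with a query string as dynamic (an assumption)."""
    return bool(urlsplit(url).query)


def crawl(links: dict, start: str) -> list:
    """Index reachable pages, but never follow links found on dynamic pages."""
    indexed = []
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        indexed.append(url)
        if is_dynamic(url):
            continue  # indexed, but its links are ignored
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return indexed


# Hypothetical site with a calendar that links to the next month forever.
SITE = {
    "/index.html": ["/about.html", "/cal?month=1"],
    "/cal?month=1": ["/cal?month=2"],
    "/cal?month=2": ["/cal?month=3"],
}

print(crawl(SITE, "/index.html"))
# ['/index.html', '/about.html', '/cal?month=1']
```

The first calendar page is indexed because a static page links to it, but the crawler stops there, so the endless chain of month pages cannot trap it.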
June 13, 2004: I’ve gone through the page information, and rewritten/reworded most of it. ResourceShelf has been added as a search tool. Link to Invisible-web.net and Salon article. Other minor changes and additions.
June 11, 2004: Logging updates from now on. This page was created previously.