TAG | deep web

Although Google is the most popular search engine and the quality of its results keeps improving, it can still search only a fraction of the information available on the web. In 2008, Google added its trillionth address (1,000,000,000,000 known web pages), and yet this is still only the tip of the iceberg compared to what is available (Wright, 2009).

This vast amount of information, which is inaccessible to Google and other popular search engines, is collectively referred to as the deep web, the hidden web or the invisible web, in contrast to the documents which these search engines can access, collectively referred to as the surface web or the visible web.

No one knows for sure how big the deep web is, but all agree that it is vast. According to one estimate, the “total quality content of the deep web is at least 1,000 to 5,000 times greater than that of the surface web” (CompletePlanet), while the Kosmix website states that “experts estimate that search engines can access less than 1% of the data available on the Web”. Whichever figure is closer to the truth, both indicate that there is an enormous amount of quality information which our students are not tapping into if they are only using Google for their research.

Harvesting the Deep and Surface Web with a Directed Query Engine (Michael Bergman)

The deep web has been described by Maureen Henninger, author of The Hidden Web: Finding quality information on the net, as “publicly accessible, non-proprietary pages that are not ‘seen’ by the spiders of general search engines” (2008, p. 162). Dr Marcus Leaning of the University of Winchester categorises the web into three sections:

• Free, visible web – designed for the web and for being searched.

• Free, invisible web – resistant to being searched; sites require their own search engine.

• Not free, invisible web – closed networks that require their own search engines or need passwords. (manipulating-media.co.uk, 27/08/2010)

Much of the information contained in the deep web is referred to as ‘grey literature’ or ‘white papers’, which can be defined as “working documents, pre-prints, research papers, statistical documents, and other difficult-to-access materials that are not controlled by commercial publishers” (LAOAP). Producers of grey literature include research groups, not-for-profit groups, universities and government departments.

One reason why Google is not able to access many of these documents within the deep web is that they are often located in databases that require their own search engines.  If you know the name of the database, Google can take you to it, but it cannot search the database to retrieve information from it.  Our BGS subscription databases, accessible through MyGrammar, form part of the deep web.  They contain high-quality academic information that Google cannot retrieve, so it is essential that our students learn to search here first when researching for assignments.

There are also search engines and subject gateways specifically designed to access the deep web, including Intute, Incy Wincy, Science Accelerator, BUBL, WWW Virtual Library and Infomine. At BGS, we encourage the boys to use these search engines in conjunction with our subscription databases and Google Scholar to find quality, peer-reviewed academic information for their assignments.

Read more:
Deep Web. Kosmix. http://www.kosmix.com/topic/deep_web/overview/uc_kosmixarticle_cached#ixzz1Egwlsnt6

Henninger, M. (2008). The hidden web: Finding quality information on the net (2nd ed.). Sydney: UNSW Press.

BrightPlanet. http://brightplanet.com/the-deep-web/deep-web-faqs/

Wright, A. (23-02-2009). Exploring a ‘Deep Web’ that Google can’t grasp. New York Times. http://www.nytimes.com/2009/02/23/technology/internet/23search.html?_r=1&ref=business

Bergman, M. (2001). The deep web: Surfacing hidden value. Journal of Electronic Publishing. Retrieved from http://quod.lib.umich.edu/cgi/t/text/text-idx?c=jep;view=text;rgn=main;idno=3336451.0007.104

Leaning, M. (22-10-2010). Searching for and finding new information – Desk research – tools, strategies and techniques. http://manipulating-media.co.uk/2010/08/27/searching-for-and-finding-new-information-desk-research/

Deep Web Video - Office of Scientific and Technical Information


Google is without a doubt the most widely used search engine in the world, and has so far outstripped its rivals in popularity and usage that the verb ‘google’ is now part of our common language. In 2008 Google was handling 65 million searches per hour (Britannica Online). In 2010, one estimate was 34,000 searches per second (2 million per minute; 121 million per hour; 3 billion per day; 88 billion per month). Google is fast, clean and returns more results than any other search engine, but does it really find the information students need for quality academic research? According to Dr Marcus Leaning from the University of Winchester, the answer is often ‘no’. He states, “while simply typing words into Google will work for many tasks, academic research demands more.” (Searching for and finding new information – tools, strategies and techniques, August 27, 2010)
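
The 2010 search-volume figures quoted above can be sanity-checked with simple arithmetic. The short Python sketch below scales the per-second estimate up to larger time units. The function name `scale_rate` and the 30-day month are assumptions made for illustration only; they do not come from the original estimate.

```python
# Scale a per-second search rate up to larger time units.
# 34,000 is the 2010 per-second estimate quoted in the post;
# the 30-day month is an assumption for illustration.
SEARCHES_PER_SECOND = 34_000


def scale_rate(per_second: int, days_per_month: int = 30) -> dict:
    """Return the rate expressed per minute, hour, day and month."""
    per_minute = per_second * 60
    per_hour = per_minute * 60
    per_day = per_hour * 24
    per_month = per_day * days_per_month
    return {
        "per_minute": per_minute,
        "per_hour": per_hour,
        "per_day": per_day,
        "per_month": per_month,
    }


rates = scale_rate(SEARCHES_PER_SECOND)
for unit, value in rates.items():
    print(f"{unit}: {value:,}")
```

Multiplying out gives roughly 2 million searches per minute, 122 million per hour, 2.9 billion per day and 88 billion per month. These are close to the quoted figures, which were probably rounded from a slightly different base number.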

As far back as 2004, James Morris, Dean of the School of Computer Science at Carnegie Mellon University, coined the term “infobesity” to describe “the outcome of Google-izing research: a junk-information diet, consisting of overwhelming amounts of low-quality material that is hard to digest and leads to research papers of equally low quality.” (Is Google enough? Comparison of an internet search engine with academic library resources.)

Our challenge is to encourage our students to move from infobesity to infodieting.

Five reasons not to use Google first

  1. Google gives you the good with the bad, a mixture of trustworthy and not-so-trustworthy web sites. On the internet there is no standardized review process by editors, publishers, and librarians and very little control over what is being published. A website can be created by anyone, with few or no credentials, and can present misinformation, biased information or false information. It takes time and effort to check the validity and reliability of every website, and very few students would regularly do this.
  2. Google’s priority ranking system is determined by software and is dependent, to a large extent, on how many times a particular website has been linked to by others – i.e. its popularity. However, popular websites are not necessarily truthful or trustworthy. When these sites are listed first, students visit them first, which reinforces their popularity and helps them stay near the top of the rankings.
  3. Google returns too many results. Students rarely search beyond the first one or two pages of results, and if they haven’t been taught to search properly, they can keep retrieving the same sites again and again. They need to be taught the techniques of power searching – effective ways to access high-quality information – otherwise they will waste a lot of time viewing irrelevant websites. (Beyond Google – 15 Tools and Strategies for Improving Your Web Search Results)
  4. Google and other popular search engines are unreliable at searching the deep web and peer-reviewed or refereed content. Google only searches a small percentage of the content available on the web, the information referred to as the visible web or the surface web. According to BrightPlanet, the “total quality content of the Deep Web is at least 1,000 to 5,000 times greater than that of the Surface Web.” The invisible or deep web often requires passwords to access the information (e.g. subscription databases), or its sites use their own search engines, thus effectively blocking Google from accessing them.
  5. Advertisements, links and pop-ups are displayed on websites for profit, and can distract students when researching.
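
Reason 2 in the list above, ranking by popularity, can be illustrated with a toy example. The Python sketch below simply counts how many other pages link to each page and sorts by that count. It is a deliberate simplification for teaching purposes and not Google's actual algorithm, which weighs many more signals. The site names and the function `rank_by_inbound_links` are hypothetical.

```python
from collections import Counter

# Hypothetical link data: each page maps to the pages it links to.
links = {
    "blog-a.example": ["popular.example", "library.example"],
    "blog-b.example": ["popular.example"],
    "forum.example": ["popular.example", "blog-a.example"],
    "library.example": ["journal.example"],
}


def rank_by_inbound_links(link_map):
    """Rank pages by how many distinct other pages link to them."""
    counts = Counter()
    for source, targets in link_map.items():
        for target in set(targets):  # count each linking page only once
            if target != source:  # ignore self-links
                counts[target] += 1
    # Most-linked first; ties are broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_by_inbound_links(links)
for page, inbound in ranking:
    print(f"{page}: {inbound} inbound link(s)")
```

In this toy data the most-linked site comes first regardless of how accurate its content is, which is the point of reason 2: a popularity count says nothing about trustworthiness.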

According to Chris Sherman, Associate Editor of SearchEngineWatch.com, “vast expanses of the Web are completely invisible to general purpose search engines” but there are ways “to find the hidden gems search engines can’t see.” (quoted in Those Dark Hiding Places: The Invisible Web Revealed). The New York Times in 2009 also quoted Chris Sherman as saying, “Google faces a real challenge … They want to make the experience better, but they have to be supercautious with making changes for fear of alienating their users.” (Exploring a ‘Deep Web’ that Google Can’t Grasp)

While Google is an extremely useful search engine for many purposes, it is essential that for academic research our students learn to access quality information located in our subscription databases and in the deep web. At Brisbane Grammar School we are always working towards this goal.

When Not to Google: Searches You’re Better Off Making Elsewhere

Image from http://www.pro-webmarketing.com/search-engines
