The Deep Web has a bad reputation, mostly thanks to its association with the Dark Web. However, while the Dark Web is part of the Deep Web, there is far more there than drugs and illegal pornography, and most of it is entirely benign.
The Deep Web includes any website that cannot be indexed by a search engine, that is, any page that cannot be detected by the ‘crawlers’ that Google and its competitors use to find pages for their results. It consists primarily of database-driven websites and any part of a website that sits behind a login page. It also includes sites blocked by local webmasters, sites with special formats, and ephemeral sites. Google and other engines cannot reach this content because their crawlers follow links; they are not programmed to fill out search forms and click the submit button. To retrieve such pages, a crawler must instead interact with the web server that presents the form, sending it the information that specifies the query along with any other data the server needs.
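To make that last point concrete, here is a minimal sketch of the request a crawler would have to construct in place of a human pressing submit. The endpoint URL and the field names (`title`, `year`) are hypothetical, chosen only for illustration, and nothing is actually sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical search endpoint, for illustration only.
FORM_ACTION = "https://example.org/catalog/search"

def build_form_request(action, fields, method="GET"):
    """Return the HTTP request a browser would send when the form is submitted."""
    encoded = urlencode(fields)
    if method.upper() == "GET":
        # GET forms carry the query in the URL's query string.
        return Request(f"{action}?{encoded}", method="GET")
    # POST forms carry the same data in the request body instead.
    return Request(
        action,
        data=encoded.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )

# Field names and values are assumptions; a real form defines its own.
request = build_form_request(FORM_ACTION, {"title": "deep web", "year": "2001"})
print(request.get_method(), request.full_url)
```

A link-following crawler never produces a request like this on its own, because no hyperlink on the page points at the query result. Something has to choose the field values first.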
Estimates vary as to how much of the internet the Deep Web accounts for, but some top university researchers say it is more than 99% of the entire World Wide Web. There are tens of trillions of pages in the Deep Web, dwarfing the mere billions that search engines can find.
Mining the Deep Web manually would be an impossible task, and a number of bots now attempt to solve the problem. Such crawlers must be designed to automatically parse, process, and interact with form-based search interfaces that were built primarily for human use. They must also supply input in the form of search queries, which raises the question of how best to equip a crawler with the input values it needs to construct those queries. Stanford has built a prototype engine named the Hidden Web Exposer (HiWE), which tries to scrape the Deep Web for information using a task-specific, human-assisted approach to overcome such issues. Publicly accessible gateways to Deep Web content include Infoplease, PubMed and the University of California's Infomine. There is also BrightPlanet’s Big Data Mining tool, called the Deep Web Monitor, which allows you to set a specific query such as a location or keyword, and harvest the entire web for relevant information.
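The two problems described above, parsing a form meant for humans and deciding what to type into it, can be sketched in a few lines. This is not HiWE's implementation. It is a simplified illustration that assumes a hypothetical form, reads drop-down options straight from the markup, and relies on a small human-supplied list of values for the free-text field.

```python
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode

# Hypothetical search form, standing in for a page a crawler has fetched.
SAMPLE_FORM = """
<form action="/search" method="get">
  <input type="text" name="keyword">
  <select name="region">
    <option value="us">United States</option>
    <option value="eu">Europe</option>
  </select>
  <input type="submit" value="Search">
</form>
"""

class FormFieldParser(HTMLParser):
    """Collects the form's action and its fillable fields."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}     # field name -> allowed values ([] means free text)
        self._select = None  # name of the <select> currently being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action")
        elif tag == "input" and attrs.get("type", "text") == "text" and "name" in attrs:
            self.fields[attrs["name"]] = []
        elif tag == "select" and "name" in attrs:
            self._select = attrs["name"]
            self.fields[self._select] = []
        elif tag == "option" and self._select and "value" in attrs:
            self.fields[self._select].append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None

def candidate_queries(html, supplied_values):
    """Yield one query URL per combination of field values."""
    parser = FormFieldParser()
    parser.feed(html)
    names = list(parser.fields)
    # Drop-downs use their own options; free-text fields need supplied values.
    # A free-text field with no supplied values yields no queries at all.
    choices = [parser.fields[n] or supplied_values.get(n, []) for n in names]
    for combo in product(*choices):
        yield f"{parser.action}?{urlencode(dict(zip(names, combo)))}"

# Human-supplied, task-specific values for the free-text field (an assumption).
urls = list(candidate_queries(SAMPLE_FORM, {"keyword": ["semiconductor", "solar"]}))
for url in urls:
    print(url)
```

Even this toy form shows why input selection matters: two keywords and two regions already produce four queries, and the count grows multiplicatively with every extra field or value.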