Data Mining In The Deep Web

Holding a huge amount of data, but being hidden, can you mine it?


The Deep Web has a bad reputation, mostly thanks to its association with the Dark Web. However, while the Dark Web does reside within the Deep Web, there is far more there than drugs and illegal pornography, and most of it is entirely innocuous.

The Deep Web actually includes any website that cannot be indexed by a search engine: any page invisible to the 'crawlers' that Google and its competitors use to scour the web for content to fill their results pages. It consists primarily of database-driven websites, along with any part of a website that sits behind a login page. It also includes sites blocked by local webmasters, sites with special formats, and ephemeral sites. Google and the other engines cannot reach these because their crawlers are not programmed to fill out search forms and click the submit button; instead, a crawler would have to interact with the web server presenting the form and send it the information that specifies the query, plus any other data the server needs.
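To make the distinction concrete, here is a minimal sketch of what 'filling out the form' means at the HTTP level. The URL and field names are invented for illustration; the point is that the results page only exists in response to a POST request that a link-following crawler never makes.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical search form on a database-driven site. A conventional
# crawler only follows links, so it never triggers this query and the
# results page stays invisible to it.
SEARCH_URL = "https://example.org/library/search"

def build_form_request(query: str, year: int) -> Request:
    """Encode the form fields exactly as a browser's submit button would."""
    body = urlencode({"q": query, "year": year, "submit": "Search"})
    return Request(SEARCH_URL, data=body.encode("utf-8"), method="POST")

req = build_form_request("deep web", 2015)
print(req.get_method())   # POST
print(req.data.decode())  # q=deep+web&year=2015&submit=Search
```

A Deep Web crawler has to generate requests like this one itself, choosing sensible values for each field, which is exactly the problem the tools discussed below try to solve.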

Estimates vary as to how much of the internet the Deep Web accounts for, but some university researchers put it at more than 99% of the entire World Wide Web. The Deep Web is thought to hold tens of trillions of pages, dwarfing the mere billions that search engines can find.

For Data Scientists, the Deep Web presents a huge problem. Obviously, there are a number of difficulties inherent in mining something that's hidden, but the data held is so vast that failing to at least try would be a huge folly. There is also much of value within the Deep Web, particularly its searchable databases. Thousands of high-quality, authoritative online specialty databases exist, and they are extremely useful for a focused search. PubMed is one example: it consists of documents focused on very specific medical conditions, authored by professional writers and published in professional journals. There is also the Tor network, which hosts much of the Dark Web and whose mining is of obvious benefit to law enforcement agencies. However, the Tor Browser typically disables JavaScript, which most analytics programmes need in order to run, making it all the harder for analytics software to mine.

To mine the Deep Web manually would be an impossible task. There are currently a number of bots available that attempt to solve the problem. Such crawlers must be designed to automatically parse, process, and interact with form-based search interfaces that are designed primarily for human consumption. They must also provide input in the form of search queries, raising the issue of how best to equip crawlers with the input values needed to construct those queries. Stanford has built a prototype engine named the Hidden Web Exposer (HiWE), which tries to scrape the Deep Web for information using a task-specific, human-assisted approach to overcome such issues. Others that are publicly accessible include Infoplease, PubMed and the University of California's Infomine. There is also BrightPlanet's Big Data Mining tool, the Deep Web Monitor, which allows you to set a specific query such as a location or keyword and harvest the entire web for relevant information.

