Mining In The Dark

Can analytics get to the data in the Dark Web?


Data mining the Dark Web has proven notoriously difficult. It is, by its very nature, incredibly secretive: traffic passes through constantly changing nodes that connect the user to the website, which makes users extremely difficult to track. It is inaccessible to search engines that crawl links on web pages, and is made up of a number of networks that go unused by the majority of the public, the best known being Tor, along with Freenet and I2P.

For law enforcement, the Dark Web presents a huge problem. Hsinchun Chen estimated that as of 2010 there were in excess of 100,000 sites containing extremist and terrorist content, a number that has surely grown since with the rise of IS. Criminal users are also afforded ready access to weapons, drugs, and a variety of illegal services. While the closing of the notorious online black marketplace Silk Road by the FBI in 2013 has gone some way to preventing such sales, there are still many means available to those looking to purchase illegal materials online anonymously. Of course, it’s worth noting that there are also a number of theories about who is actually in control of the Dark Web, with one of the most popular being that it’s actually the CIA, which uses it as a means to keep tabs on the criminal world.

Dark Web intelligence is vital to finding and shutting down platforms that are being used to facilitate criminal activity, and Big Data can also be mined from it to provide a better understanding of criminal behaviour, so that it can be stopped in the real world. Analytics programmes face a number of challenges in accessing information on the Dark Web, including the constantly changing nodes, which mean that a website is never reached by the same route twice. Browsers such as the Tor Browser are also often run with JavaScript disabled, and JavaScript is what most analytics programmes need to run. However, it is becoming increasingly easy to apply analytics thanks to this year’s release of ‘Memex’ technology by the Defense Advanced Research Projects Agency (DARPA), the wing of the US Department of Defense that develops emerging technologies for use by the military.

Memex has been created ostensibly as a tool to prevent human trafficking operations that occur on the Dark Web, but it is likely that it will soon be applied to preventing a far wider range of criminal activities. It is a means for users to organize subsets of information based on individual interests as quickly and thoroughly as possible. It aims to improve the ability of military, government and commercial enterprises to discover and organize mission-critical, publicly available information on the internet. It is now used in all of the cases pursued by the New York County District Attorney's Office’s Human Trafficking Response Unit, and has played a part in generating at least 20 active sex trafficking investigations.

DARPA is set to make a handful of Memex tools available to the public later in the year, including advanced web crawling and scraping technologies, as well as artificial intelligence and machine learning, which together enable the automated retrieval of virtually any content on the internet. Its solution to mining the Dark Web is called SourcePin, which attempts to overcome the limitations of crawlers that can’t click or scroll in the same way humans do, and thus can’t collect ‘dynamic’ content that appears only after an action by a user. SourcePin acts more like a human user with a browser, and can scroll pages and even hover over an object on the page to reveal more content. As such, it can deal with virtually any web page scenario, including Tor, to provide an automated, standardized way of counting and exploring the many sites on the Dark Web. With this new kind of search engine, it is unlikely that the secrets of the Dark Web will remain secrets for long.
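
For readers who want to see the crawler limitation described above in concrete terms, here is a minimal Python sketch. It is not SourcePin or any Memex code; the page markup, the `LinkExtractor` class and the `extract_links` helper are all hypothetical, invented for illustration. It shows that a crawler which only reads the static HTML finds the links written into the page, but not the ones a script would add once a browser runs it.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags found in static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the links a purely static crawler would discover."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# Hypothetical page: one ordinary link, plus one that only exists
# after a browser executes the script.
PAGE = """
<html><body>
<a href="/about">About</a>
<div id="listings"></div>
<script>
  document.getElementById("listings").innerHTML =
      '<a href="/hidden">Hidden</a>';
</script>
</body></html>
"""

if __name__ == "__main__":
    # The script body is treated as raw text, so "/hidden" is never seen.
    print(extract_links(PAGE))
```

A tool that drives a real browser, as SourcePin is described as doing, would run the script first and could then pick up the second link as well.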

