An Analysis of the IntelX Scrape

I saw an interesting post on RaidForums today. If you are not aware, RaidForums is a website where people share, trade, and sell data breaches. When you see a news article refer to a top secret, underground, hidden site, on the DARK network, they are probably just referring to RaidForums, which is easily accessible within any web browser. What caught my eye was the title of "scrape of pastes on intelx.io". IntelX is a website that I have used in the past to search for content within Pastebin archives. IntelX charges from $2,000 to $10,000 per year to access data publicly scraped from Pastebin, but I have always encouraged people to take advantage of free trials whenever the need surfaces. A complete collection of the entire IntelX archive seemed like a useful data set. I grabbed a copy and dug in.

The data set contained 87,813 text files, each of which appears to be a complete scrape of a single paste. The decompressed size was just over 6 GB. Using ripgrep, I searched for pastes that might be relevant to me. I started with a query of "inteltechniques" and received dozens of hits. Almost all referred to links from my website, and nothing was exciting. I then searched for "@gmail.com:" since many credential lists are presented as email:password.
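For anyone without ripgrep installed, the same kind of literal keyword search can be sketched in a few lines of Python. This is only an illustration of the idea, not what I ran: the function name search_pastes is made up for this example, and it assumes the pastes were extracted as .txt files under a single folder.

```python
from pathlib import Path
from typing import Iterator, Tuple


def search_pastes(root: str, needle: str) -> Iterator[Tuple[str, int, str]]:
    """Yield (file path, line number, line) for every line containing needle.

    Assumption: the pastes are stored as .txt files somewhere under root.
    """
    for path in sorted(Path(root).rglob("*.txt")):
        # Pastes arrive in mixed encodings; replace undecodable bytes
        # instead of aborting the whole search on one bad file.
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line:
                    yield str(path), number, line.rstrip("\n")
```

Calling search_pastes("intelx_scrape", "@gmail.com:") would walk a hypothetical intelx_scrape folder and return every matching line along with its file and line number. Expect this to be much slower than ripgrep on 6 GB of text, but the results stay entirely on the local machine.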

As expected, millions of email/password combinations appeared. The entire collection contains 46,176,519 email addresses. I suspect the vast majority of these are already within various credential combo lists. This data could be extremely valuable for viewing an entire paste as it appeared on Pastebin, especially since sensitive pastes are often removed.
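If you want to tally the addresses in your own copy, here is a rough sketch of one way to do it. It is not the method behind the figure above; the count_emails name, the .txt extension, and the simplified email pattern are all assumptions made for this example.

```python
import re
from pathlib import Path
from typing import Set, Tuple

# Deliberately simple pattern (an assumption for this sketch). It will miss
# some valid addresses and accept some malformed ones.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def count_emails(root: str) -> Tuple[int, int]:
    """Return (total matches, unique addresses) across .txt files under root."""
    total = 0
    unique: Set[str] = set()
    for path in Path(root).rglob("*.txt"):
        # Replace undecodable bytes so one odd file does not stop the count.
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                for address in EMAIL_RE.findall(line):
                    total += 1
                    # Lowercase so User@Gmail.com and user@gmail.com merge.
                    unique.add(address.lower())
    return total, len(unique)
```

The function reports both the raw number of matches and the number of distinct addresses, since the two can differ widely when the same credentials are reposted across many pastes. Keeping every distinct address in memory is fine for a quick look, but a collection of this size may need several gigabytes of RAM.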

It should be noted that IntelX has downplayed this scrape. They state that this collection is only a small percentage of the pastes they have collected. This is absolutely true, but this downloadable collection contains only the good stuff. I believe this scrape is much more useful than the entire paste collection, as all of the "junk" pastes that do not contain domains or email addresses have been eliminated. I have always been surprised that IntelX charges so much money to access publicly available information. This data set eliminates the need to create a trial in order to research beneficial archives. The ability to conduct keyword searches against local data is far superior to any online search. I no longer need to worry about revealing details of my investigation to any third party. Since IntelX acquired 100% of this data from public sources, and it was then scraped through their official public API, I have no issue downloading my own copy for research.