It’s been close to 3 years since I released 835,955,029 real-world passwords so that anyone can check their password posture and understand their credential exposure. This helps everyone check the data exposure of their current, past, or future passwords.
The primary idea of this tool is to provide feedback through a simple web-based console accessible at https://XposedOrNot.com. The same functionality is also available as an API that any application or enterprise can consume to strengthen their password posture. The API and related details are explained below.
Now comes the big question of how I was able to source such a huge number of plain-text passwords. Well, the answer is pretty simple: there are various places from which these passwords can be sourced, provided you have the time and energy to find and sort them.
Finding them is easy, but sorting them from their raw sources is painful, and I have to admit that sorting takes more time than sourcing the passwords.
Let’s start with the infamous 1.4 billion collection, touted as the largest single collection of plain-text credentials found on the dark web. That was one of the easiest and least painful sources to extract passwords from. This dump gave me 1,400,553,869 (1.4 billion+) passwords, of which 463,625,357 (463 million) were unique.
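The deduplication step behind those unique counts can be sketched as follows. This is only an in-memory illustration; the actual multi-gigabyte dumps would need an external sort (e.g. GNU `sort -u`) rather than a Python set:

```python
def unique_passwords(lines):
    """Deduplicate a raw password dump, ignoring blank lines.

    In-memory sketch only: real multi-GB dumps would need an external
    sort (e.g. `LC_ALL=C sort -u dump.txt`) instead of a set.
    """
    return {line.strip() for line in lines if line.strip()}

# Example: a tiny dump with duplicates and a blank line
dump = ["hunter2\n", "letmein\n", "hunter2\n", "\n", "qwerty\n"]
uniq = unique_passwords(dump)  # {"hunter2", "letmein", "qwerty"}
```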
My next set of passwords was extracted from a large torrent of such dumps being shared across breach-exposure websites, which gave me another massive collection. Not to mention another infamous collection, the ExploitIN dump, which contained 793,372,299 entries.
Another interesting thing worth mentioning: during this scouting process, I came into possession of a good number of data breaches and their raw exposed files, spread across the open Internet on various dump sites.
A couple of torrents sharing breach dumps ran to hundreds of GB in size. Since most of the breaches, barring a few, store their passwords in forms ranging from easily recoverable MD5 hashes to the not-so-easy bcrypt hashes, I wondered for a long time what to do with all those hashes.
I do not have the huge processing capacity needed to crack these passwords, and at the same time I was not comfortable cracking them myself. This is where hashes.org came in handy: a brilliant set of folks who have taken the time and effort to crack most of these exposed breaches.
My sincere thanks to them; their work has helped me a lot in this exercise. In fact, their collection was so good that it even included cracked passwords from breaches that were not yet in my repository.
After crossing a couple of hundred million unique passwords, the next obvious step led me to scrape Pastebin for exposed credentials, which are posted there frequently and most of the time anonymously. This took even more time and effort, as it was not a direct process. First, I needed to identify pastes that had passwords exposed.
To identify them, I ran a couple of Twitter searches and found accounts that more or less shared such information. With a positive search output, I started collecting all those tweets, which came to 501,032 in number. Oh yes, that’s a large number of URLs to download from Pastebin. The next step was to download all the pastes, and I quickly ran into the restrictions Pastebin enforces on scraping.
Interestingly, at the same time, Pastebin gave me an option to download everything through a whitelisted IP address, and it cost less than $15 for lifetime access, called Pastebin PRO. It looked like a good deal, so I picked it up. I then tasked a simple Python script with downloading the entire 501K Pastebin addresses and quickly hit the next limit: the speed at which I was downloading.
Pastebin complained that I was hammering them pretty hard, so I introduced a time lag with an induced sleep of one second per paste download to comply with the limit of 10K requests every 10 minutes. It took me around 7 days, 7 whole days, to download the entire set of addresses. A quick glance at the downloads showed some trends in file sizes, and the most frequent size was only 33 bytes.
I opened a 33-byte file, and it read “Error, we cannot find this paste.”. Aha, then it struck me that pastes are not going to live forever; it is up to Pastebin or the creators of the pastes to decide their lifetime, hence the 33-byte responses. Out of the 500K, 95,311 could not be downloaded because they were either removed for violating Pastebin’s policies or deleted by the original paste creators.
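The throttled download loop described above can be sketched roughly like this. The raw-paste URL and error handling here are my assumptions for illustration; the IP-whitelisted PRO scraping API uses a different endpoint:

```python
import time
import urllib.request

# The 33-byte body returned for deleted or removed pastes
DEAD_PASTE = "Error, we cannot find this paste."

def is_dead_paste(body: str) -> bool:
    """True when a paste has been deleted or removed by Pastebin."""
    return body.strip() == DEAD_PASTE

def download_pastes(paste_ids, delay=1.0):
    """Fetch raw pastes one per second to stay within ~10K requests / 10 min.

    https://pastebin.com/raw/<id> is the public raw endpoint; the PRO
    scraping API, whitelisted by IP, uses a different URL.
    """
    for pid in paste_ids:
        url = f"https://pastebin.com/raw/{pid}"
        try:
            with urllib.request.urlopen(url) as resp:
                body = resp.read().decode("utf-8", errors="replace")
        except OSError:
            body = DEAD_PASTE  # treat network failures like missing pastes
        if not is_dead_paste(body):
            yield pid, body
        time.sleep(delay)
```

The dead-paste check explains the 33-byte files: the error string is exactly 33 bytes long.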
Anyway, this entire week of scouting and scraping led me to another couple of million unique plain-text passwords. I have to say, the various forms and styles in which passwords are exposed on Pastebin make them one hell of a task to handle. Luckily, Python saved me again: with the help of good regexes and parsers, I was able to handle them within a day. It is worth noting that sharing credentials is against Pastebin’s policies, yet there are still enough pastes containing plain-text passwords available for anyone to view, search, use, or misuse.
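One of the simpler parsers might look like the sketch below. The `email:password` pattern is purely illustrative; real pastes come in many more formats, each needing its own parser:

```python
import re

# Naive email<separator>password pattern -- one of many formats seen in pastes
CRED_RE = re.compile(
    r"([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})[:;|\t ]+(\S+)"
)

def extract_passwords(paste_text: str) -> list:
    """Pull the password part out of email:password style lines."""
    return [pw for _email, pw in CRED_RE.findall(paste_text)]

# Example paste with two separator styles
sample = "alice@example.com:hunter2\nbob@example.org;p@ss"
found = extract_passwords(sample)  # ["hunter2", "p@ss"]
```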
If Pastebin is interested, or anyone from Pastebin is reading this or can provide me with a contact, I am more than happy to share those addresses to remedy this open exposure.
Now that I had a good collection of ~850 million real-world passwords, the next logical step was to create something useful for everyone rather than let it sit in one of my private folders serving no one. I created a simple one-page search and populated it using responses from the API for quick connectivity.
With the help of zxcvbn (a password strength estimation tool), we can further assess the strength of the password entered, for everyone’s benefit. As an added measure, I have also thrown in an additional one billion-plus words from a well-known wordlist to further strengthen the process. Put together, that is roughly 1.85 billion words to check against, which in my humble opinion should be sufficient for now and the near future.
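To give a feel for what strength estimation does, here is a deliberately crude charset-entropy sketch. This is not zxcvbn itself: zxcvbn is far smarter, modelling dictionary words, keyboard walks, dates, and repeats rather than just the character pool:

```python
import math
import string

def naive_strength_bits(password: str) -> float:
    """Crude charset-based entropy estimate in bits.

    Illustration only -- zxcvbn scores patterns and dictionary hits,
    so it would rate "Password1!" far lower than this pool math does.
    """
    pool = 0
    if any(c in string.ascii_lowercase for c in password):
        pool += 26
    if any(c in string.ascii_uppercase for c in password):
        pool += 26
    if any(c in string.digits for c in password):
        pool += 10
    if any(not c.isalnum() for c in password):
        pool += 33  # rough count of printable ASCII symbols
    return len(password) * math.log2(pool) if pool else 0.0
```

By this naive measure, "password" scores about 37.6 bits, which is exactly why checking against 1.85 billion known words matters: entropy math alone overrates passwords humans actually pick.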
Alright, enough said; let us look at a few of the captured screens.
I want to reserve the technology and architecture for a detailed blog post. However, I do want to provide the basic details of the technologies used and the architecture deployed for hosting XoN.
My entire infrastructure for this activity is a single Linux instance running Ubuntu with 1 vCPU and 4 GB RAM on Google Cloud. From a cost-benefit standpoint, and given that the same infrastructure runs Google, the most used website, I think I can depend on them for the long run. Since it runs on GCP, I have also enabled auto-scaling to meet demand during peak or heavy use.
From a technology standpoint, I have used only custom Python scripts for downloading, sorting, parsing, and inserting all the downloaded data to extract the passwords. Python gave me the flexibility and speed to code these functions, along with excellent supporting libraries for every requirement of this exercise.
The same goes for the API: Python is the base on which I have built the API, the main engine powering the search function and returning results in microseconds.
The entire dataset is stored in Google Cloud Datastore for the lightning-fast search results our use case needs. Cloud Datastore is a highly scalable NoSQL database that automatically handles sharding and replication, providing a highly available and durable database that scales automatically with the application’s load, giving us practically unlimited room to scale.
Well, with all these exposed passwords serving as the source for XoN services, the next big question is how I handle privacy.
The first point to mention is that none of the passwords are stored in plain text; they are stored as SHA3-Keccak-512 hashes, a one-way hash rendered as 128 hexadecimal characters (64 bytes). Currently, there is only one tool that has even implemented cracking Keccak hashes.
The second point is that no password ever needs to be sent to XoN in plain text, whether through the website or the API; only SHA3-Keccak-512 hashes are accepted as input.
The third and last point is that I have also added anonymous searches to the API for anyone interested in this functionality. The idea is to send only the first ten characters of the 128-character one-way hash, and the API searches the database using only those shared characters. With 10 hexadecimal characters, the search space is huge, to the tune of 16^10 (about 1.1 trillion). This huge space also minimizes collisions between hashes beginning with the same 10 characters, giving close-to-accurate results even with anonymized searches.
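A client-side sketch of that anonymized search follows. One caveat: Python’s `hashlib.sha3_512` implements NIST SHA-3, whose padding differs from the original Keccak-512, so a Keccak library such as pysha3 would be needed to reproduce the service’s exact digests; this stdlib stand-in only demonstrates the prefix idea:

```python
import hashlib

def anon_search_prefix(password: str, n: int = 10) -> str:
    """First n hex characters of the password's one-way hash.

    NOTE: hashlib.sha3_512 is NIST SHA-3, not the original Keccak-512
    padding the service uses -- a library such as pysha3 would be
    needed to match its real hashes. Sketch of the prefix idea only.
    """
    digest = hashlib.sha3_512(password.encode("utf-8")).hexdigest()
    # 512-bit digest -> 128 hex characters; only the prefix is sent
    return digest[:n]

# The server matches only on the prefix: 16**10 (~1.1 trillion)
# possible prefixes keep collisions rare while hiding the full hash.
prefix = anon_search_prefix("hunter2")
```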
This also addresses most of the concerns of folks who might worry about sharing even a one-way hash with the API for search.
Last but not least, please visit https://XposedOrNot.com to find a few more interesting pages to feed your appetite for related breaches.
Feel free to share your feedback and comments below for the benefit of all.
Update 05-Jul-2018: Cracking tool for Keccak hashes updated based on input from @.
1. Originally published at my personal blog https://www.devaonbreaches.com/2018/06/xposedornot-want-850-million-passwords.html
2. Few data points updated to reflect the current status and change.