I have been looking for ways to increase the speed of password cracking. There have been a lot of research done on the topic but theres one team that has done some amazing work in this regard. There is a paper called OMEN: Faster Password Guessing Using an Ordered Markov Enumerator that was the inspiration for this post. The idea for this post was from this paper but all the work was manually done by me and no code or anything else have been copied. So, lets start.
Using Ordered Markov Chains and User Information to speed up Password Cracking
Increasing cracking accuracy by 22.5% from John the rippers markov and incremental model
After reading the above paper, I wanted a similar tool but the tool made by those guys is not available anywhere. So, I spent a few days and wrote the code for creating such a tool that is flexible enough to perforn n-gram and markov chains based password generation using a cracked password list. I wanted to tinker with the tool and writing my own tool gave me freedom to do so. The tool is still in developmental phase. Ill share it once its ready.
After writing the script, I wanted to see if we can leverage user information to speed up password cracking. There was a data breach for Fling, which is a dating/adult website. Their database was available publicly and I downloaded it and performed some analysis. Lets look into it.
Users using their personal information in their passwords:
With a simple query, I was able to see that what percentage of users have the same first three letters in their passwords and emails/usernames/usercodes/nicknames etc. There are total 4993276 passwords in the fling database. Lets see the percentage of same starting trigrams.
8% of total passwords means almost 386,894 passwords. Thats a large number.
Lets look at 1-grams.
17% people have the same first letter in their password as they have in their email or username or nickname.
Let us a play a bit with birth dates and joining dates. I wanted to see that how many users use their birthdates at the end of their passwords. The query showed that 2% users do this. Here are a few results.
So, after just doing some very basic analysis, we have come to know that around 20% users use their very basic information in their passwords.
If this is not enough, I have got another dump (ClixSense) containing more user information like username, first_name, last_name, email, country, city, date of birth, date of joining, security_answer etc. I have crafted a query to see how many passwords have the same first n-grams that are also present in one of their social information. The results are astonishing. Lets have a look at them.
Using 2 grams and some more social information like date of birth, we see that 32.5% users use some of their social information in their passwords. Heres the query.
select count(password) from users where lower(substring(password,1,2)) = lower(substring(username,1,2))
or substring(password,1,2) = substring(first_name,1,2)
or substring(password,1,2) = substring(email,1,2)
or substring(password,1,2) = substring(last_name,1,2)
or substring(password,1,2) = substring(city,1,2)
or substring(password,1,2) = substring(security_answer,1,2)
or substring(password,1,2) = substring(address1,1,2)
or substring(password,1,2) = substring(security_answer,1,2)
or substring(password,1,2) = substring(zip,1,2)
or substring(password,1,2) = substring(payable_to,1,2)
or lower(substring(password,1,2)) = lower(substring(country,1,2))
or substring(password,length(password)-3,4) = substring(date_of_birth,1,4)
or substring(password,length(password)-4,4) = substring(date_of_birth,1,4)
or substring(password,length(password)-3,4) = substring(register_date,1,4)
or substring(password,length(password)-3,2) = substring(username,length(username)-3,2)
or substring(password,length(password)-2,1) = substring(username,length(username)-2,1)
The result is 6.5 lac passwords out of 2 million passwords.
Can we leverage this to speed up password cracking? Yes, we can. Lets dive into markov chains
To put in simple terms, what Markov chains do is that they tell us the probability of a letter occurring after a n gram. Suppose, we have a 4 gram ilov, the Markov chain would just tell us the probabilities of the next letters. In this example, most of the time, the highest probability of next letter is e making it ilove. Thats all there is to Markov chains that you need to know. Given a n gram, we have the probabilities of next letters. If we sort these probabilities, we get passwords in the order of highest to lowest probabilities hence the name ordered markov chains.
John the ripper uses un-ordered Markov chains in a sense that it goes through one n gram completely before going to the next ngram. What my script did was go through all high probability n grams simultaneously using threads. I hope its clear, if its not, please feel free to ask anything in comments. You can also read the OMEN paper given in the start of this post.
I will share the script for doing all of this stuff once it is complete. For now, I am just posting the results and the data.
The data I used for cracking came from MuslimMatch database dump. The hashes were stored in MD5 and username, user_code,user_email and user_country was given in the database dump. I have ignored the country for now and used only the username, user code and user email. There were 77000 hashes to crack.
John the ripper Incremental Mode
I wanted to try 100 million passwords for each mode within length 6-12. First mode was incremental mode. The results were 15,000 hashes cracked for the first 100 million passwords. That is 20%
John the ripper Markov Mode
I also tried 100 million passwords here with level=210 and the same passwords lengths. The results were 20,000 hashes cracked for the first 100 million passwords. That is 25%.
Ordered Markov Chain with first three letters as n-grams from Users information
What I did here was get tri-grams, tetra-grams, penta-grams and hexagrams from user information and used them as the starting letters. The following letters were used from a passwords list of mate1 that is available publicly. I used only 5 million passwords from the dump to show that this approach can perform better even with lesser passwords since John the rippers markov model is trained on rockyou passwords and they are around 13 million. This way, I was using a separate password list for training(mate1) and a separate one for testing(muslimmatch).
I generated 60 million passwords with n grams of user information approach and 40 million with simple ordered Markov chain approach with training list as mate1 passwords.
The results were 25,000 hashes cracked for the first 100 million passwords. That is around 32%, more than the incremental mode and markov mode.
Why did this work? It worked because a large number of users use some part of their email or username or any other detail in their password and if we can first check passwords that start with n grams containing usernames portions, user emails portions etc, then we can intuitively speed up the accuracy and the experiments proved this intuition right. Another reason for increased accuracy was because I used my training word list from the same category as that of website i.e adult/relationship websites. These two factors were the main cause of the increase in accuracy.
Why does this matter? On my laptop, John the ripper generated 100 million passwords in less than 2 minutes. If we are cracking 32% passwords in less than 2 minutes, thats a great thing. Isnt it?
PS: I also did some experimentation with appending birth dates in the end and the results come out to be good.
Thats it. I hope this idea can be expanded further by you guys and we can see some amazing results. For privacy reasons, I cannot upload the pruned data of the dumps.
Note: This is by no means a statement that I have broken john the rippers speed and accuracy. That is a great tool made by some awesome people. I just think that using users social information as n grams with incremental markov models can increase the speed of password cracking to a good extent. Criticism is most welcome.