Password Strength Prediction
Project Objective
The objective of this project is to predict the strength of passwords using machine learning models. Models are trained on a dataset of passwords labeled with their strength, and can then classify new passwords as Weak, Medium, or Strong.
Dataset Description
The dataset used in this project consists of passwords and their respective strength labels, categorized into three classes: Weak, Medium, and Strong. The data is stored in an SQLite file named password_data.sqlite and contains two columns:
- password: the actual password string.
- strength: the strength label of the password (0 for Weak, 1 for Medium, 2 for Strong).
Models Used
- Neural Network: A deep learning model built using TensorFlow and Keras, with multiple dense layers and dropout layers to prevent overfitting.
- RandomForestClassifier: A machine learning model using an ensemble of decision trees to improve classification performance.
Notebook Description
Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')
Mounts Google Drive so the notebook can read the dataset file stored there.
Loading Data from SQLite File
import sqlite3
import pandas as pd
sqlite_file_path = '/content/drive/MyDrive/password_data.sqlite'
conn = sqlite3.connect(sqlite_file_path)
query = "SELECT password, strength FROM Users"
data = pd.read_sql_query(query, conn)
conn.close()
Loads the password data into a pandas DataFrame.
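Before modeling, it is worth sanity-checking the class balance, since the strength labels are skewed toward the Medium class (as the test-set supports later in this report suggest). A minimal sketch:
print(data.shape)                        # number of rows and columns
print(data.head())                       # first few password/strength pairs
print(data['strength'].value_counts())   # class distribution (0=Weak, 1=Medium, 2=Strong)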
Data Preprocessing and Feature Extraction
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
X = data['password']
y = data['strength']
vectorizer = TfidfVectorizer(analyzer='char', max_features=100)
X_tfidf = vectorizer.fit_transform(X).toarray()
X_train_val, X_test, y_train_val, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=42)
Transforms passwords into character-level TF-IDF features and splits the data into training (60%), validation (20%), and test (20%) sets.
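To make the character-level representation concrete, the fitted vectorizer can be inspected; a short sketch (assuming scikit-learn >= 1.0 for get_feature_names_out; the exact vocabulary depends on the data):
print(len(vectorizer.get_feature_names_out()))   # at most 100 features (max_features=100)
print(vectorizer.get_feature_names_out()[:15])   # sample of the character vocabulary
print(X_tfidf.shape)                             # (n_passwords, n_features)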
Calculating Class Weights
from sklearn.utils import class_weight
import numpy as np
class_weights = class_weight.compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
class_weights_dict = {i: class_weights[i] for i in range(len(class_weights))}
Computes class weights to handle class imbalance.
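For reference, 'balanced' weighting assigns each class c the weight n_samples / (n_classes * n_c), so rarer classes receive proportionally larger weights. A hand-computed sketch with hypothetical counts (not the actual dataset values):
import numpy as np

counts = np.array([13500, 74260, 12240])      # hypothetical Weak, Medium, Strong counts
n_samples, n_classes = counts.sum(), len(counts)
manual_weights = n_samples / (n_classes * counts)
print(dict(enumerate(manual_weights)))        # rare classes get weights > 1, common ones < 1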
Building and Training the Neural Network
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
model = Sequential([
    Dense(256, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.5),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(3, activation='softmax')
])
optimizer = Adam(learning_rate=0.0005)
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
history = model.fit(X_train, y_train, epochs=50, batch_size=64, validation_data=(X_val, y_val), class_weight=class_weights_dict, callbacks=[early_stopping])
Trains the neural network with early stopping to prevent overfitting; the training and validation loss and accuracy are printed for each epoch. Training stopped after epoch 24, once validation loss had gone 5 epochs without improving, and the best weights were restored.
Output:
Epoch 1/50
938/938 [==============================] - 5s 4ms/step - loss: 0.8614 - accuracy: 0.5306 - val_loss: 0.7110 - val_accuracy: 0.6511
Epoch 2/50
938/938 [==============================] - 8s 9ms/step - loss: 0.6014 - accuracy: 0.6228 - val_loss: 0.6811 - val_accuracy: 0.6554
Epoch 3/50
938/938 [==============================] - 5s 6ms/step - loss: 0.5338 - accuracy: 0.6569 - val_loss: 0.6273 - val_accuracy: 0.6779
Epoch 4/50
938/938 [==============================] - 5s 6ms/step - loss: 0.4950 - accuracy: 0.6780 - val_loss: 0.5886 - val_accuracy: 0.7027
Epoch 5/50
938/938 [==============================] - 4s 4ms/step - loss: 0.4705 - accuracy: 0.6922 - val_loss: 0.5631 - val_accuracy: 0.7211
Epoch 6/50
938/938 [==============================] - 4s 4ms/step - loss: 0.4504 - accuracy: 0.7037 - val_loss: 0.5433 - val_accuracy: 0.7255
Epoch 7/50
938/938 [==============================] - 5s 5ms/step - loss: 0.4393 - accuracy: 0.7120 - val_loss: 0.5116 - val_accuracy: 0.7376
Epoch 8/50
938/938 [==============================] - 5s 5ms/step - loss: 0.4254 - accuracy: 0.7185 - val_loss: 0.5496 - val_accuracy: 0.7175
Epoch 9/50
938/938 [==============================] - 4s 4ms/step - loss: 0.4145 - accuracy: 0.7255 - val_loss: 0.5337 - val_accuracy: 0.7357
Epoch 10/50
938/938 [==============================] - 4s 4ms/step - loss: 0.4068 - accuracy: 0.7283 - val_loss: 0.5413 - val_accuracy: 0.7354
Epoch 11/50
938/938 [==============================] - 6s 6ms/step - loss: 0.4047 - accuracy: 0.7324 - val_loss: 0.5197 - val_accuracy: 0.7427
Epoch 12/50
938/938 [==============================] - 4s 4ms/step - loss: 0.3928 - accuracy: 0.7391 - val_loss: 0.4932 - val_accuracy: 0.7641
Epoch 13/50
938/938 [==============================] - 4s 4ms/step - loss: 0.3921 - accuracy: 0.7409 - val_loss: 0.4890 - val_accuracy: 0.7586
Epoch 14/50
938/938 [==============================] - 5s 5ms/step - loss: 0.3819 - accuracy: 0.7506 - val_loss: 0.5116 - val_accuracy: 0.7479
Epoch 15/50
938/938 [==============================] - 4s 5ms/step - loss: 0.3762 - accuracy: 0.7511 - val_loss: 0.5038 - val_accuracy: 0.7503
Epoch 16/50
938/938 [==============================] - 4s 4ms/step - loss: 0.3729 - accuracy: 0.7534 - val_loss: 0.4959 - val_accuracy: 0.7618
Epoch 17/50
938/938 [==============================] - 4s 5ms/step - loss: 0.3656 - accuracy: 0.7598 - val_loss: 0.4865 - val_accuracy: 0.7678
Epoch 18/50
938/938 [==============================] - 5s 5ms/step - loss: 0.3609 - accuracy: 0.7579 - val_loss: 0.4696 - val_accuracy: 0.7703
Epoch 19/50
938/938 [==============================] - 4s 4ms/step - loss: 0.3575 - accuracy: 0.7641 - val_loss: 0.4403 - val_accuracy: 0.7875
Epoch 20/50
938/938 [==============================] - 4s 4ms/step - loss: 0.3525 - accuracy: 0.7619 - val_loss: 0.4681 - val_accuracy: 0.7699
Epoch 21/50
938/938 [==============================] - 6s 6ms/step - loss: 0.3486 - accuracy: 0.7672 - val_loss: 0.4593 - val_accuracy: 0.7807
Epoch 22/50
938/938 [==============================] - 4s 4ms/step - loss: 0.3452 - accuracy: 0.7695 - val_loss: 0.4488 - val_accuracy: 0.7897
Epoch 23/50
938/938 [==============================] - 4s 4ms/step - loss: 0.3407 - accuracy: 0.7714 - val_loss: 0.4822 - val_accuracy: 0.7680
Epoch 24/50
938/938 [==============================] - 5s 5ms/step - loss: 0.3408 - accuracy: 0.7713 - val_loss: 0.4425 - val_accuracy: 0.7902
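The history object returned by model.fit can be used to visualize these curves; a minimal sketch using matplotlib (preinstalled in Colab):
import matplotlib.pyplot as plt

# Plot training vs. validation loss from the history collected above
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='val loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()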
Evaluating the Neural Network Model
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f'Test Accuracy: {test_accuracy:.2f}')
Evaluates the neural network model on the test set and prints the test accuracy.
Output:
Test Accuracy: 0.79
Making Predictions with the Neural Network Model
y_test_pred = model.predict(X_test)
y_test_pred_classes = y_test_pred.argmax(axis=1)
Output:
625/625 [==============================] - 1s 1ms/step
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_pred_classes, target_names=['Weak', 'Medium', 'Strong']))
Prints the classification report for the test set, showing precision, recall, and F1-score for each class.
Output:
              precision    recall  f1-score   support

        Weak       0.46      0.98      0.63      2700
      Medium       0.97      0.74      0.84     14852
      Strong       0.71      0.90      0.79      2448

    accuracy                           0.79     20000
   macro avg       0.72      0.87      0.75     20000
weighted avg       0.87      0.79      0.80     20000
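The low Weak precision (0.46) combined with near-perfect Weak recall (0.98) suggests the class-weighted network over-predicts the Weak class. A confusion matrix makes such error patterns explicit (a short sketch):
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes (0=Weak, 1=Medium, 2=Strong)
cm = confusion_matrix(y_test, y_test_pred_classes)
print(cm)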
Predicting Password Strength with the Neural Network Model
def predict_password_strength(password):
    password_transformed = vectorizer.transform([password]).toarray()
    strength_pred = model.predict(password_transformed)
    strength_class = strength_pred.argmax(axis=1)[0]
    return ['Weak', 'Medium', 'Strong'][strength_class]
new_password = "hhhhhhhhhhhccGGG_@FSJSK52424hhhhhhhhhhhhhhhhhhhhhhhhhhhhh"
predicted_strength = predict_password_strength(new_password)
print(f'The predicted strength of the password "{new_password}" is: {predicted_strength}')
Predicts the strength of a new password and prints the result.
Output:
The predicted strength of the password "hhhhhhhhhhhccGGG_@FSJSK52424hhhhhhhhhhhhhhhhhhhhhhhhhhhhh" is: Weak
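Note that this long, mixed-character password is still classified as Weak, plausibly because character-level TF-IDF captures character frequencies but not length or ordering, so the long run of repeated 'h' characters dominates the feature profile. Probing the model with several inputs at once makes such behavior easier to spot (the example passwords below are arbitrary):
# Hypothetical examples to probe the model's behavior
for pw in ['abc123', 'Tr0ub4dor&3', 'correcthorsebatterystaple']:
    print(pw, '->', predict_password_strength(pw))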
Training the RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
rf_model = RandomForestClassifier(class_weight='balanced', n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
Trains the RandomForestClassifier model.
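One advantage of the forest is interpretability: its feature importances can be mapped back to the TF-IDF character vocabulary. A minimal sketch (assuming scikit-learn >= 1.0 for get_feature_names_out):
import numpy as np

# Rank the TF-IDF character features by their importance in the trained forest
chars = vectorizer.get_feature_names_out()
order = np.argsort(rf_model.feature_importances_)[::-1]   # most important first
for i in order[:10]:
    print(repr(chars[i]), round(rf_model.feature_importances_[i], 4))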
Evaluating the RandomForestClassifier Model
y_test_pred_rf = rf_model.predict(X_test)
print(classification_report(y_test, y_test_pred_rf, target_names=['Weak', 'Medium', 'Strong']))
Prints the classification report for the test set, showing precision, recall, and F1-score for each class.
Output:
              precision    recall  f1-score   support

        Weak       0.94      0.62      0.75      2700
      Medium       0.92      0.99      0.95     14852
      Strong       0.96      0.86      0.91      2448

    accuracy                           0.92     20000
   macro avg       0.94      0.83      0.87     20000
weighted avg       0.93      0.92      0.92     20000
Predicting Password Strength with the RandomForestClassifier Model
def predict_password_strength_rf(password):
    password_transformed = vectorizer.transform([password]).toarray()
    strength_pred = rf_model.predict(password_transformed)[0]
    return ['Weak', 'Medium', 'Strong'][strength_pred]
new_password = "charif"
predicted_strength_rf = predict_password_strength_rf(new_password)
print(f'The predicted strength of the password "{new_password}" is: {predicted_strength_rf}')
Predicts the strength of a new password and prints the result.
Output:
The predicted strength of the password "charif" is: Weak
Conclusion
In this project, two different models were trained to predict password strength: a neural network and a RandomForestClassifier. Both models were evaluated on the same held-out test set. The neural network achieved a test accuracy of 79% (macro F1 of 0.75), while the RandomForestClassifier performed markedly better, reaching 92% accuracy (macro F1 of 0.87).
The notebook also includes functions to predict the strength of new passwords using the trained models, demonstrating the practical application of these models in real-world scenarios.
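To reuse the trained models outside the notebook, all three fitted objects can be persisted to disk, for example (a sketch; the file names are arbitrary, and the native .keras format requires a recent TensorFlow/Keras version):
import joblib

model.save('password_nn.keras')                     # Keras model in the native format
joblib.dump(rf_model, 'password_rf.joblib')         # the trained forest
joblib.dump(vectorizer, 'tfidf_vectorizer.joblib')  # the fitted TF-IDF vectorizer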
Feel free to clone this repository and experiment with the models and data. If you have any questions or suggestions, please open an issue or contact me directly.