Machine Learning & Neural Networks Blog

Traffic Accident Prediction

This Python script analyzes and predicts traffic accident severity using machine learning. It processes accident records, trains a predictive model, evaluates its performance, and reports which features most influence the predictions. A Random Forest classifier uses temporal, spatial, environmental, and driver-related features to predict the accident severity level (minor, moderate, severe).

Overview:
1. Data Input and Initial Processing
The implementation begins with raw CSV data containing accident records with multiple features:
○ Temporal features (Date, Time_of_Day)
○ Spatial coordinates (Latitude, Longitude)
○ Environmental conditions (Weather_Condition, Road_Condition)
○ Traffic-related features (Traffic_Condition, Traffic_Lights)
○ Vehicle and driver information (Vehicle_Type, Driver_Age)
○ Accident outcomes (Injury_Count, Fatalities)
The data loading phase calls pandas' read_csv with dtype=str, so every column is first read as text. The numeric columns (Latitude, Longitude, Involved_Vehicles, Injury_Count, Fatalities, Driver_Age) are then converted with pd.to_numeric, and any value that cannot be parsed becomes NaN.
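The two-step loading idea can be tried on a tiny in-memory file. This is a minimal sketch: the three sample rows and the reduced column set are made up for illustration and are not taken from the real dataset.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the real file has more columns.
csv_text = """Date,Latitude,Driver_Age,Weather_Condition
2023-01-05,44.43,34,Rain
2023-01-06,not_available,,Clear
2023-01-07,45.10,51,Fog
"""

# Step 1: read every column as text so nothing is mis-parsed on load
raw = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Step 2: convert the numeric columns; unparseable entries become NaN
numeric_columns = ["Latitude", "Driver_Age"]
for col in numeric_columns:
    raw[col] = pd.to_numeric(raw[col], errors="coerce")

# Count how many values were lost to coercion in each column
missing_per_column = raw[numeric_columns].isna().sum()
print(missing_per_column)
```

Checking the NaN counts right after loading is a cheap way to see how many records have unusable numeric values before they reach the model.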

2. Feature Engineering
The preprocessing stage implements several key transformations:
a. Temporal Feature Extraction:
○ Date parsing to extract month and day
○ Time_of_Day mapping to numerical hours (morning→9, afternoon→15, evening→20, day→12)
○ Day_of_Week encoding
b. Categorical Variable Encoding:
○ Implementation of LabelEncoder for categorical features
○ Binary encoding for boolean features (Traffic_Lights, Alcohol_Involvement)
○ Preservation of encoding mappings for future predictions
c. Feature Selection:
○ Temporal components
○ Geographical coordinates
○ Environmental conditions
○ Traffic parameters
○ Vehicle and driver characteristics
○ Accident statistics
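
The transformations listed above can be sketched on a few hand-written rows. The sample values and the Weather_Code column name are assumptions made for this illustration. Note that a Time_of_Day label missing from the mapping would come out as NaN.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample rows using the column names from the article
sample = pd.DataFrame({
    "Date": ["2023-03-14", "2023-11-02", "2023-07-21"],
    "Time_of_Day": ["morning", "evening", "afternoon"],
    "Weather_Condition": ["Rain", "Clear", "Rain"],
    "Traffic_Lights": ["Yes", "No", "Yes"],
})

# Temporal features: month and day from the parsed date
sample["Date"] = pd.to_datetime(sample["Date"])
sample["Month"] = sample["Date"].dt.month
sample["Day"] = sample["Date"].dt.day

# Time_of_Day label mapped to a representative hour
time_mapping = {"morning": 9, "afternoon": 15, "evening": 20, "day": 12}
sample["Hour"] = sample["Time_of_Day"].map(time_mapping)

# Categorical encoding; the fitted encoder is kept so codes can be decoded later
weather_encoder = LabelEncoder()
sample["Weather_Code"] = weather_encoder.fit_transform(sample["Weather_Condition"])

# Yes/No flags become 1/0
sample["Traffic_Lights"] = sample["Traffic_Lights"].map({"Yes": 1, "No": 0})

# The stored encoder recovers the original labels from the integer codes
decoded = weather_encoder.inverse_transform(sample["Weather_Code"])
print(sample[["Month", "Day", "Hour", "Weather_Code", "Traffic_Lights"]])
print(list(decoded))
```

LabelEncoder assigns codes in sorted label order, so the integers carry no meaning beyond identity. Tree-based models such as Random Forest usually tolerate this.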

3. Data Preparation
The implementation utilizes a structured train-test split approach:
○ 80-20 split ratio (training-testing)
○ Stratification by target variable (Accident_Severity)
○ Feature scaling using StandardScaler
○ Preservation of scaler parameters for prediction pipeline
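
The split-then-scale order matters: the scaler is fitted on the training portion only and then reused on the test portion. The sketch below shows this on synthetic data. The array shapes and the 70/20/10 class mix are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and an imbalanced 3-class target
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = np.array([0] * 70 + [1] * 20 + [2] * 10)

# 80-20 split; stratify=y keeps the class proportions in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on training data only, then apply the same
# mean and standard deviation to the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("train class counts:", np.bincount(y_train))
print("test class counts:", np.bincount(y_test))
```

Keeping the fitted scaler object is what allows new records to be transformed with the same parameters at prediction time.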

4. Model Configuration
The Random Forest Classifier is configured with the following fixed hyperparameters (chosen by hand, not tuned by a search):
○ 100 estimators (trees)
○ Maximum depth of 10 levels
○ Minimum samples split of 5
○ Minimum samples leaf of 2
○ Balanced class weights to handle potential class imbalance
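
To see what "balanced" class weights mean in practice, the weights can be computed directly. scikit-learn documents the formula as n_samples / (n_classes * np.bincount(y)). The 70/20/10 class mix below is an assumed example, not the distribution of the real dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Assumed imbalanced target: 70 minor, 20 moderate, 10 severe (encoded 0, 1, 2)
y = np.array([0] * 70 + [1] * 20 + [2] * 10)
classes = np.unique(y)

# Weights as scikit-learn computes them for class_weight="balanced"
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same values from the documented formula
manual = len(y) / (len(classes) * np.bincount(y))

for cls, w in zip(classes, weights):
    print(f"class {cls}: weight {w:.3f}")
```

The rarest class gets the largest weight, so its misclassifications count more heavily while the trees are grown.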

5. Training and Evaluation Pipeline
The evaluation process includes:
a. Model training on scaled features
b. Prediction generation on test set
c. Comprehensive performance metrics:
○ Classification report with precision, recall, and F1-score
○ Confusion matrix visualization
○ Feature importance analysis
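
The metrics can be checked by hand on a small example. The ten true and predicted labels below are invented for illustration. The class names are the ones mentioned in the article, and their encoded order is assumed.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Assumed encoding: 0 = minor, 1 = moderate, 2 = severe
class_names = ["minor", "moderate", "severe"]
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 2, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# output_dict=True returns the report as nested dictionaries,
# which is convenient for reading individual numbers
report = classification_report(
    y_true, y_pred, target_names=class_names, output_dict=True
)
print(f"severe recall: {report['severe']['recall']:.2f}")
print(f"minor precision: {report['minor']['precision']:.2f}")
```

For severity prediction, recall on the severe class is often the number to watch, since a missed severe accident is usually the costliest error.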

6. Visualization Components
The implementation includes several visualization elements:
a. Feature Importance Plot:
○ Bar chart of top 10 influential features
○ Importance scores based on Random Forest feature importance
b. Confusion Matrix Heatmap:
○ Visual representation of model predictions versus actual values
○ Color-coded for easy interpretation
○ Annotated with specific counts
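
The table behind the feature importance bar chart can be built without plotting. This is a minimal sketch on synthetic data: the rule that the label depends only on Hour and the Noise column are assumptions, chosen so that the expected ranking is obvious.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features; only Hour carries information about the label
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Driver_Age": rng.integers(18, 80, size=300),
    "Hour": rng.integers(0, 24, size=300),
    "Noise": rng.normal(size=300),
})
y = (X["Hour"] >= 18).astype(int)  # made-up rule for illustration

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X, y)

# Same construction as the article: sort by importance and keep the top 10
importance = (
    pd.DataFrame({"feature": X.columns, "importance": model.feature_importances_})
    .sort_values("importance", ascending=False)
    .head(10)
    .reset_index(drop=True)
)
print(importance)
```

Random Forest importances are impurity-based and sum to 1 across all features, so they show relative influence within one model and are not comparable between models.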

7. Prediction System
The prediction pipeline implements:
a. Data Preprocessing:
○ Application of the fitted scaler to the new record (its features must already be encoded the same way as the training data)
○ Feature alignment with training data
b. Prediction Generation:
○ Class prediction
○ Probability distribution across severity classes
○ Conversion of numerical predictions to original severity labels
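
The prediction steps can be sketched end to end on synthetic data. Everything here is assumed for illustration: the two features, the way the first feature shifts with severity, and the new record.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Synthetic training data: one feature shifts with severity, one is noise
rng = np.random.default_rng(0)
severity_labels = np.array(["minor"] * 40 + ["moderate"] * 40 + ["severe"] * 40)
signal = np.concatenate([
    rng.normal(0, 1, 40), rng.normal(5, 1, 40), rng.normal(10, 1, 40)
])
X = np.column_stack([signal, rng.normal(size=120)])

# Encode the target and keep the encoder for decoding predictions
severity_encoder = LabelEncoder()
y = severity_encoder.fit_transform(severity_labels)

# Fit scaler and model, as in the training step
scaler = StandardScaler()
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(scaler.fit_transform(X), y)

# New record: same column order as training, scaled with the fitted scaler
new_record = np.array([[10.0, 0.0]])
new_scaled = scaler.transform(new_record)

# Decode the predicted class index back to its label
predicted_label = severity_encoder.inverse_transform(model.predict(new_scaled))[0]

# predict_proba columns follow model.classes_ (0, 1, 2 here), which
# line up with severity_encoder.classes_ because y was produced by it
prob_dist = dict(zip(severity_encoder.classes_, model.predict_proba(new_scaled)[0]))

print(predicted_label)
for label, prob in prob_dist.items():
    print(f"{label}: {prob:.2%}")
```

The probability distribution shows how decisive the forest is, which the class label alone does not.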


Below is the full code with additional comments embedded.


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt


def load_data(file_path):
    """ 
    Load data from a CSV file with specific column handling 
    """
    try:
        # Read CSV with all columns as strings initially to prevent any parsing errors
        df = pd.read_csv(file_path, dtype=str)

        # Convert numeric columns to appropriate types
        numeric_columns = ['Latitude', 'Longitude', 'Involved_Vehicles',
                           'Injury_Count', 'Fatalities', 'Driver_Age']
        for col in numeric_columns:
            df[col] = pd.to_numeric(df[col], errors='coerce')

        print(f"Successfully loaded {len(df)} records from the dataset")
        return df
    except Exception as e:
        print(f"Error loading the CSV file: {e}")
        return None


def preprocess_data(df):
    """ 
    Preprocess the data for model training 
    """
    # Create a copy to avoid modifying the original dataframe
    df = df.copy()

    # Convert Date to datetime
    df['Date'] = pd.to_datetime(df['Date'])

    # Extract time-based features
    df['Month'] = df['Date'].dt.month
    df['Day'] = df['Date'].dt.day

    # Map Time_of_Day directly to hours (no datetime conversion needed)
    time_mapping = {
        'morning': 9,
        'afternoon': 15,
        'evening': 20,
        'day': 12
    }
    df['Hour'] = df['Time_of_Day'].map(time_mapping)

    # Categorical columns for encoding
    categorical_columns = [
        'Road_Type',
        'Weather_Condition',
        'Traffic_Condition',
        'Time_of_Day',
        'Day_of_Week',
        'Vehicle_Type',
        'Road_Condition'
    ]

    # Initialize dictionary for label encoders
    label_encoders = {}

    # Encode categorical variables
    for column in categorical_columns:
        label_encoders[column] = LabelEncoder()
        df[column] = label_encoders[column].fit_transform(df[column])

    # Convert boolean columns to integer
    boolean_columns = ['Traffic_Lights', 'Alcohol_Involvement']
    for column in boolean_columns:
        df[column] = df[column].map({'Yes': 1, 'No': 0})

    # Define features for the model
    features = [
        'Latitude', 'Longitude', 'Month', 'Day', 'Hour',
        'Road_Type', 'Weather_Condition', 'Traffic_Condition',
        'Day_of_Week', 'Involved_Vehicles', 'Injury_Count',
        'Fatalities', 'Traffic_Lights', 'Alcohol_Involvement',
        'Driver_Age', 'Vehicle_Type', 'Road_Condition'
    ]

    # Define target variable
    target = 'Accident_Severity'

    # Encode target variable
    label_encoders[target] = LabelEncoder()
    df[target] = label_encoders[target].fit_transform(df[target])

    return df, features, target, label_encoders

def train_model(df, features, target):
    """ 
    Train the Random Forest model 
    """
    # Prepare feature matrix and target vector
    X = df[features]
    y = df[target]

    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Scale the features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Train the model with class weight consideration
    model = RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        min_samples_split=5,
        min_samples_leaf=2,
        class_weight='balanced',
        random_state=42
    )
    model.fit(X_train_scaled, y_train)

    return model, scaler, X_test_scaled, y_test, X_train, y_train


def evaluate_model(model, X_test, y_test, feature_names, label_encoders):
    """ 
    Evaluate the model and display results 
    """
    # Make predictions
    y_pred = model.predict(X_test)

    # Get original class names
    severity_encoder = label_encoders['Accident_Severity']
    class_names = severity_encoder.classes_

    # Print classification report
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=class_names))

    # Calculate and display feature importance
    feature_importance = pd.DataFrame({
        'feature': feature_names,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)

    print("\nTop 10 Most Important Features:")
    print(feature_importance.head(10))

    # Plot feature importance
    plt.figure(figsize=(12, 6))
    sns.barplot(x='importance', y='feature', data=feature_importance.head(10))
    plt.title('Top 10 Most Important Features')
    plt.tight_layout()
    plt.show()

    # Plot confusion matrix
    plt.figure(figsize=(8, 6))
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=class_names,
                yticklabels=class_names)
    plt.title('Confusion Matrix')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.tight_layout()
    plt.show()

    return feature_importance


def predict_new_accident(model, scaler, new_data, features, label_encoders):
    """ 
    Make prediction for a new accident 
    """
    # Prepare the new data
    new_data = new_data[features].copy()

    # Scale the features
    new_data_scaled = scaler.transform(new_data)

    # Make prediction
    prediction = model.predict(new_data_scaled)
    probabilities = model.predict_proba(new_data_scaled)

    # Convert numerical prediction back to original class name
    severity_encoder = label_encoders['Accident_Severity']
    prediction_class = severity_encoder.inverse_transform(prediction)

    # Get class names for probability distribution
    class_names = severity_encoder.classes_

    # Create probability distribution dictionary
    prob_dist = {class_name: prob for class_name, prob in zip(class_names, probabilities[0])}

    return prediction_class[0], prob_dist


# Main execution
if __name__ == "__main__":
    # File paths
    input_file = "accident_data.csv"  # Replace with your actual file path

    # Load the data
    df = load_data(input_file)

    if df is not None:
        # Preprocess data
        df, features, target, label_encoders = preprocess_data(df)

        # Train model
        model, scaler, X_test, y_test, X_train, y_train = train_model(df, features, target)

        # Evaluate model
        feature_importance = evaluate_model(model, X_test, y_test, features, label_encoders)

        # Example of prediction for a new accident
        # Reuse the first preprocessed row as an example
        # (note: this row may also be part of the training split)
        new_accident = df.iloc[[0]]
        prediction, probabilities = predict_new_accident(
            model, scaler, new_accident, features, label_encoders)

        print("\nExample Prediction:")
        print(f"Predicted Severity: {prediction}")
        print("\nProbability Distribution:")
        for severity, prob in probabilities.items():
            print(f"{severity}: {prob:.2%}")
                        


Get the Jupyter Notebook and the dataset used in this project.
