Traffic Accident Prediction

This Python script analyzes traffic accident records and predicts accident severity with machine learning. It preprocesses the accident data, trains a Random Forest classifier on temporal, spatial, environmental, and driver-related features, evaluates the model, and reports which features most influence the predicted severity level (minor, moderate, severe).

1. Data Input and Initial Processing
The implementation begins with raw CSV data containing accident records with multiple features:
○ Temporal features (Date, Time_of_Day)
○ Spatial coordinates (Latitude, Longitude)
○ Environmental conditions (Weather_Condition, Road_Condition)
○ Traffic-related features (Traffic_Condition, Traffic_Lights)
○ Vehicle and driver information (Vehicle_Type, Driver_Age)
○ Accident outcomes (Injury_Count, Fatalities)
The data is loaded with pandas' read_csv using dtype=str, so every column is first read as a string. The numeric columns (Latitude, Longitude, Involved_Vehicles, Injury_Count, Fatalities, Driver_Age) are then converted with pd.to_numeric, and any value that cannot be parsed becomes NaN.
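The loading approach can be illustrated with a minimal sketch. The two-row sample and its values are hypothetical and stand in for the real CSV file; only the column names come from the description above.

```python
import io

import pandas as pd

# Hypothetical two-row sample; the real file has many more columns.
sample_csv = io.StringIO(
    "Latitude,Longitude,Driver_Age,Weather_Condition\n"
    "40.71,-74.00,34,Rain\n"
    "34.05,-118.24,unknown,Clear\n"
)

# Read every column as a string so that no value is silently reinterpreted.
df = pd.read_csv(sample_csv, dtype=str)

# Convert the numeric columns; unparseable entries become NaN.
for col in ["Latitude", "Longitude", "Driver_Age"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df.dtypes)
print(df["Driver_Age"].isna().sum())  # number of values that failed to parse
```

Rows with NaN values may still need to be dropped or imputed before training, depending on the estimator and library version in use.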

2. Feature Engineering
The preprocessing stage implements several key transformations:
a. Temporal Feature Extraction:
○ Date parsing to extract month and day
○ Time_of_Day mapping to numerical hours (morning→9, afternoon→15, evening→20, day→12)
○ Day_of_Week encoding
b. Categorical Variable Encoding:
○ Implementation of LabelEncoder for categorical features
○ Binary encoding for boolean features (Traffic_Lights, Alcohol_Involvement)
○ Preservation of encoding mappings for future predictions
c. Feature Selection:
○ Temporal components
○ Geographical coordinates
○ Environmental conditions
○ Traffic parameters
○ Vehicle and driver characteristics
○ Accident statistics
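The transformations in this section can be sketched on a small hypothetical DataFrame. The rows are invented for illustration; the column names and the Time_of_Day mapping follow the description above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows; column names follow the feature list above.
df = pd.DataFrame({
    "Date": ["2023-01-15", "2023-07-04", "2023-07-20"],
    "Time_of_Day": ["morning", "evening", "day"],
    "Weather_Condition": ["Rain", "Clear", "Rain"],
    "Traffic_Lights": ["Yes", "No", "Yes"],
})

# Temporal features: month and day of month from the parsed date.
dates = pd.to_datetime(df["Date"])
df["Month"] = dates.dt.month
df["Day"] = dates.dt.day

# Time_of_Day labels mapped to representative hours.
df["Hour"] = df["Time_of_Day"].map(
    {"morning": 9, "afternoon": 15, "evening": 20, "day": 12}
)

# Label-encode a categorical column and keep the encoder for later reuse.
weather_encoder = LabelEncoder()
df["Weather_Condition"] = weather_encoder.fit_transform(df["Weather_Condition"])

# Yes/No flags become 1/0.
df["Traffic_Lights"] = df["Traffic_Lights"].map({"Yes": 1, "No": 0})

# The stored encoder records which integer stands for which label.
mapping = {label: int(code) for code, label in enumerate(weather_encoder.classes_)}
print(mapping)
```

Keeping the fitted encoder is what allows new records to be encoded with the same integer codes at prediction time.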

3. Data Preparation
The implementation utilizes a structured train-test split approach:
○ 80-20 split ratio (training-testing)
○ Stratification by target variable (Accident_Severity)
○ Feature scaling using StandardScaler
○ Preservation of scaler parameters for prediction pipeline
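A minimal sketch of the split-and-scale step, using randomly generated data as a stand-in for the accident features (the array shapes and class counts are assumptions made for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 samples, 3 features, imbalanced 3-class target.
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = np.array([0] * 60 + [1] * 30 + [2] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training split only, then reuse it for the test split.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Stratification keeps the 60/30/10 class proportions in both splits.
print(np.bincount(y_train), np.bincount(y_test))
```

Fitting the scaler on the training split alone avoids leaking test-set statistics into training, and the same fitted scaler can later be applied to new records.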

4. Model Configuration
The Random Forest Classifier is configured with the following hyperparameters:
○ 100 estimators (trees)
○ Maximum depth of 10 levels
○ Minimum samples split of 5
○ Minimum samples leaf of 2
○ Balanced class weights to handle potential class imbalance
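To make the effect of balanced class weights concrete, the sketch below computes them for a hypothetical imbalanced target and then builds a classifier with the hyperparameters listed above. The class counts are invented; scikit-learn's "balanced" mode uses n_samples / (n_classes * count_per_class).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced target: 60 minor, 30 moderate, 10 severe.
y = np.array([0] * 60 + [1] * 30 + [2] * 10)

# 'balanced' weights follow n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
print(weights)

# The same hyperparameters as listed above.
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=2,
    class_weight="balanced",
)
```

The rarest class receives the largest weight, so misclassifying a severe accident costs more during training than misclassifying a minor one.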

5. Training and Evaluation Pipeline
The evaluation process includes:
a. Model training on scaled features
b. Prediction generation on test set
c. Comprehensive performance metrics:
○ Classification report with precision, recall, and F1-score
○ Confusion matrix visualization
○ Feature importance analysis
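The metrics can be illustrated on a handful of hypothetical labels (the six true/predicted pairs below are invented and are not results from the accident dataset):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels for six accidents.
y_true = ["minor", "minor", "moderate", "moderate", "severe", "severe"]
y_pred = ["minor", "moderate", "moderate", "moderate", "severe", "minor"]
labels = ["minor", "moderate", "severe"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Per-class precision, recall and F1 as a dictionary.
report = classification_report(y_true, y_pred, labels=labels, output_dict=True)
print(report["severe"])
```

Reading the matrix row by row shows where each true class ends up, which is often more informative than overall accuracy when the classes are imbalanced.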

6. Visualization Components
The implementation includes several visualization elements:
a. Feature Importance Plot:
○ Bar chart of top 10 influential features
○ Importance scores based on Random Forest feature importance
b. Confusion Matrix Heatmap:
○ Visual representation of model predictions versus actual values
○ Color-coded for easy interpretation
○ Annotated with specific counts

7. Prediction System
The prediction pipeline implements:
a. Data Preprocessing:
○ Application of saved scalers and encoders
○ Feature alignment with training data
b. Prediction Generation:
○ Class prediction
○ Probability distribution across severity classes
○ Conversion of numerical predictions to original severity labels
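A minimal sketch of the prediction flow on toy data. The single feature, its values, and the small forest are assumptions chosen so the example runs quickly; they are not the script's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical training data: one feature that separates the classes cleanly.
X_train = np.array([[1.0], [1.2], [5.0], [5.3], [9.0], [9.4]])
severity = ["minor", "minor", "moderate", "moderate", "severe", "severe"]

encoder = LabelEncoder()
y_train = encoder.fit_transform(severity)

scaler = StandardScaler()
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(scaler.fit_transform(X_train), y_train)

# A new record goes through the same fitted scaler before prediction.
new_record = scaler.transform(np.array([[9.2]]))
predicted = encoder.inverse_transform(model.predict(new_record))[0]
probabilities = model.predict_proba(new_record)[0]

# Pair each probability with its original severity label.
prob_dist = dict(zip(encoder.classes_, probabilities))
print(predicted, prob_dist)
```

The order of the columns returned by predict_proba matches the order of the encoder's classes_, which is why the two can be zipped together directly.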

Below is the full code with additional comments embedded.

 import pandas as pd
 from sklearn.model_selection import train_test_split
 from sklearn.preprocessing import LabelEncoder, StandardScaler
 from sklearn.ensemble import RandomForestClassifier
 from sklearn.metrics import classification_report, confusion_matrix
 import seaborn as sns
 import matplotlib.pyplot as plt

 def load_data(file_path):
    """Load data from a CSV file with specific column handling."""
    try:
        # Read CSV with all columns as strings initially to prevent any parsing errors
        df = pd.read_csv(file_path, dtype=str)

        # Convert numeric columns to appropriate types
        numeric_columns = ['Latitude', 'Longitude', 'Involved_Vehicles',
                           'Injury_Count', 'Fatalities', 'Driver_Age']
        for col in numeric_columns:
            df[col] = pd.to_numeric(df[col], errors='coerce')

        print(f"Successfully loaded {len(df)} records from the dataset")
        return df
    except Exception as e:
        print(f"Error loading the CSV file: {e}")
        return None

 def preprocess_data(df):
    """Preprocess the data for model training."""
    # Create a copy to avoid modifying the original dataframe
    df = df.copy()

    # Convert Date to datetime
    df['Date'] = pd.to_datetime(df['Date'])

    # Extract time-based features
    df['Month'] = df['Date'].dt.month
    df['Day'] = df['Date'].dt.day

    # Map Time_of_Day directly to hours (no datetime conversion needed)
    time_mapping = {
        'morning': 9,
        'afternoon': 15,
        'evening': 20,
        'day': 12
    }
    df['Hour'] = df['Time_of_Day'].map(time_mapping)

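    # Categorical columns for encoding
    # (inferred from the non-numeric entries in the feature list below)
    categorical_columns = [
        'Road_Type', 'Weather_Condition', 'Traffic_Condition',
        'Day_of_Week', 'Vehicle_Type', 'Road_Condition'
    ]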

    # Initialize dictionary for label encoders
    label_encoders = {}

    # Encode categorical variables
    for column in categorical_columns:
        label_encoders[column] = LabelEncoder()
        df[column] = label_encoders[column].fit_transform(df[column])

    # Convert boolean columns to integer
    boolean_columns = ['Traffic_Lights', 'Alcohol_Involvement']
    for column in boolean_columns:
        df[column] = df[column].map({'Yes': 1, 'No': 0})

    # Define features for the model
    features = [
        'Latitude', 'Longitude', 'Month', 'Day', 'Hour',
        'Road_Type', 'Weather_Condition', 'Traffic_Condition',
        'Day_of_Week', 'Involved_Vehicles', 'Injury_Count',
        'Fatalities', 'Traffic_Lights', 'Alcohol_Involvement',
        'Driver_Age', 'Vehicle_Type', 'Road_Condition'
    ]

    # Define target variable
    target = 'Accident_Severity'

    # Encode target variable
    label_encoders[target] = LabelEncoder()
    df[target] = label_encoders[target].fit_transform(df[target])

    return df, features, target, label_encoders

 def train_model(df, features, target):
    """Train the Random Forest model."""
    # Prepare feature matrix and target vector
    X = df[features]
    y = df[target]

    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Scale the features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

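    # Train the model with class weight consideration
    # (hyperparameters as listed in the Model Configuration section)
    model = RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        min_samples_split=5,
        min_samples_leaf=2,
        class_weight='balanced'
    )
    model.fit(X_train_scaled, y_train)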

    return model, scaler, X_test_scaled, y_test, X_train, y_train

 def evaluate_model(model, X_test, y_test, feature_names, label_encoders):
    """Evaluate the model and display results."""
    # Make predictions
    y_pred = model.predict(X_test)

    # Get original class names
    severity_encoder = label_encoders['Accident_Severity']
    class_names = severity_encoder.classes_

    # Print classification report
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=class_names))

    # Calculate and display feature importance
    feature_importance = pd.DataFrame({
        'feature': feature_names,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)

    print("\nTop 10 Most Important Features:")
    print(feature_importance.head(10))

    # Plot feature importance
    plt.figure(figsize=(12, 6))
    sns.barplot(x='importance', y='feature', data=feature_importance.head(10))
    plt.title('Top 10 Most Important Features')

    # Plot confusion matrix
    plt.figure(figsize=(8, 6))
    cm = confusion_matrix(y_test, y_pred)
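    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=class_names, yticklabels=class_names)
    plt.title('Confusion Matrix')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.tight_layout()
    plt.show()  # displays every figure opened in this function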

    return feature_importance

 def predict_new_accident(model, scaler, new_data, features, label_encoders):
    """Make a prediction for a new accident."""
    # Prepare the new data
    new_data = new_data[features].copy()

    # Scale the features
    new_data_scaled = scaler.transform(new_data)

    # Make prediction
    prediction = model.predict(new_data_scaled)
    probabilities = model.predict_proba(new_data_scaled)

    # Convert numerical prediction back to original class name
    severity_encoder = label_encoders['Accident_Severity']
    prediction_class = severity_encoder.inverse_transform(prediction)

    # Get class names for probability distribution
    class_names = severity_encoder.classes_

    # Create probability distribution dictionary
    prob_dist = {class_name: prob for class_name, prob in zip(class_names, probabilities[0])}

    return prediction_class[0], prob_dist

 # Main execution
 if __name__ == "__main__":
    # File paths
    input_file = "accident_data.csv"  # Replace with your actual file path

    # Load the data
    df = load_data(input_file)

    if df is not None:
        # Preprocess data
        df, features, target, label_encoders = preprocess_data(df)

        # Train model
        model, scaler, X_test, y_test, X_train, y_train = train_model(df, features, target)

        # Evaluate model
        feature_importance = evaluate_model(model, X_test, y_test, features, label_encoders)

        # Example of prediction for a new accident
        # Use the first row of the preprocessed dataset as an example
        new_accident = df.iloc[[0]]
        prediction, probabilities = predict_new_accident(
            model, scaler, new_accident, features, label_encoders)

        print("\nExample Prediction:")
        print(f"Predicted Severity: {prediction}")
        print("\nProbability Distribution:")
        for severity, prob in probabilities.items():
            print(f"{severity}: {prob:.2%}")






