Traffic Accident Prediction
This Python script analyzes and predicts traffic accident severity using machine learning.
It processes accident records, trains a predictive model, evaluates its performance, and
reports which features contribute most to the predictions. A Random Forest classifier
uses temporal, spatial, environmental, and driver-related features to predict the
accident severity level (minor, moderate, severe).
Overview:
1. Data Input and Initial Processing
The implementation begins with raw CSV data containing accident records with multiple
features:
○ Temporal features (Date, Time_of_Day)
○ Spatial coordinates (Latitude, Longitude)
○ Environmental conditions (Weather_Condition, Road_Condition)
○ Traffic-related features (Traffic_Condition, Traffic_Lights)
○ Vehicle and driver information (Vehicle_Type, Driver_Age)
○ Accident outcomes (Injury_Count, Fatalities)
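The initial data loading phase calls pandas' read_csv with dtype=str, so every column is
first read as text and no cell can cause a parsing error. The numeric columns (Latitude,
Longitude, Involved_Vehicles, Injury_Count, Fatalities, Driver_Age) are then converted
with pd.to_numeric(errors='coerce'), which turns unparseable values into NaN.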
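The loading approach described above can be sketched on a tiny in-memory CSV. This is a minimal illustration, not the script's own loader: the two sample rows and the "unknown" age value are hypothetical, and only three of the numeric columns are shown.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the real accident CSV.
csv_text = """Latitude,Longitude,Driver_Age,Weather_Condition
40.71,-74.00,34,Rain
41.88,-87.63,unknown,Clear
"""

# Read every column as a string so malformed cells cannot break parsing.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Convert selected columns to numbers; unparseable values become NaN.
numeric_columns = ["Latitude", "Longitude", "Driver_Age"]
for col in numeric_columns:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Count the rows whose Driver_Age could not be parsed.
missing_ages = int(df["Driver_Age"].isna().sum())
print(df.dtypes)
print(f"Rows with unparseable Driver_Age: {missing_ages}")
```

Counting the coerced NaN values right after loading is one way to see how many records may need cleaning before training.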
2. Feature Engineering
The preprocessing stage implements several key transformations:
a. Temporal Feature Extraction:
○ Date parsing to extract month and day
○ Time_of_Day mapping to numerical hours (morning→9, afternoon→15, evening→20, day→12)
○ Day_of_Week encoding
b. Categorical Variable Encoding:
○ Implementation of LabelEncoder for categorical features
○ Binary encoding for boolean features (Traffic_Lights, Alcohol_Involvement)
○ Preservation of encoding mappings for future predictions
c. Feature Selection:
○ Temporal components
○ Geographical coordinates
○ Environmental conditions
○ Traffic parameters
○ Vehicle and driver characteristics
○ Accident statistics
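The transformations in this section can be tried on a few rows in isolation. The sketch below assumes three hypothetical records and encodes only one categorical column; the column names and the time-of-day mapping follow the description above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample rows; column names follow the feature list above.
df = pd.DataFrame({
    "Date": ["2023-01-15", "2023-07-04", "2023-12-31"],
    "Time_of_Day": ["morning", "evening", "afternoon"],
    "Weather_Condition": ["Rain", "Clear", "Rain"],
    "Traffic_Lights": ["Yes", "No", "Yes"],
})

# Temporal features: month and day from the date, hour from the time-of-day label.
dates = pd.to_datetime(df["Date"])
df["Month"] = dates.dt.month
df["Day"] = dates.dt.day
time_mapping = {"morning": 9, "afternoon": 15, "evening": 20, "day": 12}
df["Hour"] = df["Time_of_Day"].map(time_mapping)

# Categorical encoding: keep the fitted encoder so the mapping can be reused later.
weather_encoder = LabelEncoder()
df["Weather_Condition"] = weather_encoder.fit_transform(df["Weather_Condition"])
weather_mapping = {str(label): code for code, label in enumerate(weather_encoder.classes_)}

# Boolean encoding: Yes/No becomes 1/0.
df["Traffic_Lights"] = df["Traffic_Lights"].map({"Yes": 1, "No": 0})

print(df[["Month", "Day", "Hour", "Weather_Condition", "Traffic_Lights"]])
print(weather_mapping)
```

LabelEncoder assigns codes in sorted label order, so the preserved mapping is what lets a later record with "Rain" receive the same code it had during training.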
3. Data Preparation
The implementation utilizes a structured train-test split approach:
○ 80-20 split ratio (training-testing)
○ Stratification by target variable (Accident_Severity)
○ Feature scaling using StandardScaler
○ Preservation of scaler parameters for prediction pipeline
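A small sketch of the split-and-scale step, assuming a hypothetical imbalanced dataset of 100 random records with two classes. It illustrates two points from the list above: stratification keeps the class proportions equal in both subsets, and the scaler is fitted on the training subset only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical imbalanced dataset: 80 records of class 0 and 20 of class 1.
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = np.array([0] * 80 + [1] * 20)

# 80-20 split, stratified so both subsets keep the 80/20 class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training data only, then reuse it for the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

train_share = y_train.mean()  # fraction of class 1 in the training subset
test_share = y_test.mean()    # fraction of class 1 in the test subset
print(f"Class 1 share: train={train_share:.2f}, test={test_share:.2f}")
```

The fitted scaler object holds the training means and standard deviations, which is why it has to be kept for the prediction step.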
4. Model Configuration
The Random Forest Classifier is configured with the following fixed hyperparameters (random_state=42 for reproducibility):
○ 100 estimators (trees)
○ Maximum depth of 10 levels
○ Minimum samples split of 5
○ Minimum samples leaf of 2
○ Balanced class weights to handle potential class imbalance
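To make the "balanced" class weights concrete, the sketch below computes them for a hypothetical 80/20 label vector using scikit-learn's compute_class_weight, and then builds a classifier with the parameters listed above. The label vector is an assumption for illustration; the weights for the real dataset depend on its class counts.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical label vector: 80 records of class 0 and 20 of class 1.
y = np.array([0] * 80 + [1] * 20)
classes = np.unique(y)

# "balanced" weights are n_samples / (n_classes * count_of_class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weights = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weights)

# Classifier configured with the parameters listed in this section.
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=2,
    class_weight="balanced",
    random_state=42,
)
```

The rarer class receives the larger weight (here 100 / (2 * 20) = 2.5 versus 100 / (2 * 80) = 0.625), so errors on it count more during training.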
5. Training and Evaluation Pipeline
The evaluation process includes:
a. Model training on scaled features
b. Prediction generation on test set
c. Comprehensive performance metrics:
○ Classification report with precision, recall, and F1-score
○ Confusion matrix visualization
○ Feature importance analysis
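The metrics in step c can be read more easily on a hand-sized example. The labels below are hypothetical and chosen only so the confusion matrix can be checked by eye; they are not results from the accident dataset.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted severity labels for six accidents.
class_names = ["minor", "moderate", "severe"]
y_true = ["minor", "minor", "moderate", "severe", "severe", "moderate"]
y_pred = ["minor", "moderate", "moderate", "severe", "minor", "moderate"]

# Per-class precision, recall, and F1-score.
report = classification_report(y_true, y_pred, labels=class_names, zero_division=0)
print(report)

# Rows are true classes, columns are predicted classes, in class_names order.
cm = confusion_matrix(y_true, y_pred, labels=class_names)
print(cm)
```

In this toy matrix the first row shows that one of the two minor accidents was predicted as moderate, and the last row shows that one severe accident was predicted as minor.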
6. Visualization Components
The implementation includes several visualization elements:
a. Feature Importance Plot:
○ Bar chart of top 10 influential features
○ Importance scores based on Random Forest feature importance
b. Confusion Matrix Heatmap:
○ Visual representation of model predictions versus actual values
○ Color-coded for easy interpretation
○ Annotated with specific counts
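Both plots can be sketched without the trained model. The example below uses plain matplotlib rather than seaborn to keep it small, selects a non-interactive backend so it runs without a display, and works on hypothetical importance scores and confusion-matrix counts.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical importance scores and confusion-matrix counts, for illustration only.
feature_names = ["Injury_Count", "Driver_Age", "Hour", "Latitude"]
importances = np.array([0.40, 0.15, 0.35, 0.10])
class_names = ["minor", "moderate", "severe"]
cm = np.array([[50, 5, 1], [8, 30, 4], [2, 6, 20]])

fig, (ax_imp, ax_cm) = plt.subplots(1, 2, figsize=(12, 5))

# Bar chart: sort importances in descending order and keep the top N features.
top_n = 3
order = np.argsort(importances)[::-1][:top_n]
top_features = [feature_names[i] for i in order]
bars = ax_imp.barh(top_features, importances[order])
ax_imp.invert_yaxis()  # most important feature at the top
ax_imp.set_title(f"Top {top_n} Most Important Features")
ax_imp.set_xlabel("importance")

# Heatmap: color encodes the count, and each cell is annotated with its value.
ax_cm.imshow(cm, cmap="Blues")
ax_cm.set_xticks(range(len(class_names)))
ax_cm.set_xticklabels(class_names)
ax_cm.set_yticks(range(len(class_names)))
ax_cm.set_yticklabels(class_names)
for row in range(cm.shape[0]):
    for col in range(cm.shape[1]):
        ax_cm.text(col, row, str(cm[row, col]), ha="center", va="center")
ax_cm.set_title("Confusion Matrix")
ax_cm.set_xlabel("Predicted Label")
ax_cm.set_ylabel("True Label")

fig.tight_layout()
```

To view the result, save the figure with fig.savefig and a file name of your choice, or drop the backend line and call plt.show() in an interactive session.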
7. Prediction System
The prediction pipeline implements:
a. Data Preprocessing:
○ Application of the saved scaler to input rows that are already encoded as in training
○ Feature alignment with training data
b. Prediction Generation:
○ Class prediction
○ Probability distribution across severity classes
○ Conversion of numerical predictions to original severity labels
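The prediction steps above can be sketched end to end on a toy model. Everything here is an assumption for illustration: a single made-up feature that separates the three classes cleanly, nine training rows, and a small forest of 20 trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy data: one feature that separates the three classes cleanly.
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [20.0], [21.0], [22.0]])
labels = ["minor"] * 3 + ["moderate"] * 3 + ["severe"] * 3

# Fit the encoder, scaler, and model; all three are kept for prediction time.
encoder = LabelEncoder()
y_train = encoder.fit_transform(labels)
scaler = StandardScaler().fit(X_train)
model = RandomForestClassifier(n_estimators=20, random_state=42)
model.fit(scaler.transform(X_train), y_train)

# New record: apply the saved scaler, then predict class and probabilities.
new_record = np.array([[21.5]])
new_scaled = scaler.transform(new_record)
predicted_code = model.predict(new_scaled)
probabilities = model.predict_proba(new_scaled)[0]

# Convert the numeric class back to its label and name each probability.
predicted_label = str(encoder.inverse_transform(predicted_code)[0])
prob_dist = {str(name): float(p) for name, p in zip(encoder.classes_, probabilities)}

print(f"Predicted severity: {predicted_label}")
for severity, prob in prob_dist.items():
    print(f"{severity}: {prob:.2%}")
```

Pairing encoder.classes_ with the predict_proba columns works here because the model was trained on the encoder's integer codes, so both list the classes in the same order.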
Below is the full code with additional comments embedded.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt


def load_data(file_path):
    """
    Load data from a CSV file with specific column handling
    """
    try:
        # Read CSV with all columns as strings initially to prevent any parsing errors
        df = pd.read_csv(file_path, dtype=str)

        # Convert numeric columns to appropriate types (invalid values become NaN)
        numeric_columns = ['Latitude', 'Longitude', 'Involved_Vehicles',
                           'Injury_Count', 'Fatalities', 'Driver_Age']
        for col in numeric_columns:
            df[col] = pd.to_numeric(df[col], errors='coerce')

        print(f"Successfully loaded {len(df)} records from the dataset")
        return df
    except Exception as e:
        print(f"Error loading the CSV file: {e}")
        return None


def preprocess_data(df):
    """
    Preprocess the data for model training
    """
    # Create a copy to avoid modifying the original dataframe
    df = df.copy()

    # Convert Date to datetime
    df['Date'] = pd.to_datetime(df['Date'])

    # Extract time-based features
    df['Month'] = df['Date'].dt.month
    df['Day'] = df['Date'].dt.day

    # Map Time_of_Day directly to hours (no datetime conversion needed)
    time_mapping = {
        'morning': 9,
        'afternoon': 15,
        'evening': 20,
        'day': 12
    }
    df['Hour'] = df['Time_of_Day'].map(time_mapping)

    # Categorical columns for encoding
    categorical_columns = [
        'Road_Type', 'Weather_Condition', 'Traffic_Condition',
        'Time_of_Day', 'Day_of_Week', 'Vehicle_Type', 'Road_Condition'
    ]

    # Initialize dictionary for label encoders
    label_encoders = {}

    # Encode categorical variables
    for column in categorical_columns:
        label_encoders[column] = LabelEncoder()
        df[column] = label_encoders[column].fit_transform(df[column])

    # Convert boolean columns to integer
    boolean_columns = ['Traffic_Lights', 'Alcohol_Involvement']
    for column in boolean_columns:
        df[column] = df[column].map({'Yes': 1, 'No': 0})

    # Define features for the model
    features = [
        'Latitude', 'Longitude', 'Month', 'Day', 'Hour',
        'Road_Type', 'Weather_Condition', 'Traffic_Condition',
        'Day_of_Week', 'Involved_Vehicles', 'Injury_Count',
        'Fatalities', 'Traffic_Lights', 'Alcohol_Involvement',
        'Driver_Age', 'Vehicle_Type', 'Road_Condition'
    ]

    # Define target variable
    target = 'Accident_Severity'

    # Encode target variable
    label_encoders[target] = LabelEncoder()
    df[target] = label_encoders[target].fit_transform(df[target])

    return df, features, target, label_encoders


def train_model(df, features, target):
    """
    Train the Random Forest model
    """
    # Prepare feature matrix and target vector
    X = df[features]
    y = df[target]

    # Split the data (80% training, 20% testing, stratified by severity)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Scale the features (fit on the training data only)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Train the model with class weight consideration
    model = RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        min_samples_split=5,
        min_samples_leaf=2,
        class_weight='balanced',
        random_state=42
    )
    model.fit(X_train_scaled, y_train)

    return model, scaler, X_test_scaled, y_test, X_train, y_train


def evaluate_model(model, X_test, y_test, feature_names, label_encoders):
    """
    Evaluate the model and display results
    """
    # Make predictions
    y_pred = model.predict(X_test)

    # Get original class names
    severity_encoder = label_encoders['Accident_Severity']
    class_names = severity_encoder.classes_

    # Print classification report
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=class_names))

    # Calculate and display feature importance
    feature_importance = pd.DataFrame({
        'feature': feature_names,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)

    print("\nTop 10 Most Important Features:")
    print(feature_importance.head(10))

    # Plot feature importance
    plt.figure(figsize=(12, 6))
    sns.barplot(x='importance', y='feature', data=feature_importance.head(10))
    plt.title('Top 10 Most Important Features')
    plt.tight_layout()
    plt.show()

    # Plot confusion matrix
    plt.figure(figsize=(8, 6))
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=class_names, yticklabels=class_names)
    plt.title('Confusion Matrix')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.tight_layout()
    plt.show()

    return feature_importance


def predict_new_accident(model, scaler, new_data, features, label_encoders):
    """
    Make prediction for a new accident
    """
    # Prepare the new data (must already be encoded the same way as the training data)
    new_data = new_data[features].copy()

    # Scale the features
    new_data_scaled = scaler.transform(new_data)

    # Make prediction
    prediction = model.predict(new_data_scaled)
    probabilities = model.predict_proba(new_data_scaled)

    # Convert numerical prediction back to original class name
    severity_encoder = label_encoders['Accident_Severity']
    prediction_class = severity_encoder.inverse_transform(prediction)

    # Get class names for probability distribution
    class_names = severity_encoder.classes_

    # Create probability distribution dictionary
    prob_dist = {class_name: prob
                 for class_name, prob in zip(class_names, probabilities[0])}

    return prediction_class[0], prob_dist


# Main execution
if __name__ == "__main__":
    # File paths
    input_file = "accident_data.csv"  # Replace with your actual file path

    # Load the data
    df = load_data(input_file)

    if df is not None:
        # Preprocess data
        df, features, target, label_encoders = preprocess_data(df)

        # Train model
        model, scaler, X_test, y_test, X_train, y_train = train_model(df, features, target)

        # Evaluate model
        feature_importance = evaluate_model(model, X_test, y_test, features, label_encoders)

        # Example of prediction for a new accident
        # Use the first row of the preprocessed dataset as an example
        new_accident = df.iloc[[0]]
        prediction, probabilities = predict_new_accident(
            model, scaler, new_accident, features, label_encoders)

        print("\nExample Prediction:")
        print(f"Predicted Severity: {prediction}")
        print("\nProbability Distribution:")
        for severity, prob in probabilities.items():
            print(f"{severity}: {prob:.2%}")