## Road Repair Cost Prediction

This project aims to predict the total cost of road repairs using a Linear Regression model.
The process involves data preprocessing, training a machine learning model, evaluating its
performance, and visualizing the results.

Together, these steps show how data preprocessing, machine learning, and visualization can
be combined into a predictive model for road repair costs. The resulting cost estimates can
help city planners and engineers with budget forecasting and resource allocation for road
maintenance projects.

**Import Necessary Libraries**

The code uses 'pandas' for data manipulation and analysis; 'numpy' for numerical
operations; 'matplotlib' for plotting graphs; 'seaborn' for statistical data visualization;
'sklearn.model_selection.train_test_split' to split the data into training and testing sets;
'sklearn.preprocessing.StandardScaler' to standardize features;
'sklearn.preprocessing.LabelEncoder' to encode categorical variables;
'sklearn.linear_model.LinearRegression' to fit the linear regression; and 'sklearn.metrics'
to evaluate the model's performance.

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler, LabelEncoder
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score

**Load the Data**

Load the dataset containing road repair data from a CSV file into a pandas DataFrame.

    # "file_location" is a placeholder; replace it with the path to the CSV file
    data = pd.read_csv("file_location")

**Encode the Categorical Variables**

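Convert the categorical columns ('Type' and 'Condition') into integer codes, because the
linear regression model requires numerical inputs. LabelEncoder assigns each distinct
category an integer, numbered in sorted order of the category values.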

    le = LabelEncoder()
    data["Type"] = le.fit_transform(data["Type"])
    data["Condition"] = le.fit_transform(data["Condition"])
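Label encoding produces arbitrary integer codes, which a linear model treats as ordered magnitudes. When the categories have no natural order, one-hot encoding is a common alternative. The sketch below is illustrative only: the category values and the `sample` frame are hypothetical and are not taken from the project's dataset.

```python
import pandas as pd

# Hypothetical sample rows; the category values are illustrative only.
sample = pd.DataFrame({
    "Type": ["Asphalt", "Concrete", "Asphalt", "Gravel"],
    "Condition": ["Poor", "Fair", "Good", "Poor"],
    "Total_Cost": [1200.0, 3400.0, 800.0, 1500.0],
})

# One indicator column per category. drop_first=True drops one category per
# column to avoid perfectly collinear columns when the model fits an intercept.
encoded = pd.get_dummies(
    sample, columns=["Type", "Condition"], drop_first=True, dtype=int
)

print(encoded.columns.tolist())
```

Each remaining indicator column then gets its own coefficient, measured relative to the dropped category.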

**Separate the Target Variable and Features**

Target Variable: 'Total_Cost', which we want to predict. Features: All other columns used as
inputs to the model.

    y = data["Total_Cost"]
    X = data.drop("Total_Cost", axis=1)

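**Scale the Features**

Standardize the features to zero mean and unit variance. For ordinary least squares this
generally does not change the predictions, but it puts the coefficients on a comparable
scale and matters for regularized or gradient-based models.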

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
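To make the transformation concrete, the sketch below reproduces standardization by hand with NumPy. `X_demo` and its values are hypothetical and stand in for the project's feature matrix.

```python
import numpy as np

# Hypothetical feature matrix: two columns on very different scales.
X_demo = np.array([
    [100.0, 1.0],
    [200.0, 3.0],
    [300.0, 5.0],
])

# Column-wise mean and population standard deviation (ddof=0),
# which is the convention StandardScaler uses.
col_mean = X_demo.mean(axis=0)
col_std = X_demo.std(axis=0)

# Subtract the mean and divide by the standard deviation, column by column.
X_demo_scaled = (X_demo - col_mean) / col_std

print(X_demo_scaled.mean(axis=0))  # approximately 0 for each column
print(X_demo_scaled.std(axis=0))   # approximately 1 for each column
```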

**Split the Data into Training and Testing Sets**

Training Set: 70% of the data, used to train the model. Testing Set: 30% of the data, used
to evaluate the model. Random State: ensures reproducibility of the results.

    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
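Because the scaler in this walkthrough is fit on the full dataset before the split, the test rows contribute to the scaling statistics. One way to avoid that, sketched here as an alternative rather than as the project's method, is to split first and let a scikit-learn pipeline fit the scaler on the training rows only. The synthetic data and the names `X_demo`, `y_demo`, and `pipe` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 100 rows, 3 numeric features,
# and a target that is a noisy linear function of them.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Split first, so the test rows stay unseen during preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The pipeline fits the scaler on the training rows only and reuses
# those statistics when transforming the test rows.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

print("Test R-squared:", pipe.score(X_te, y_te))
```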

**Train and Evaluate the Model**

Model Initialization: Create an instance of the Linear Regression model. Model Training: Fit
the model to the training data.

Predictions: Use the trained model to predict the Total_Cost on the testing set.

Mean Squared Error (MSE): Measure of the average squared difference between actual and
predicted values. Lower values indicate better performance.

R-squared Score: Indicates the proportion of variance in the dependent variable that is
predictable from the independent variables. Values closer to 1 indicate better performance.

    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
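The two metrics can also be computed by hand, which may help when interpreting them. The sketch below uses hypothetical actual and predicted costs (`y_true`, `y_hat`), not project results.

```python
import numpy as np

# Hypothetical actual and predicted costs, for illustration only.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 380.0])

# MSE: mean of the squared residuals.
residuals = y_true - y_hat
mse_manual = np.mean(residuals ** 2)

# R-squared: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MSE:", mse_manual)
print("RMSE:", np.sqrt(mse_manual))
print("R-squared:", r2_manual)
```

MSE is expressed in squared cost units, so its square root (RMSE) is often easier to read because it is in the same units as 'Total_Cost'.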

**Visualize the Model's Predictions**

Scatter Plot: Plot actual vs. predicted values to visualize the model's performance.
Diagonal Line: The dashed line represents a perfect prediction. The closer the points are to
this line, the better the model's predictions.

    plt.scatter(y_test, y_pred, alpha=0.5)
    plt.xlabel("Actual Total Cost")
    plt.ylabel("Predicted Total Cost")
    plt.title("Actual vs. Predicted Total Cost of Road Repairs")
    plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], "k--", linewidth=2)
    plt.show()

Below is the full code with additional comments embedded.

    # Import the necessary libraries
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler, LabelEncoder
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score

    # Load the data
    data = pd.read_csv("file_location")

    # Encode the categorical variables (Type and Condition)
    le = LabelEncoder()
    data["Type"] = le.fit_transform(data["Type"])
    data["Condition"] = le.fit_transform(data["Condition"])

    # Separate the target variable and features
    y = data["Total_Cost"]
    X = data.drop("Total_Cost", axis=1)

    # Scale the features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Split the data into training and testing sets
    # Split the data (70% for training and 30% for testing)
    X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

    # Train the model
    # Initialize the model
    model = LinearRegression()
    # Train the model
    model.fit(X_train, y_train)

    # Evaluate the model
    # Make predictions on the test set
    y_pred = model.predict(X_test)
    # Calculate the mean squared error
    mse = mean_squared_error(y_test, y_pred)
    print("Mean Squared Error:", mse)
    # Calculate the R-squared score
    r2 = r2_score(y_test, y_pred)
    print("R-squared Score:", r2)

    # Visualize the model's predictions
    # Create a plot for actual vs. predicted values
    plt.scatter(y_test, y_pred, alpha=0.5)
    plt.xlabel("Actual Total Cost")
    plt.ylabel("Predicted Total Cost")
    plt.title("Actual vs. Predicted Total Cost of Road Repairs")
    plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], "k--", linewidth=2)
    plt.show()