Road Repair Cost Prediction
This project aims to predict the total cost of road repairs using a Linear Regression model.
The process involves data preprocessing, training a machine learning model, evaluating its
performance, and visualizing the results.
Together, these steps show how data preprocessing, machine learning, and visualization
can be combined into a predictive model for road repair costs. Such a model can support
city planners and engineers in budget forecasting and resource allocation for road
maintenance projects.
Import Necessary Libraries
The code imports 'pandas' for data manipulation and analysis; 'numpy' for numerical
operations; 'matplotlib' for plotting graphs; 'seaborn' for statistical data visualization;
'sklearn.model_selection.train_test_split' to split the data into training and testing sets;
'sklearn.preprocessing.StandardScaler' to standardize features;
'sklearn.preprocessing.LabelEncoder' to encode categorical variables;
'sklearn.linear_model.LinearRegression' to fit the linear regression; and
'mean_squared_error' and 'r2_score' from 'sklearn.metrics' to evaluate the model's
performance.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
Load the Data
Load the dataset containing road repair data from a CSV file into a pandas DataFrame.
data = pd.read_csv("file_location")
Encode the Categorical Variables
Convert the categorical variables (Type and Condition) into integer codes, because the
regression model requires numerical inputs.
le = LabelEncoder()
data["Type"] = le.fit_transform(data["Type"])
data["Condition"] = le.fit_transform(data["Condition"])
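Reusing a single encoder for both columns works, but refitting it on Condition discards the mapping learned for Type. The following is a minimal sketch, not the author's method, that keeps one encoder per column so the label-to-code mapping can be inspected later. The toy rows and category values are hypothetical, since the dataset's actual values are not shown.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows; the real category values are not given in the text.
data = pd.DataFrame({
    "Type": ["Asphalt", "Concrete", "Asphalt", "Gravel"],
    "Condition": ["Poor", "Good", "Fair", "Poor"],
})

# One encoder per column, so each fitted mapping is retained.
encoders = {}
for column in ["Type", "Condition"]:
    encoder = LabelEncoder()
    data[column] = encoder.fit_transform(data[column])
    encoders[column] = encoder

# LabelEncoder assigns integer codes in sorted order of the class labels.
type_mapping = {label: code for code, label in enumerate(encoders["Type"].classes_)}
condition_mapping = {label: code for code, label in enumerate(encoders["Condition"].classes_)}

print(type_mapping)
print(condition_mapping)
```

Because the codes follow alphabetical order, a linear model will treat them as if they were ordered quantities. Whether that is acceptable depends on the data, and one-hot encoding is a common alternative when the categories have no natural order.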
Separate the Target Variable and Features
Target Variable: 'Total_Cost', which we want to predict.
Features: All other columns, used as inputs to the model.
y = data["Total_Cost"]
X = data.drop("Total_Cost", axis=1)
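Scale the Features
Standardize the features to zero mean and unit variance. For ordinary least-squares
regression this does not change the predictions, but it puts the features on a common
scale so that the fitted coefficients can be compared directly.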
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
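To make the transformation concrete, the sketch below reproduces what StandardScaler computes by hand: each column has its mean subtracted and is divided by its population standard deviation. The feature values are hypothetical and only serve as an illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix with two numeric columns; not taken from the dataset.
X = np.array([
    [100.0, 3.0],
    [250.0, 4.5],
    [400.0, 6.0],
    [550.0, 7.5],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Same computation by hand: subtract the column mean, then divide by the
# population standard deviation (ddof=0, which is NumPy's default).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print("Matches StandardScaler:", np.allclose(X_scaled, X_manual))
print("Column means after scaling:", X_scaled.mean(axis=0))
print("Column std devs after scaling:", X_scaled.std(axis=0))
```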
Split the Data into Training and Testing Sets
Training Set: 70% of the data, used to train the model.
Testing Set: 30% of the data, used to evaluate the model.
Random State: A fixed seed (42), which makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)
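In the code above, the scaler is fitted on the full dataset before the split, so the test rows contribute to the scaling mean and variance. A common alternative, sketched here as an assumption rather than as the author's approach, is to split first and wrap the scaler and the model in a scikit-learn Pipeline, so the scaler is fitted on the training rows only. The data is synthetic and the coefficients are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the road repair features and cost (hypothetical values).
rng = np.random.default_rng(42)
X = rng.uniform(0, 100, size=(200, 3))
y = 50.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1.0, size=200)

# Split the raw (unscaled) features first.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The pipeline fits the scaler on the training rows only, then applies the
# same training-set mean and variance when transforming the test rows.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)

print("Training rows:", len(X_train), "Testing rows:", len(X_test))
print("R-squared on the test set:", pipeline.score(X_test, y_test))
```

For plain linear regression the predictions come out the same either way, but the pipeline form stays correct if the model is later swapped for one that is sensitive to feature scale.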
Train and Evaluate the Model
Model Initialization: Create an instance of the Linear Regression model.
Model Training: Fit the model to the training data.
Predictions: Use the trained model to predict the Total_Cost on the testing set.
Mean Squared Error (MSE): Measure of the average squared difference between actual and
predicted values. Lower values indicate better performance.
R-squared Score: Indicates the proportion of variance in the dependent variable that is
predictable from the independent variables. Values closer to 1 indicate better performance.
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
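The two metrics can also be computed directly from their definitions, which may help when interpreting them. The sketch below uses a handful of hypothetical actual and predicted costs and checks the hand-computed values against scikit-learn. It also reports the root mean squared error (an addition not used in the original code), which is in the same units as Total_Cost.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted costs, for illustration only.
y_test = np.array([1000.0, 1500.0, 2000.0, 2500.0])
y_pred = np.array([1100.0, 1400.0, 2100.0, 2400.0])

# MSE: the mean of the squared residuals.
mse_manual = np.mean((y_test - y_pred) ** 2)

# R-squared: 1 minus the ratio of the residual sum of squares to the
# total sum of squares around the mean of the actual values.
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE: the square root of the MSE, in the same units as the target.
rmse = np.sqrt(mse_manual)

print("MSE (manual):", mse_manual, "MSE (sklearn):", mean_squared_error(y_test, y_pred))
print("R-squared (manual):", r2_manual, "R-squared (sklearn):", r2_score(y_test, y_pred))
print("RMSE:", rmse)
```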
Visualize the Model's Predictions
Scatter Plot: Plot actual vs. predicted values to visualize the model's performance.
Diagonal Line: The dashed line represents a perfect prediction. The closer the points are
to this line, the better the model's predictions.
plt.scatter(y_test, y_pred, alpha=0.5)
plt.xlabel("Actual Total Cost")
plt.ylabel("Predicted Total Cost")
plt.title("Actual vs. Predicted Total Cost of Road Repairs")
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], "k--", linewidth=2)
plt.show()
Below is the full code with additional comments embedded.
# Import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the data
data = pd.read_csv("file_location")

# Encode the categorical variables (Type and Condition)
le = LabelEncoder()
data["Type"] = le.fit_transform(data["Type"])
data["Condition"] = le.fit_transform(data["Condition"])

# Separate the target variable and features
y = data["Total_Cost"]
X = data.drop("Total_Cost", axis=1)

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split the data into training and testing sets
# Split the data (70% for training and 30% for testing)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)

# Train the model
# Initialize the model
model = LinearRegression()
# Fit the model to the training data
model.fit(X_train, y_train)

# Evaluate the model
# Make predictions on the test set
y_pred = model.predict(X_test)
# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
# Calculate the R-squared score
r2 = r2_score(y_test, y_pred)
print("R-squared Score:", r2)

# Visualize the model's predictions
# Create a plot for actual vs. predicted values
plt.scatter(y_test, y_pred, alpha=0.5)
plt.xlabel("Actual Total Cost")
plt.ylabel("Predicted Total Cost")
plt.title("Actual vs. Predicted Total Cost of Road Repairs")
# Dashed diagonal marks perfect predictions
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], "k--", linewidth=2)
plt.show()