Development of an Optimized Accident Prediction Model for Construction Safety Using Big Data Analysis
Abstract
Background and objective
The objective of this study was to develop an intuitive accident prediction model that can be readily applied at construction sites to effectively reduce the incidence of accidents through the analysis of construction accident data.
Methods
Construction accidents contribute significantly to the industry's overall accident rate. To construct the accident prediction model, accident types were categorized into fatalities, injuries, and material damages. A total of 24 factors across eight major variables were considered; these were identified during the first and second phases of data preprocessing applied to the construction accident big data. Machine learning techniques, specifically supervised learning and ensemble learning, were then employed to identify the optimal predictive model.
Results
Among the models tested, XGBoost emerged as the most effective due to its high balanced accuracy, even in the presence of class imbalance.
Conclusion
The implementation of the XGBoost accident prediction model, along with the feature importance results developed in this study, enables the prediction of accident types for specific tasks performed at construction sites. This predictive capability is expected to inform targeted accident prevention measures, such as enhancing safety protocols or adjusting work procedures based on the prediction outcomes.
Introduction
Background and Purpose of the Study
As of 2022, workers in the Korean construction industry comprise only 12.4% of the total workforce but account for 24.0% of all accident victims, indicating a risk more than twice as high as in other industries. The construction sector also has the highest fatality rate among all industries, representing 24.3% of total fatalities, highlighting the persistent occurrence of fatal safety accidents.
In response to these challenges, the government has prioritized national safety as a core policy objective, implementing various measures to reduce fatalities caused by industrial accidents. Among these measures is the 'Serious Accident Punishment Act,' effective since January 2022, which mandates a minimum of one year of imprisonment for business owners or managers who neglect safety measures, leading to serious accidents. Despite these policy efforts, the incidence of construction safety accidents has continued to rise. As shown in Fig. 1, the construction accident rate has not significantly decreased, underscoring the limitations of existing safety management methods.
This study therefore aims to develop an accident prediction model capable of proactively preventing accidents and supporting construction safety policies by analyzing big data on construction safety incidents. Through systematic analysis of construction accident data using big data and machine learning technologies, the study seeks to enhance on-site safety management by predicting the likelihood of accidents.
Scope and Method of the Study
In the second half of 2019, the Korean government established the Construction Safety Management Integrated Information System (CSI) to efficiently manage essential data, including construction accident statistics, and to facilitate the shared use of this information. This study utilizes construction accident data accumulated in the CSI from the second half of 2019 through 2023. Of the 22,997 recorded cases, 21,063 remained after handling missing values and were used for the big data analysis of construction accidents.
The research procedures for building an accident prediction model through big data analysis of construction accidents consist of four stages, as shown in Fig. 2.
Theoretical Review and Research Trends
The Korean government established the Construction Safety Management Integrated Information System (CSI) in the second half of 2019 to efficiently manage construction accident statistics and other safety-related data, promoting their shared use. The system requires construction project participants to report relevant incidents to the CSI whenever construction accidents occur.
Research utilizing CSI-related data has been applied in various ways, such as analyzing accident cases at small construction sites using data from the Korea Safety and Security Agency, managing safety at small and medium-sized building remodeling sites, implementing smart integrated safety control technologies, and enhancing safety management at construction sites. This study provides foundational data that can enhance the system's functionality by offering a broad range of predictive information on construction accidents, including accident types, causes, risk factors, and preventive measures, covering the prediction of accident types such as fatalities, injuries, and material damages.
Jun (2015) focused on introducing methodologies rather than new analytical algorithms, covering everything from the collection of big data to the construction of structured data, visualization, and the use of statistical analysis. Hola et al. (2017) conducted a correlation analysis between construction production values and the number of occupational accidents to identify factors influencing the accident rate in the construction industry. Heo et al. (2018) analyzed the frequency of accidents by type and derived preventive measures by examining risk factors associated with different types of construction site accidents, particularly at larger sites. The application of machine learning to reliability engineering and safety analysis was demonstrated by Xu and Saleh (2021), who provided examples of its use in various scenarios. Lee and Kim (2021) proposed a risk assessment plan for different construction types in road bridge work, using Monte Carlo simulation to analyze probabilistic risks in construction safety management. For the prediction and prevention of accidents at construction sites, Awolusi et al. (2022) proposed a safety performance measurement framework and a statistical modeling process. Choi et al. (2021) constructed a model for predicting the risk of fatal accidents at construction sites, utilizing machine learning based on statistical data from industrial accident fatalities. Yoon et al. (2021) conducted a text mining-based customized analysis of four accident types using construction accident data. Ashtari et al. (2022) employed an interpretable machine learning model that considers the potential relationships between risk factors to predict cost overruns in construction projects and evaluate related risks.
Construction Accident Big Data Analysis
Data Preprocessing
The construction accident big data used in this study consist of 22,997 raw records with 51 variables and 273 elements (categorical levels) accumulated in the CSI from the second half of 2019 to 2023. The first data preprocessing task involved compiling usable variables for training suitable models: formatting the data, handling missing values, and applying level constraints to categorical variables to prevent overfitting. The result was a dataset of 21,063 construction accident records comprising 19 variables and 26 elements, as detailed in Table 1.
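As a rough illustration of this first preprocessing pass, the sketch below drops records with missing values and collapses rare levels of categorical variables (the 'level constraints'). The column names, toy data, and 1% rarity cutoff are assumptions for illustration, not the study's actual rules.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the CSI records (hypothetical columns and frequencies).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "work_process": rng.choice(
        ["formwork", "rebar", "lifting", "misc_a", "misc_b"],
        size=1000, p=[0.5, 0.3, 0.19, 0.005, 0.005]),
    "damage_cost": rng.normal(size=1000),
})
df.loc[rng.choice(1000, size=40, replace=False), "damage_cost"] = np.nan

# Handle missing values (the study's 22,997 records reduced to 21,063).
df = df.dropna()

# Level constraint: collapse categorical levels below an assumed 1% cutoff.
for col in df.select_dtypes("object"):
    freq = df[col].value_counts(normalize=True)
    rare = freq[freq < 0.01].index
    df[col] = df[col].where(~df[col].isin(rare), "other")
```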
In the second phase of data preprocessing, the focus was on performing variable importance analysis to select the most influential variables from the valid set, ensuring efficient and systematic big data analysis. The procedures used for variable selection in big data analysis are critical for improving model performance by enhancing prediction accuracy, preventing overfitting, reducing data dimensionality, and conserving computational resources.
Common methods for variable importance analysis include Principal Component Analysis (PCA), Chi-Square Analysis, Correlation Analysis, Cramér's V, Feature Importance, and Clustering methods. Correlation Analysis was excluded from this study because it measures linear relationships between continuous variables, whereas the variables here are predominantly categorical; Cramér's V was excluded because it only indicates the strength of association between two categorical variables without directly measuring their impact on the dependent variable.
Evaluation and Selection of Variable Importance
In this study, various machine learning models were applied to analyze data and build prediction models aimed at improving prediction performance. The analysis of variable importance was crucial for identifying key predictive variables and selecting the optimal model.
(1) Principal Component Analysis
Principal Component Analysis (PCA) is a technique that identifies the key features of the data by calculating the covariance matrix from normalized data and finding the axes that maximize variance, thereby reducing data dimensionality. Typically, the first principal component explains the most variance, with subsequent components explaining progressively less. The appropriate number of principal components can be determined by identifying the point where the eigenvalues decrease sharply. In this study, only principal components with an eigenvalue of 1 or greater were retained, and variables were selected by reducing the data to eight dimensions, excluding components below the 'elbow point.' Visualizing the results for the principal components PC1 and PC2, which explain the largest share of variability in the data, shows arrows representing each variable based on the principal component scores; the length and direction of each arrow indicate how much the variable contributes to the principal components and suggest correlations among variables. To compare the contributions of all variables at a glance, variable importance was further derived using a heat map and bar chart, as shown in Fig. 3.
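A minimal sketch of this eigenvalue criterion using scikit-learn; the random matrix merely stands in for the 19 preprocessed variables, and retaining components with eigenvalue ≥ 1 follows the rule stated above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 19))  # stand-in for the 19 preprocessed variables

X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

# For standardized data, the explained variances are the eigenvalues of the
# correlation matrix; keep components with eigenvalue >= 1.
eigenvalues = pca.explained_variance_
n_keep = int(np.sum(eigenvalues >= 1.0))

scores = pca.transform(X_std)[:, :n_keep]  # reduced data (PC1, PC2, ...)
loadings = pca.components_[:n_keep].T      # per-variable contributions (biplot arrows)
print(f"components retained: {n_keep}")
```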
(2) Chi-Square Analysis
Chi-Square analysis is commonly used to assess the independence of two categorical variables by constructing a contingency table and calculating the Chi-Square statistic from the differences between observed and expected frequencies.
If the p-value calculated from the Chi-Square distribution is smaller than the significance level (typically 0.05), the null hypothesis of independence is rejected, indicating a relationship between the variables. In this study, 18 of the 19 variables, excluding the dependent variable, were confirmed to have significant relationships through Chi-Square tests. Based on these results, key variables were selected by sorting the variables in descending order of their Chi-Square statistics and log p-values, which reflect the strength of association between the variables, as shown in Fig. 4.
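This screening step can be sketched with SciPy as follows: cross-tabulate each candidate variable against the accident-type label, compute the Chi-Square statistic, and rank by statistic and log p-value. The column names and toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
df = pd.DataFrame({  # toy stand-in for the CSI records
    "work_process": rng.choice(["lifting", "welding", "formwork"], size=300),
    "accident_type": rng.choice(["fatality", "injury", "damage"], size=300),
})

# Contingency table of observed frequencies, then the Chi-Square test.
table = pd.crosstab(df["work_process"], df["accident_type"])
stat, p, dof, expected = chi2_contingency(table)
print(f"chi2={stat:.2f}, dof={dof}, -log10(p)={-np.log10(max(p, 1e-300)):.2f}")
```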
(3) Feature Importance Analysis
Feature importance was evaluated using the random forest model by training the data and calculating the importance of each variable based on the Mean Decrease in Accuracy and Mean Decrease in Gini values. The results are visualized in Fig. 5.
Mean Decrease in Accuracy indicates how much the model's accuracy decreases when each variable is removed, while Mean Decrease in Gini reflects the degree to which each variable contributes to reducing Gini impurity. Therefore, for both metrics, higher values imply that a variable is more important for classification. Variables were ranked based on their 'Mean Decrease in Accuracy' and 'Mean Decrease in Gini' values, as shown in Fig. 5, and Fig. 6 sorts the variables by Overall Importance, which is the standardized sum of the two metrics, to select key variables.
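The two measures can be approximated with scikit-learn as a hedged stand-in for the paper's random forest setup: permutation importance plays the role of Mean Decrease in Accuracy, and impurity-based importance corresponds to Mean Decrease in Gini, with overall importance as their standardized sum, as in Fig. 6. Data and parameters are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

mdg = rf.feature_importances_          # Mean Decrease in Gini analogue
mda = permutation_importance(          # Mean Decrease in Accuracy analogue
    rf, X_te, y_te, n_repeats=10, random_state=0).importances_mean

# Overall importance: standardized sum of the two metrics.
def z(v):
    return (v - v.mean()) / v.std()

overall = z(mdg) + z(mda)
print(np.argsort(overall)[::-1])       # feature indices ranked by importance
```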
(4) Clustering Analysis
Clustering is a statistical technique used to identify hidden (latent) groups or clusters within observed data. This method operates efficiently, even with large datasets. K-means clustering, which allows the direct specification of the number of clusters, was applied in this study.
Clustering divides the entire dataset into multiple subclusters with similar characteristics, facilitating the identification of the relative importance of variables. The number of clusters significantly affects the analysis results; having too few or too many clusters can reduce the effectiveness of the analysis. In this study, the number of clusters was determined to be four by assessing the AIC, BIC, and Entropy values to ensure model fit, satisfying the condition of having at least two variables in more than three clusters. The seven variables corresponding to the two clusters with effective model fit, indicated by lower AIC and BIC values, were selected, as shown in Fig. 7.
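A minimal K-means sketch under stated assumptions: the categorical variables are one-hot encoded (K-means needs numeric inputs), and because AIC/BIC/entropy are criteria from model-based clustering, the common inertia comparison below serves only as an illustrative stand-in for the paper's selection of four clusters.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
df = pd.DataFrame({  # toy categorical data standing in for the CSI variables
    "construction_type": rng.choice(list("ABCD"), size=400),
    "work_process": rng.choice(list("XYZ"), size=400),
})
X = pd.get_dummies(df).to_numpy(dtype=float)  # one-hot encoding (assumption)

# Compare candidate cluster counts; the study settles on k = 4.
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # lower inertia = tighter clusters

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
```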
As shown above, the key variables selected through principal component analysis, Chi-Square tests, feature importance, and clustering are the eight variables that overlapped most consistently across the methods, anchored on the Chi-Square results, as listed in Table 2. In order, these key variables are: personal accidents, damage costs, accident object (medium classification), accident object (large classification), construction cost, work process, construction type (medium classification), and number of workers. The agreement among the selection methods confirms that similar significant variables are identified when their influence on the dependent variable is measured directly.
Verification of Preprocessing
To assess the appropriateness of the final selected variables, a comparative verification was conducted using logistic regression, a representative modeling technique for quantifying the effect of independent variables on a dependent variable, which is particularly relevant for determining significant factors in major accident analysis. Initially, all variables were visualized, as depicted in Fig. 8. The visualization excluded non-significant variables, allowing an examination of how the odds of the dependent variable change with each independent variable, where the odds ratio is exp(β) and β is the coefficient estimate.
If the odds ratio exceeds 1, an increase in the independent variable raises the likelihood of the dependent variable's success; a ratio less than 1 suggests a decrease in success probability. For instance, a coefficient estimate of 0.5 implies that an increase in the independent variable raises the odds of success by approximately 1.65 times (exp(0.5) ≈ 1.65). Based on the extracted coefficient estimates and significance levels, the variables fell into three categories, and the verification confirmed that all variables within the effective range were appropriately included among the final selected key variables.
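A small sketch of this odds-ratio reading using statsmodels: fit a logistic regression and exponentiate the coefficients, so that exp(β) > 1 flags variables that raise the odds of the outcome. The synthetic data and coefficients are illustrative.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Outcome driven by the first two variables (true betas 0.5, -0.3, 0.0).
y = (X @ np.array([0.5, -0.3, 0.0]) + rng.normal(size=500) > 0).astype(int)

model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
odds_ratios = np.exp(model.params)  # exp(beta); ~1.65 for beta = 0.5
print(odds_ratios)
print(model.pvalues)                # significance of each coefficient
```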
Determination of Optimal Number of Factors
In the second phase of data preprocessing, key variables were selected, including personal accidents, damage costs, accident objects (medium and large categories), construction costs, work processes, construction type (medium category), and the number of workers. These variables were categorized into 6 to 24 elements (categories of categorical variables), referencing relevant laws, construction performance records, and standard unit costs from the first data preprocessing stage.
To enhance the interpretability of the accident prediction model, prevent overfitting, and improve generalizability, an analysis was conducted on the influence flow and resource allocation among variables with more than 15 elements, such as personal accidents, construction type (medium category), accident objects (medium category), work processes, and construction costs. This analysis, as illustrated in Fig. 9, led to the integration of similar categories and the consolidation of categories with low frequencies, resulting in models comprising 15, 18, and 21 elements.
Table 3 reviews the performance of models with varying numbers of elements using general machine learning models such as multinomial logistic regression, k-nearest neighbors, and decision trees. The results indicate that as the number of elements decreased from 24 to 21, 18, and 15, most performance indicators declined, including accuracy, the 95% confidence interval, p-value, kappa statistic, and average balanced accuracy. This suggests that further reducing the number of elements does not enhance model performance with the current dataset. Therefore, to effectively reduce the construction accident rate, a streamlined and intuitive accident prediction model was developed using the full set of 24 elements determined in the second data preprocessing stage.
Evaluation of Construction Safety Prediction Models
In the second data preprocessing task, we utilized a dataset consisting of 21,063 data points and the final eight selected variables to determine the optimal predictive analysis model. We conducted machine learning analyses using various algorithms: Multinomial Logistic Regression, k-NN (k-Nearest Neighbors), Decision Tree, SVM (Support Vector Machine), Bagging (Bootstrap Aggregating), Random Forest, and XGBoost (eXtreme Gradient Boosting). These models were categorized into supervised learning (Multinomial Logistic Regression, k-NN, Decision Tree, SVM) and ensemble learning (Bagging, Random Forest, XGBoost).
Initially, supervised learning models were trained on labeled training data, with predictions generated for new data. The predictive accuracy and performance were evaluated using confusion matrices that compared actual and predicted values, as shown in Table 4.
Based on balanced accuracy, a critical metric for evaluating performance on datasets with class imbalance, Multinomial Logistic Regression emerged as the best of the supervised learning models. To further validate the prediction results and performance of the ensemble learning models, we likewise generated confusion matrices comparing actual and predicted values; the performance metrics for these models are presented in Table 5.
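The comparison loop behind Tables 4 and 5 might look like the following sketch: train each candidate classifier, build a confusion matrix, and score balanced accuracy on a held-out set. The models shown are the supervised learners named above; the imbalanced synthetic data are a stand-in for the CSI dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Imbalanced three-class data, echoing fatalities / injuries / material damages.
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           weights=[0.7, 0.2, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "multinomial LR": LogisticRegression(max_iter=1000),
    "k-NN": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
}

for name, m in models.items():
    pred = m.fit(X_tr, y_tr).predict(X_te)
    cm = confusion_matrix(y_te, pred)  # actual vs. predicted counts
    print(name, round(balanced_accuracy_score(y_te, pred), 3))
```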
A comprehensive evaluation of all seven models indicated that XGBoost, despite the class imbalance, achieved the highest balanced accuracy, establishing it as the optimal predictive analysis model. XGBoost learns decision trees sequentially to correct errors, supports parallel processing for faster training, and incorporates regularization to prevent overfitting, making it effective across diverse datasets. The rationale for selecting XGBoost over other advanced techniques, such as deep learning, random forests, or recently developed reinforcement learning models, also merits emphasis. While deep learning models may offer higher accuracy in some contexts, they often require extensive computational resources and large amounts of data, which may not be feasible in real-world applications. XGBoost provides a good balance between predictive performance and computational efficiency, making it suitable for scenarios where quick decision-making is crucial; its ability to handle various types of data and its robustness against overfitting add to its appeal. The feature importance scores and visualization of the XGBoost model, illustrated in Fig. 10, provide insight into the features used by the model; by identifying each feature's contribution, we confirmed the relative importance of features to the model's predictive performance.
The features contributing most to the model's predictive performance, in order, were:
1. Personal accidents (accident type)
2. Damage costs
3. Construction costs
4. Accident objects (medium and large categories)
5. Work processes
6. Construction type (medium category)
7. Number of workers
The importance of these features is summarized in Table 6.
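A hedged sketch of how the ranking in Fig. 10 and Table 6 could be reproduced with the xgboost package; the generated features f0..f7 merely stand in for the study's eight encoded variables.

```python
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_classes=3, n_features=8,
                           n_informative=8, n_redundant=0, random_state=0)

clf = xgb.XGBClassifier(n_estimators=300, max_depth=6, eval_metric="mlogloss")
clf.fit(X, y)

# Gain-based importance per feature (f0..f7 stand in for the eight variables).
importance = clf.get_booster().get_score(importance_type="gain")
for name, gain in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(name, round(gain, 2))
```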
Accident Prediction Model Selection
In this study, the objective function of the XGBoost model, identified as the optimal construction accident prediction model, comprises two components: the logistic loss function (log loss), which measures prediction accuracy, and a regularization term that penalizes model complexity to prevent overfitting. The model learns data patterns to minimize this objective function:

$$\mathrm{Obj} = \sum_{i=1}^{n} \ell(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \qquad \ell(y_i, \hat{y}_i) = -\bigl[\, y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \,\bigr]$$

where:
n : total number of data points
i : index of the current data point
y_i : actual label of the i-th data point
ŷ_i : model's predicted probability for the i-th data point
K : total number of trees used in the model
k : index of the current tree
f_k : individual tree corresponding to index k
Ω(f_k) : complexity of the k-th tree, usually defined by tree depth or the number of leaf nodes
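To connect the objective above to practice, the sketch below maps its terms onto xgboost's parameters: "binary:logistic" selects the log-loss term ℓ, while max_depth, gamma, and reg_lambda shape the complexity penalty Ω(f_k). The settings and synthetic data are illustrative, not the study's configuration.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)

model = xgb.XGBClassifier(
    objective="binary:logistic",  # logistic loss l(y_i, y_hat_i)
    n_estimators=200,             # K: number of trees f_k learned sequentially
    max_depth=4,                  # bounds tree depth inside Omega(f_k)
    gamma=1.0,                    # penalty per additional leaf node
    reg_lambda=1.0,               # L2 penalty on leaf weights
)
model.fit(X, y)
y_hat = model.predict_proba(X[:5])[:, 1]  # predicted probabilities y_hat_i
```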
While the XGBoost model demonstrates strong predictive capabilities, its real-time application in the field depends on an appropriate technical infrastructure. Implementing the model in a real-time environment requires a hardware and software setup capable of processing data efficiently; for example, cloud-based solutions can provide the necessary computational resources and scalability, and integrating the predictive model with IoT sensors and safety management systems can enable real-time data collection and analysis, enhancing its effectiveness in predicting construction accidents.
When predicting new data with the XGBoost model, the learned function f generates the predicted value f(x_i) from the input variable vector of a new task. Although the model's structure is fixed after training, periodic data updates and algorithm reviews are necessary given the evolving nature of big data analysis.
Finally, although the XGBoost model remained highly accurate despite the class imbalance, explicitly addressing the imbalance, through techniques such as oversampling, undersampling, or SMOTE (Synthetic Minority Over-sampling Technique), could further improve performance. Evaluating the resulting performance changes after applying these methods would provide deeper insight into their effectiveness and help ensure that the predictive model is robust and reliable.
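As one concrete way to probe the class-imbalance point above, the sketch uses imbalanced-learn's SMOTE on numeric features; for largely categorical inputs like the CSI data, the SMOTEN/SMOTENC variants would be the closer fit. Whether resampling actually improves this model is left open, as the text notes.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Skewed two-class toy data (90% / 10%).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# Synthesize minority-class samples until the classes are balanced.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))
```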
Conclusion
This study developed a construction accident prediction model using data from the Construction Safety Management Integrated Information System (CSI), spanning the second half of 2019 through 2023. The process involved data preprocessing, key variable selection, and the application of various machine learning techniques. During data preprocessing, usable variables were compiled, missing values were addressed, and level constraints for categorical variables were set to avoid overfitting. Major variables were selected using principal component analysis, chi-square analysis, feature importance analysis, and clustering.
The final selected variables comprised eight key factors: personal accidents, damage costs, accident objects (medium and large categories), construction costs, work processes, construction type (medium category), and the number of workers. Using these key variables, machine learning analyses were conducted with supervised and ensemble learning approaches. The XGBoost model was identified as the optimal prediction model due to its high balanced accuracy despite class imbalance.
Moreover, to ensure the model’s applicability in real-world scenarios, future research should focus on the technical infrastructure required for real-time applications, including the integration with IoT sensors and safety management systems. It is also essential to address class imbalance using techniques like oversampling or SMOTE to enhance predictive performance.
As a result, specific interventions, such as enhanced safety measures or adjusted work procedures, can be implemented based on the prediction results to prevent accidents. Future research should also emphasize updating data and improving model performance, applying the model across various construction sites to verify its effectiveness. Additionally, continuous exploration of practical applications for the model in real construction projects is crucial, considering changing work environments and site conditions.