Development of an Optimized Accident Prediction Model for Construction Safety Using Big Data Analysis

Article information

J. People Plants Environ. 2024;27(5):379-391
Publication date (electronic) : 2024 October 31
doi : https://doi.org/10.11628/ksppe.2024.27.5.379
1General manager, Disaster Safety Headquarters, Korea Authority of Land and Infrastructure Safety, Jinju 52352, Republic of Korea
2Professor, Department of Civil and Infrastructure Engineering, Gyeongsang National University, Jinju 52725, Republic of Korea
*Corresponding author: Chang-Hak Kim, chking@gnu.ac.kr
First author: Jun-Kue Heo, hjk@kalis.or.kr
Received 2024 September 9; Revised 2024 September 27; Accepted 2024 October 14.

Abstract

Background and objective

The objective of this study was to develop an intuitive accident prediction model that can be readily applied at construction sites to effectively reduce the incidence of accidents through the analysis of construction accident data.

Methods

Accident types were categorized into fatalities, injuries, and material damages, which together account for a large share of the construction industry's overall accident rate, to construct the accident prediction model. A total of 24 factors across eight major variables were identified during the first and second phases of data preprocessing of the construction accident big data. Machine learning techniques, specifically supervised learning and ensemble learning, were then employed to identify the optimal predictive model.

Results

Among the models tested, XGBoost emerged as the most effective due to its highly balanced accuracy, even in the presence of class imbalance.

Conclusion

The implementation of the XGBoost accident prediction model, along with the feature importance codes developed in this study, enables the prediction of accident types for specific tasks performed at construction sites. This predictive capability is expected to inform the implementation of targeted accident prevention measures, such as enhancing safety protocols or adjusting work procedures based on the prediction outcomes.

Introduction

Background and Purpose of the Study

As of 2022, workers in the domestic construction industry comprise only 12.4% of the total workforce but account for 24.0% of all accident victims, indicating a risk more than twice as high as in other industries. The construction sector also has the highest fatality rate among all industries, representing 24.3% of total fatalities, highlighting the persistent occurrence of fatal safety accidents.

This study aims to develop an accident prediction model capable of proactively preventing accidents and supporting construction safety policies by analyzing big data on construction safety incidents. Through systematic analysis of construction accident data using big data and machine learning technologies, the study seeks to enhance on-site safety management by predicting the likelihood of accidents.

In response to these challenges, the government has prioritized national safety as a core policy objective, implementing various measures to reduce fatalities caused by industrial accidents. Among these measures is the 'Serious Accident Punishment Act,' effective since January 2022, which mandates a minimum of one year of imprisonment for business owners or managers who neglect safety measures, leading to serious accidents. Despite these policy efforts, the incidence of construction safety accidents has continued to rise.

As shown in Fig.1, the construction accident rate has not significantly decreased, underscoring the limitations of existing safety management methods.

Fig. 1

Accident Rate (Ministry of Employment and Labor).

Scope and Method of the Study

In 2019, the Korean government revised the Construction Safety Management Integrated Information System (CSI) to efficiently manage essential data, including construction accident statistics, and to facilitate the shared use of this information. This study utilizes construction accident data accumulated in the CSI from the second half of 2019 to 2023. Of the total 22,997 cases, 21,063 remained after handling missing values and were used for the big data analysis of construction accidents.

The research procedures for building an accident prediction model through big data analysis of construction accidents consist of four stages, as shown in Fig. 2.

Fig. 2

Research Procedures.

Theoretical Review and Research Trends

The Korean government established the Construction Safety Management Integrated Information System (CSI) in the second half of 2019 to efficiently manage construction accident statistics and other safety-related data, promoting their shared use. The system requires construction project participants to report relevant incidents to the CSI whenever construction accidents occur.

Research utilizing CSI-related data is applied in various ways, such as analyzing accident cases at small construction sites using data from the Korea Safety and Security Agency, managing safety at small and medium-sized building remodeling sites, implementing smart safety integrated control technologies, and enhancing safety management at construction sites. This study provides foundational data that can enhance the system's functionality by offering a broad range of predictive information on construction accidents, including accident types, causes, risk factors, and preventive measures, such as predictions of fatalities, injuries, and material damages. Jun (2015) focused on introducing methodologies rather than new analytical algorithms, covering everything from the collection of big data to the construction of structured data, visualization, and statistical analysis. Hola et al. (2017) conducted a correlation analysis between construction production values and the number of occupational accidents to identify factors influencing the accident rate in the construction industry. Heo et al. (2018) analyzed the frequency of accidents by type and derived preventive measures by examining risk factors associated with different types of construction site accidents, particularly at larger sites. The application of machine learning to reliability engineering and safety analysis was demonstrated by Xu and Saleh (2021), who provided examples of its use in various scenarios. Lee and Kim (2021) proposed a risk assessment plan for different construction types in road bridge work, using Monte Carlo simulation to analyze probabilistic risks in construction safety management. For the prediction and prevention of accidents at construction sites, Awolusi et al. (2022) proposed a safety performance measurement framework and a statistical modeling process. Choi et al. (2021) constructed a model for predicting the risk of fatal accidents at construction sites, utilizing machine learning based on statistical data from industrial accident fatalities. Yoon et al. (2022) conducted a text mining-based customized analysis of four accident types using construction accident data. Ashtari et al. (2022) employed an interpretable machine learning model that considers the potential relationships between risk factors to predict cost overruns in construction projects and evaluate related risks.

Construction Accident Big Data Analysis

Data Preprocessing

The construction accident big data used in this study consists of 22,997 raw data points with 51 variables and 273 elements accumulated in the CSI from the second half of 2019 to 2023. The initial data preprocessing task involved compiling usable variables to train suitable models for this study. This included formatting the data, handling missing values, and applying level constraints to categorical variables to prevent overfitting. The result was a dataset comprising 21,063 construction accident records, including 19 variables and 26 elements, as detailed in Table 1.

Initial Dataset of the Construction Incident DB

In the second phase of data preprocessing, the focus was on performing variable importance analysis to select the most influential variables from the valid set, ensuring efficient and systematic big data analysis. The procedures used for variable selection in big data analysis are critical for improving model performance by enhancing prediction accuracy, preventing overfitting, reducing data dimensionality, and conserving computational resources.

Common methods for variable importance analysis include Principal Component Analysis (PCA), Chi-Square Analysis, Correlation Analysis, Cramér’s V, Feature Importance, and Clustering methods. In this study, Correlation Analysis, which measures the linear relationship between continuous variables, and Cramér’s V, which assesses the association between two categorical variables but only indicates the strength of the relationship without directly measuring the impact on the dependent variable, were excluded.

Evaluation and Selection of Variable Importance

In this study, various machine learning models were applied to analyze data and build prediction models aimed at improving prediction performance. The analysis of variable importance was crucial for identifying key predictive variables and selecting the optimal model.

(1) Principal Component Analysis

Principal Component Analysis (PCA) is a technique used to identify the key features of important data by calculating the covariance matrix from normalized data and finding axes that maximize variance to reduce data dimensionality. Typically, the first principal component explains the most variance, with subsequent components explaining progressively less. The appropriate number of principal components can be determined by identifying the point where the size of the eigenvalue decreases rapidly. In this study, only principal components with an eigenvalue of 1 or greater were retained, and variables were selected by reducing the data to eight dimensions, excluding components below the 'elbow point.' Visualizing the analysis results of the principal components PC1 and PC2, which explain the largest variability in the data, shows arrows representing each variable based on the principal component scores. These arrows indicate how much each variable contributes to the principal components and suggest their correlations through the length and direction of the arrows. However, to compare the contribution of all variables at a glance, the importance of the variables was further derived using a heat-map and bar chart, as shown in Fig. 3.
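The eigenvalue criterion described above (retain components with an eigenvalue of 1 or greater) can be illustrated with a minimal sketch. The example below computes the eigenvalues of a 2×2 covariance matrix of standardized variables in closed form; the correlation value is hypothetical and not taken from the study's data.

```python
import math

# Toy eigenvalue-criterion check: for a symmetric 2x2 matrix [[a, b], [b, c]]
# the eigenvalues are ((a + c) +/- sqrt((a - c)**2 + 4*b**2)) / 2.
# For standardized variables, a = c = 1 and b is their correlation.

def eigenvalues_2x2(a, b, c):
    mid = (a + c) / 2.0
    offset = math.sqrt((a - c) ** 2 + 4 * b ** 2) / 2.0
    return mid + offset, mid - offset  # sorted, largest first

corr = 0.6  # hypothetical correlation between two standardized variables
lam1, lam2 = eigenvalues_2x2(1.0, corr, 1.0)
# Keep only components with eigenvalue >= 1 (the criterion used in the study).
retained = [lam for lam in (lam1, lam2) if lam >= 1.0]
print(lam1, lam2, retained)  # first component explains the most variance
```

For a correlation matrix of two standardized variables the eigenvalues are simply 1 ± ρ, so only the first component passes the criterion here.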

Fig. 3

Variable Importance in PCA.

(2) Chi-Square Analysis

The Chi-Square analysis is commonly used to assess the independence between two categorical variables by constructing a contingency table and calculating the Chi-Square statistic based on the difference between the observed and expected frequencies.

If the p-value calculated from the Chi-Square distribution is smaller than the critical value (typically 0.05), the null hypothesis of independence is rejected, indicating a relationship between the variables. In this study, 18 out of 19 variables, excluding the dependent variable, were confirmed to have significant relationships through Chi-Square tests. Based on the results of the Chi-Square tests, key variables were selected by sorting the variables in descending order of Chi-Square statistics and log p-values, which reflect the strength of the association between the variables, as shown in Fig.4.
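The mechanics of the test can be sketched briefly. The example below computes the Chi-Square statistic for a small 2×2 contingency table and compares it with the well-known critical value of 3.841 (df = 1, α = 0.05); the counts are illustrative only, not the study's CSI data.

```python
# Minimal Chi-Square test of independence for a 2D contingency table.
# Counts are hypothetical, not taken from the study's dataset.

def chi_square_statistic(table):
    """Chi-Square statistic: sum of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical counts: rows = work process A/B, cols = accident / no accident.
table = [[30, 70],
         [10, 90]]
chi2 = chi_square_statistic(table)
CRITICAL_DF1 = 3.841  # Chi-Square critical value for df = 1, alpha = 0.05
print(chi2, chi2 > CRITICAL_DF1)  # statistic above the critical value
```

A statistic above the critical value (equivalently, a p-value below 0.05) rejects independence, the criterion by which 18 of the 19 variables were retained.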

Fig. 4

Chi-Square Statistics by Variable.

(3) Feature Importance Analysis

Feature importance was evaluated using the random forest model by training the data and calculating the importance of each variable based on the Mean Decrease in Accuracy and Mean Decrease in Gini values. The results are visualized in Fig. 5.

Fig. 5

Mean Decrease Accuracy and Gini.

Mean Decrease in Accuracy indicates how much the model's accuracy decreases when each variable is removed, while Mean Decrease in Gini reflects the degree to which each variable contributes to reducing Gini impurity. Therefore, for both metrics, higher values imply that a variable is more important for classification. Variables were ranked based on their 'Mean Decrease in Accuracy' and 'Mean Decrease in Gini' values, as shown in Fig. 5, and Fig. 6 sorts the variables by Overall Importance, which is the standardized sum of the two metrics, to select key variables.
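The per-split quantity that a random forest accumulates into Mean Decrease in Gini can be computed directly. The sketch below evaluates the Gini impurity decrease for one hypothetical split; class counts are illustrative, not the study's data.

```python
# Gini impurity decrease for one candidate split in a classification tree.
# A random forest sums such decreases over all splits that use a variable
# to obtain its Mean Decrease in Gini. Counts below are illustrative only.

def gini(counts):
    """Gini impurity of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = sum(parent)
    n_l, n_r = sum(left), sum(right)
    weighted_child = (n_l / n) * gini(left) + (n_r / n) * gini(right)
    return gini(parent) - weighted_child

# Parent node: 40 injury vs 40 property-damage cases; a split on some
# variable sends them to child nodes with counts (30, 10) and (10, 30).
dec = gini_decrease([40, 40], [30, 10], [10, 30])
print(dec)
```

Larger decreases mean the split separates the classes better, which is why higher Mean Decrease in Gini marks a more important variable.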

Fig. 6

Variables by Overall Importance.

(4) Clustering Analysis

Clustering is a statistical technique used to identify hidden (latent) groups or clusters within observed data. This method operates efficiently, even with large datasets. K-means clustering, which allows the direct specification of the number of clusters, was applied in this study.

Clustering divides the entire dataset into multiple subclusters with similar characteristics, facilitating the identification of the relative importance of variables. The number of clusters significantly affects the analysis results; having too few or too many clusters can reduce the effectiveness of the analysis. In this study, the number of clusters was determined to be four by assessing the AIC, BIC, and Entropy values to ensure model fit, satisfying the condition of having at least two variables in more than three clusters. The seven variables corresponding to the two clusters with effective model fit, indicated by lower AIC and BIC values, were selected, as shown in Fig. 7.
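The K-means procedure itself reduces to an assign-then-update loop. The one-dimensional sketch below uses k = 2 and made-up values purely for illustration; the study chose k = 4 via AIC, BIC, and entropy.

```python
# Minimal 1-D k-means showing the assign/update loop behind the study's
# clustering step. Data and k are illustrative; the study used k = 4.

def kmeans_1d(data, centers, iters=20):
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in data:  # assignment step: each point joins its nearest center
            idx = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[idx].append(x)
        # update step: move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centers, clusters = kmeans_1d(data, centers=[0.0, 10.0])
print(centers)  # centers converge to the two group means
```

In practice the loop runs until assignments stop changing; fit indices such as AIC and BIC are then compared across candidate values of k.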

Fig. 7

Variables Importance Clustering.

As shown above, the key variables selected through principal component analysis, Chi-Square tests, feature importance, and clustering are the eight variables that overlap most consistently across the methods, as listed in Table 2. In order, they are: human accident, damage amount, accident object (medium classification), accident object (large classification), construction cost, work process, construction type (medium classification), and number of workers. The agreement among the selection methods confirms that similar significant variables emerge when their influence on the dependent variable is measured directly.

Selection of key variables

Verification of Preprocessing

To assess the appropriateness of the final selected variables, a comparative verification was conducted using logistic regression, a representative modeling technique that illustrates the effect of independent variables on a dependent variable. This method is particularly relevant when determining significant factors in major accident analysis. The variables were visualized, as depicted in Fig. 8, with non-significant variables excluded, allowing an examination of the extent to which the odds of the dependent variable (exp(β), where β is the coefficient estimate) are influenced by changes in the independent variables' coefficients.

Fig. 8

Logistic Regression Coefficients and Significance.

If the odds ratio exceeds 1, it indicates that an increase in the independent variable raises the likelihood of the dependent variable's success. Conversely, a ratio less than 1 suggests a decrease in success probability. For instance, a coefficient estimate of 0.5 implies that an increase in the independent variable raises the success probability of the dependent variable by approximately 1.65 times (≈exp (0.5)). The verification confirmed that all variables within the effective range were appropriately included among the final selected key variables across three categories classified based on the extracted coefficient estimates and significance levels.
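The exp(β) reading of a coefficient can be checked in one line; the sketch below reproduces the 0.5 coefficient example from the text.

```python
import math

# Odds-ratio interpretation of a logistic regression coefficient:
# a one-unit increase in the predictor multiplies the odds of the
# outcome by exp(beta). beta = 0.5 matches the example in the text.
beta = 0.5
odds_ratio = math.exp(beta)
print(round(odds_ratio, 2))  # approximately 1.65
```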

Determination of Optimal Number of Factors

In the second phase of data preprocessing, key variables were selected, including personal accidents, damage costs, accident objects (medium and large categories), construction costs, work processes, construction type (medium category), and the number of workers. These variables were categorized into 6 to 24 elements (categories of categorical variables), referencing relevant laws, construction performance records, and standard unit costs from the first data preprocessing stage.

To enhance the interpretability of the accident prediction model, prevent overfitting, and improve generalizability, an analysis was conducted on the influence flow and resource allocation among variables with more than 15 elements, such as personal accidents, construction type (medium category), accident objects (medium category), work processes, and construction costs. This analysis, as illustrated in Fig. 9, led to the integration of similar categories and the consolidation of categories with low frequencies, resulting in models comprising 15, 18, and 21 elements.

Fig. 9

Sankey Diagram Analysis.

Table 3 reviews the performance of models with varying numbers of elements using general machine learning models such as multinomial logistic regression, k-nearest neighbors, and decision trees. The results indicate that as the number of elements decreased from 24 to 21, 18, and 15, there was a corresponding decline in most performance indicators, including accuracy, 95% confidence interval, p-value, kappa statistic, and average balanced accuracy. This suggests that further reduction in the number of elements does not enhance model performance with the current dataset. Therefore, to effectively reduce the construction accident rate, a streamlined and intuitive accident prediction model was developed, employing up to 24 elements determined in the second data preprocessing stage.

Performance Inspection of General Machine Learning Model by Application of Factor Count

Evaluation of Construction Safety Prediction Models

In the second data preprocessing task, we utilized a dataset consisting of 21,063 data points and the final eight selected variables to determine the optimal predictive analysis model. We conducted machine learning analyses using various algorithms: Multinomial Logistic Regression, k-NN (k-Nearest Neighbors), Decision Tree, SVM (Support Vector Machine), Bagging (Bootstrap Aggregating), Random Forest, and XGBoost (eXtreme Gradient Boosting). These models were categorized into supervised learning (Multinomial Logistic Regression, k-NN, Decision Tree, SVM) and ensemble learning (Bagging, Random Forest, XGBoost).

Initially, supervised learning models were trained on labeled training data, with predictions generated for new data. The predictive accuracy and performance were evaluated using confusion matrices that compared actual and predicted values, as shown in Table 4.

Performance Indicators of Supervised Learning Models

Given the balanced accuracy, a critical metric for evaluating performance in datasets with class imbalance, Multinomial Logistic Regression emerged as the optimal predictive analysis model among the supervised learning models. To further validate the prediction results and performance of the ensemble learning models, we generated confusion matrices comparing actual and predicted values. The performance metrics for these models are presented in Table 5.
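Balanced accuracy, the comparison metric used here, is the mean of the per-class recalls read off the confusion matrix, which keeps a rare class (such as fatal accidents) from being masked by the majority class. The sketch below uses a hypothetical 3×3 matrix for the three accident types, not the study's actual results.

```python
# Balanced accuracy = mean per-class recall from a confusion matrix.
# Rows = actual class, columns = predicted class. Counts are hypothetical,
# mimicking class imbalance among fatal / injury / property accidents.

def balanced_accuracy(cm):
    recalls = [row[i] / sum(row) for i, row in enumerate(cm)]
    return sum(recalls) / len(recalls)

cm = [[ 8,  1,  1],   # fatal accidents (rare class)
      [ 5, 80, 15],   # injury accidents (majority class)
      [ 2,  8, 30]]   # property-damage accidents
bacc = balanced_accuracy(cm)
plain_acc = sum(cm[i][i] for i in range(3)) / sum(map(sum, cm))
print(round(bacc, 3), round(plain_acc, 3))
```

Unlike plain accuracy, balanced accuracy weighs the rare fatal class equally, which is why it is the appropriate yardstick under class imbalance.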

Performance Indicators of Ensemble Learning Models

A comprehensive evaluation of all seven models indicated that XGBoost, despite the class imbalance, achieved the highest balanced accuracy, establishing it as the optimal predictive analysis model. XGBoost learns decision trees sequentially to correct errors, supports parallel processing for enhanced learning speed, and incorporates regularization to prevent overfitting, making it effective across diverse datasets. Furthermore, it is essential to emphasize the rationale for selecting the XGBoost model over other advanced machine learning techniques, such as deep learning, random forests, or recently developed reinforcement learning models. While deep learning models may offer higher accuracy in some contexts, they often require extensive computational resources and large amounts of data, which may not always be feasible in real-world applications. XGBoost provides a good balance between predictive performance and computational efficiency, making it suitable for scenarios where quick decision-making is crucial. Moreover, the model's ability to handle various types of data and its robustness against overfitting adds to its appeal. A thorough comparison of these methods, including a discussion of their tradeoffs, would further elucidate the rationale behind the application of XGBoost in this study. The feature importance and visualization of the XGBoost model, as illustrated in Fig. 10, provided insights into the features used by the model. By identifying each feature's contribution, we confirmed the relative importance of features to the model's predictive performance.

Fig. 10

Visualize Feature importance.

The features contributing most to the model's predictive performance, in order, were:

  • 1. Personal accidents (accident type)

  • 2. Damage costs

  • 3. Construction costs

  • 4. Accident objects (medium and large categories)

  • 5. Work processes

  • 6. Construction type (medium category)

  • 7. Number of workers

The importance of these features is summarized in Table 6.

Feature Importance Indicators

Accident Prediction Model Selection

In this study, the objective function of the XGBoost model, identified as the optimal construction accident prediction model, comprises two components: the logistic loss function (log loss), which measures prediction accuracy, and a regularization term that penalizes model complexity to prevent overfitting. The model learns data patterns to minimize this objective function:

Obj = -Σ_{i=1}^{n} [y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i)] + Σ_{k=1}^{K} Ω(f_k)

where:

  • n : Total number of data points

  • i : Index of the current data point

  • yi: Actual label of the i-th data point

  • ŷi : Model's predicted probability for the i-th data point

  • K : Total number of trees used in the model

  • k : Index of the current tree

  • fk : Individual tree corresponding to index k

  • Ω(fk) : Complexity of the k-th tree, usually defined by tree depth or the number of leaf nodes
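The log-loss term of the objective can be evaluated directly. The sketch below uses the binary form matching the formula above with hypothetical labels and predicted probabilities (the study's model is multiclass, and the regularization term Ω is omitted).

```python
import math

# Binary log loss, the accuracy term of the XGBoost objective above.
# Averaged per sample here for readability; the regularization term
# Omega(f_k) is omitted. Labels and probabilities are hypothetical.

def log_loss(y_true, y_pred):
    total = 0.0
    for y, p in zip(y_true, y_pred):
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

y_true = [1, 0, 1, 0]
y_pred = [0.9, 0.1, 0.8, 0.3]
print(round(log_loss(y_true, y_pred), 4))
```

Confident correct predictions drive the loss toward zero, while an uninformative p = 0.5 everywhere yields log 2 per sample; boosting adds trees that push this term down.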

However, while the XGBoost model demonstrates strong predictive capabilities, it is important to consider the technical infrastructure necessary for its real-time application in the field. Implementing the model in a real-time environment requires a robust hardware and software setup capable of processing data efficiently; for example, cloud-based solutions can provide the necessary computational resources and scalability. In addition, integrating the predictive model with IoT sensors and safety management systems can facilitate real-time data collection and analysis, enhancing the model's effectiveness in predicting construction accidents. When predicting new data with the XGBoost model, the learned function f generates the predicted value f(xi) from the input variable vector of a new task. Although the model's structure is fixed after training, periodic data updates and algorithm reviews are necessary given the evolving nature of big data analysis. Moreover, while the XGBoost model remains highly accurate despite class imbalance, how that imbalance is addressed deserves elaboration: techniques such as oversampling, undersampling, or SMOTE (Synthetic Minority Over-sampling Technique) could further enhance model performance, and evaluating the performance changes after applying these methods would provide deeper insight into their effectiveness and help ensure that the predictive model is robust and reliable.
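Of the rebalancing options mentioned, random oversampling is the simplest to sketch: minority-class records are duplicated until class counts match (SMOTE would instead synthesize new samples by interpolating between minority-class neighbors). The rows and labels below are hypothetical.

```python
import random

# Random oversampling: duplicate minority-class samples until every class
# matches the majority count. A simpler alternative to SMOTE, which would
# synthesize new samples rather than duplicate existing ones.

def oversample(records, label_of, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    by_class = {}
    for r in records:
        by_class.setdefault(label_of(r), []).append(r)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        # draw with replacement until this class reaches the target count
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced

# Hypothetical rows: (feature, accident_type), with "fatal" as the rare class.
rows = [("a", "injury")] * 8 + [("b", "fatal")] * 2
balanced = oversample(rows, label_of=lambda r: r[1])
counts = {lbl: sum(1 for r in balanced if r[1] == lbl)
          for lbl in ("injury", "fatal")}
print(counts)  # both classes now have 8 samples
```

Whatever rebalancing method is used, it must be applied only to the training split, never to the evaluation data, or the reported balanced accuracy would be inflated.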

Conclusion

This study developed a construction accident prediction model using data from the Construction Safety Management Integrated Information System (CSI), spanning the latter half of 2019 to 2023. The process involved data preprocessing, key variable selection, and the application of various machine learning techniques. During preprocessing, usable variables were compiled, missing values were addressed, and level constraints for categorical variables were set to avoid overfitting. Major variables were selected using principal component analysis, chi-square analysis, feature importance analysis, and clustering.

The final selected variables comprised eight key factors: personal accidents, damage costs, accident objects (medium and large categories), construction costs, work processes, construction type (medium category), and the number of workers. Using these key variables, machine learning analyses were conducted with supervised and ensemble learning approaches. The XGBoost model was identified as the optimal prediction model due to its high balanced accuracy despite class imbalance.

Moreover, to ensure the model’s applicability in real-world scenarios, future research should focus on the technical infrastructure required for real-time applications, including the integration with IoT sensors and safety management systems. It is also essential to address class imbalance using techniques like oversampling or SMOTE to enhance predictive performance.

As a result, specific interventions, such as enhanced safety measures or adjusted work procedures, can be implemented based on the prediction results to prevent accidents. Future research should also emphasize updating data and improving model performance, applying the model across various construction sites to verify its effectiveness. Additionally, continuous exploration of practical applications for the model in real construction projects is crucial, considering changing work environments and site conditions.

References

Ashtari M.A., Ansari R., Hassannayebi E., Jeong J.W.. 2022;Cost overrun risk assessment and prediction in construction projects: A Bayesian network classifier approach. Buildings 12(10):1660. https://doi.org/10.3390/buildings12101660.
Awolusi I., Marks E., Hainen A., Alzarrad A.. 2022;Incident analysis and prediction of safety performance on construction sites. CivilEng 3(3):669–686. https://doi.org/10.3390/civileng3030039.
Cho M.H.. 2021. R data analysis machine learning History of Information Culture Co., Ltd.
Choi S.J., Kim J.H., Jung K.. 2021;Development of prediction models for fatal accident using proactive information in construction sites. Journal of the Korean Society of Safety 36(3):31–39. https://doi.org/10.14346/JKOSOS.2021.36.3.31.
Heo J.K., Choi M.R., Oh K.C., Shin J.Y.. 2018;A study on the analysis of risk factors and the reoccurrence prevention in construction site accidents. Journal of the Korea Institute of Construction Safety 1(1):16–21. https://doi.org/10.20931/jkics.2018.1.1.016.
Hola B., Nowobilski T., Szer I., Szer J.. 2017;Identification of factors affecting the accident rate in the construction industry. Procedia Engineering 208:35–42. https://doi.org/10.1016/j.proeng.2017.11.018.
Jun S.H.. 2015;A big data preprocessing using statistical text mining. Journal of korea institute of Intelligent Systems 25(5):470–476. https://doi.org/10.5391/JKIIS.2015.25.5.470.
Jung B.H.. 2021 Let’s big data R-programming MJEC Books Co., Ltd.
Lantz B.. 2020. Machine learning with R 3rd ed. Acorn Publishing Co., Ltd.
Lee D.Y., Kim D.E.. 2021;A study on the probabilistic risk analysis for safety management in construction projects. Journal of The Society of Computer and Information 26(8):139–147. https://doi.org/10.9708/jksci.2021.26.08.139.
Templ M.. 2019. Simulation for data science with R Acorn Publishing Co., Ltd.
Xu Z., Saleh J.H.. 2021;Machine learning in reliability engineering and safety applications: Review of current status and future opportunities. Reliability Engineering and System Safety 211:107530. https://doi.org/10.1016/j.ress.2021.107530.
Yang O.S.. 2021. R flex statistical data analysis and visualization Paper-delivered Media Co., Ltd.
Yoon Y.G., Lee J.Y., Oh T.K.. 2022;Text mining-based data preprocessing and accident type analysis for construction accident analysis. Journal of the Korean Society of Safety 37(2):18–27. https://doi.org/10.14346/JKOSOS.2022.37.2.18.


Table 1

Initial Dataset of the Construction Incident DB

Variable Type Elements Element Names Note
Weekday Categorical 7 Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday day of the week
Order Binary 2 Public, Private Order classification
ConTypeMa Categorical 3 Building, Civil engineering, Industrial facility Type of construction (Large, medium, small classification)
ConTypeMi Categorical 13 Building, Bridge, Others, Dam, Road, Industrial production facility, Water supply and sewerage, Retaining wall and excavation surface, Railroad, Tunnel, River, Port, Environmental facility
ConTypeSub Categorical 63→19 Apartment complex, Factory, Bridge, Educational and research facility, Neighborhood living facility, Others, Other building facilities, Detached house, Dam and berthing facility, Road and railway, Cultural and assembly facility, Slope and site development, Water supply, Sewage treatment and environmental facility, Lodging facility, Office facility, Retaining wall, Warehouse facility, Tunnel and subway
PersonAcc Categorical 23 Electrocution, Traffic accident, Others, Crush, Caught, Fall (Others), Fall (Obstruction), Fall (Slippage), Fall (Over 10m), Fall (Under 2m), Fall (2m to 3m), Fall (3m to 5m), Fall (5m to 10m), Fall (Unclassified), Hit by object, Collision, Unclassified, None, Cut and stab, Disease, Suffocation, Puncture, Burn personal accident classification
WorkClass Ma Categorical 7 Building, Mechanical equipment, Others, Industrial equipment, Electrical equipment, Civil engineering, Communication equipment Detailed type of construction (Large, medium, classification)
WorkClass Mi Categorical 39→20 Temporary construction, Building subsidiary construction, Pipe construction, Bridge construction, Metal construction, Others, Road and paving construction, Carpentry, Waterproofing, Electrical and industrial equipment construction, Masonry, Steel construction, Reinforced concrete construction, Railway and track construction, Tile and stone construction, Tunnel construction, Earthwork, Communication and mechanical equipment construction, River and port construction, Demolition and dismantling construction
AccObjectMa Categorical 9 Temporary structure, Construction tools, Construction machinery, Construction materials, Others, Components, Facilities, Disease, Soil and rock accident object (Large, medium, classification)
AccObjectMi Categorical 119→24 Temporary materials, Temporary structures, Opening, Construction machinery, Construction materials, High-altitude work vehicle, Tools, Bridge upper structure, Bridge lower structure, Crane (Mobile crane, etc.), Others, Support post, Rail, Components, Scattered materials, Slope, Safety facilities, Lifting materials, Pole and wire, Ground and buried materials, Disease, Windows, Tunnel, Waste
WorkProcess Categorical 58→19 High-altitude work, Others, Painting and finishing work, Finishing and organizing work, Blasting and excavation work, Laying and compaction work, Loading and unloading work, Installation work, Hoisting and loading work, Welding work, Transport work, Electrical and equipment work, Assembly work, Preparation work, Pouring and curing work, Piling and drilling work, Demolition and dismantling work, Formwork and carpentry work, Inspection and verification work Work Process
DamageCost Categorical 8 ~ 10 million won, 10 million won ~ 20 million won, 100 million won ~ 200 million won, 200 million won ~ 500 million won, 200 million won ~ 5 billion won, 500 million won ~ 1 billion won, 5 billion won ~, None Amount of damage
Provincial Categorical 18 Gangwon-do, Gangwon Special Self-Governing Province, Gyeonggi-do, Gyeongsangnam-do, Gyeongsangbuk-do, Gwangju Metropolitan City, Daegu Metropolitan City, Daejeon Metropolitan City, Busan Metropolitan City, Seoul Special City, Sejong Special Self-Governing City, Ulsan Metropolitan City, Incheon Metropolitan City, Jeollanam-do, Jeollabuk-do, Jeju Special Self-Governing Province, Chungcheongnam-do, Chungcheongbuk-do cities and provinces
Municipal Categorical 270 → 26 Gapyeong ~ Anseong, Geochang ~ Goseong, Gokseong ~ Yeosu, Gongju ~ Seocheon, Gwangju Metropolitan City, Gimcheon ~ Gyeongju, Gimpo ~ Pocheon, Damyang ~ Haenam, Daegu Metropolitan City, Daejeon Metropolitan City, Muju ~ Gunsan, Busan Metropolitan City, Bucheon ~ Pyeongtaek, Sangju ~ Pohang, Seoul Special City, Sejong Special Self-Governing City, Ulsan Metropolitan City, Wonju ~ Gangneung, Incheon Metropolitan City, Jangsu ~ Gochang, Jeju Special Self-Governing Province, Jincheon ~ Danyang, Changnyeong ~ Yangsan, Cheonan ~ Taean, Cheorwon ~ Yangyang, Cheongju ~ Yeongdong District, County
ConCost Categorical 20 –10 million won, 10–20 million won, 20–40 million won, 40–100 million won, 100–200 million won, 200–300 million won, 300–500 million won, 500 million won–1 billion won, 1–2 billion won, 2–5 billion won, 5–10 billion won, 10–20 billion won, 20–50 billion won, 50–100 billion won, 100–150 billion won, 150–200 billion won, 200–300 billion won, 300–500 billion won, 500 billion won–, Unclassified Construction amount
BidRate Categorical 8 Less than 60%, 60–64%, 65–69%, 70–74%, 75–79%, 80–84%, 85–89%, 90% or more Bid Rate
ScheduleRate Categorical 10 Less than 10%, 10–19%, 20–29%, 30–39%, 40–49%, 50–59%, 60–69%, 70–79%, 80–89%, 90% or more Schedule Rate
Worker Categorical 6 19 or fewer, 20–49, 50–99, 100–299, 300–499, 500 or more Number of workers
ConAccident Categorical 3 Fatal accident, Injury accident, Property accident Type of accident

Table 2

Selection of key variables

Variable PCA (Importance) χ² (ChiSquare, p_value, log_p_value) Feature Importance (MDA, MDG, Overall Importance) LCCA (rf_Importance)
Weekday 1.883 21.581 4.249e-02 1.371 0.642 125.068 −1.629 5.712910
Order 2.381 7.278 2.626e-02 1.580 26.341 32.235 −1.085 15.506522
ConTypeMa 2.521 36.309 2.498e-07 6.602 11.181 17.198 −2.351 6.131618
ConTypeMi 2.477 87.721 3.424e-09 8.465 17.945 44.535 −1.507 10.583493
ConTypeSub 3.038 181.750 2.314e-21 20.635 27.645 124.533 0.255 8.467327
PersonAcc 2.692 15111.446 0 — 48.978 352.654 4.838 55.548442
WorkClassMa 2.840 38.266 1.388e-04 3.857 22.307 45.105 −1.193 6.958936
WorkClassMi 3.194 324.560 3.561e-47 46.448 18.225 121.465 −0.446 19.905119
AccObjectMa 2.664 1195.965 1.093e-244 243.961 34.135 141.637 0.941 39.532420
AccObjectMi 2.567 1874.469 0 — 35.364 126.029 0.816 31.231556
WorkProcess 3.044 390.228 4.853e-61 60.313 9.881 145.567 −0.704 13.371023
DamageCost 2.790 4066.939 0 — 59.030 135.840 2.607 36.877962
Provincial 2.239 80.977 1.049e-05 4.979 17.898 121.290 −0.471 6.179581
Municipal 2.070 81.745 3.065e-03 2.514 16.005 144.726 −0.286 3.482259
ConCost 2.824 715.974 1.228e-128 127.910 38.458 171.976 1.655 41.611709
BidRate 3.128 57.702 2.954e-07 6.529 22.793 78.204 −0.711 3.443011
ScheduleRate 3.057 88.444 2.743e-11 10.561 15.952 139.301 −0.364 8.872325
Workers 2.711 300.588 1.167e-58 57.932 28.738 73.206 −0.362 19.930692
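The ChiSquare, p_value, and log_p_value columns of Table 2 come from independence tests between each categorical variable and the accident type. A minimal sketch of one such screening test, assuming SciPy is available (the observed counts below are illustrative, not the study's data):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table: rows = categories of a candidate variable
# (e.g. a work process), columns = accident types
# (fatal / injury / property). Counts are made up for illustration.
observed = np.array([
    [50, 10,  5],
    [10, 50,  5],
    [ 5, 10, 50],
])

chi2, p_value, dof, expected = chi2_contingency(observed)
log_p = -np.log10(p_value)  # the log_p_value column of Table 2
```

A large χ² statistic (small p-value, large −log₁₀ p) indicates the variable's categories and the accident types are not independent, which is the basis for retaining the variable.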

Table 3

Performance of General Machine Learning Models by Number of Factors

Model Multinomial Logistic Regression k-NN (k=1) Decision Tree (party)
Factor count 24 21 18 15 24 21 18 15 24 21 18 15
Accuracy 0.9579 0.9581 0.9555 0.9519 0.9562 0.9557 0.9549 0.955 0.9571 0.9557 0.9555 0.9497

95% CI (0.9527, 0.9627) (0.9528, 0.9629) (0.9501, 0.9605) (0.9463, 0.957) (0.9508, 0.9611) (0.9503, 0.9606) (0.9495, 0.9599) (0.9496, 0.96) (0.9518, 0.962) (0.9503, 0.9606) (0.9502, 0.9605) (0.944, 0.9549)

No Information Rate 0.9459 0.9459 0.9459 0.9459 0.9459 0.9459 0.9459 0.9459 0.9459 0.9446 0.9446 0.9446

P-Value [Acc > NIR] 6.798e-06 5.159e-06 0.0002736 0.01725 0.0001107 0.0002194 0.0006395 0.0005199 2.554e-05 4.004e-05 5.111e-05 0.0401

Kappa 0.495 0.4969 0.4378 0.3726 0.4484 0.4301 0.4146 0.4364 0.5026 0.4903 0.4817 0.3411

Balanced Accuracy 0.7575 0.7535 0.7348 0.7096 0.6621 0.6502 0.6450 0.6593 0.7469 0.7144 0.7101 0.6770
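The factor-count comparison in Table 3 amounts to retraining each model on progressively smaller variable sets and re-measuring performance. A sketch of that loop, assuming scikit-learn and a synthetic stand-in for the encoded accident data (the label rule and data shapes are illustrative only):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.integers(0, 6, size=(2000, 24)).astype(float)  # 24 label-encoded factors
y = (X[:, 0] + X[:, 1] > 6).astype(int)                # synthetic accident-type label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

accuracy = {}
for n_factors in (24, 21, 18, 15):  # factor counts compared in Table 3
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_tr[:, :n_factors], y_tr)
    accuracy[n_factors] = model.score(X_te[:, :n_factors], y_te)
```

In the study the dropped factors were the least important ones from Table 2; here the subsets are simple column prefixes for brevity.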

Table 4

Performance Indicators of Supervised Learning Models

Variable Multinomial Logistic Regression k-NN (k=1) Decision Tree (party) SVM
Accuracy 0.9579 0.9562 0.9571 0.9516
95% CI (0.9527, 0.9627) (0.9508, 0.9611) (0.9518, 0.962) (0.946, 0.9567)
No Information Rate 0.9459 0.9459 0.9459 0.9459
P-Value [Acc > NIR] 6.798e-06 0.0001107 2.554e-05 0.0228
Kappa 0.495 0.4484 0.5026 0.2304
Balanced Accuracy 0.7575 0.6621 0.7469 0.5496
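All indicators in Tables 3–5 can be derived from a model's confusion matrix. A self-contained sketch follows; note that caret (the apparent source of these tables) averages (sensitivity + specificity)/2 per class for its multiclass Balanced Accuracy, whereas the macro-averaged recall used here is a common, simpler variant:

```python
import numpy as np

def summarize(cm):
    """Indicators of Tables 3-5 from a confusion matrix
    (rows = true classes, columns = predicted classes)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    accuracy = np.trace(cm) / n
    # No Information Rate: proportion of the most frequent true class
    nir = cm.sum(axis=1).max() / n
    # Cohen's kappa: agreement beyond chance
    pe = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n ** 2
    kappa = (accuracy - pe) / (1 - pe)
    # macro-averaged recall as a multiclass balanced accuracy
    balanced = (np.diag(cm) / cm.sum(axis=1)).mean()
    return accuracy, nir, kappa, balanced

# Toy 3-class matrix (e.g. fatal / injury / property), illustrative counts
acc, nir, kappa, bal = summarize([[50, 5, 0],
                                  [10, 20, 0],
                                  [0, 5, 10]])
```

Kappa near 0.5 with accuracy near the NIR, as in Tables 4 and 5, is why the study also reports balanced accuracy: plain accuracy is inflated by the dominant (injury) class.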

Table 5

Performance Indicators of Ensemble Learning Models

Variable Bagging Random Forest XGBoost
Accuracy 0.9530 0.9513 0.9600
95% CI (0.9475, 0.9581) (0.9457, 0.9564) (0.9548, 0.9647)
No Information Rate 0.9459 0.9446 0.9452
P-Value [Acc > NIR] 0.00588 0.0102 4.409e-08
Kappa 0.3019 0.3544 0.5099
Balanced Accuracy 0.5695 0.6749 0.7307

Table 6

Feature Importance Indicators

Gain The relative improvement in the model's predictions contributed by splits on the feature; the primary measure of each feature's importance
Cover The relative number of observations affected by splits on the feature
Frequency The proportion of splits in which the feature is used
Importance The value generally reported as the feature importance; for tree boosters this is the Gain