Machine learning has emerged as an effective approach for predicting software defects early in the development of complex systems. In this study, historical project statistics and code metrics were collected and preprocessed, and initial class labels were generated using K-means clustering. Several classification algorithms, including Support Vector Machine (SVM), Random Forest, Naive Bayes, and ensemble methods, were then trained and evaluated. The experimental results showed that Random Forest and other ensemble-based models outperformed the alternatives, with some models achieving prediction accuracies close to 99%. The findings further suggest that applying machine learning in this domain can significantly reduce software testing time and cost while improving the quality of the final product. Nevertheless, insufficient labeled data and dataset imbalance still limit broader practical adoption. To address these issues, the study recommends synthetic data generation, resampling and balancing techniques, and advanced deep learning models for large-scale industrial applications.
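
To make the labeling and balancing steps concrete, the following is a minimal sketch in plain Python, not the study's actual implementation. It clusters modules with K-means (k = 2) to obtain initial labels, then applies random oversampling so both classes are equally represented. The toy metrics (lines of code, cyclomatic complexity), the deterministic centroid initialisation, and the rule that the cluster with larger metrics is treated as "defect-prone" are all assumptions made for illustration; the function names are hypothetical.

```python
import random

def kmeans_two_clusters(rows, iterations=20):
    """Lloyd's algorithm with k=2. Returns one cluster index (0 or 1) per row.

    Assumption: centroids start at the rows with the smallest and largest
    metric totals, so cluster 1 is the "larger metrics" cluster and the
    result is reproducible. Features are left unscaled for brevity.
    """
    centroids = [list(min(rows, key=sum)), list(max(rows, key=sum))]
    assignments = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: each row joins its nearest centroid.
        assignments = [
            min((0, 1), key=lambda c: sum((x - m) ** 2
                                          for x, m in zip(row, centroids[c])))
            for row in rows
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in (0, 1):
            members = [r for r, a in zip(rows, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Hypothetical per-module metrics: [lines of code, cyclomatic complexity].
modules = [
    [120, 4], [90, 3], [150, 5], [110, 4], [100, 2], [130, 6],
    [900, 35], [1100, 42],
]

# Assumed heuristic: cluster 1 (larger metrics) stands in for "defect-prone".
labels = kmeans_two_clusters(modules)
balanced_rows, balanced_labels = random_oversample(modules, labels)

print("initial labels:", labels)
print("class counts after oversampling:",
      {c: balanced_labels.count(c) for c in sorted(set(balanced_labels))})
```

The balanced rows and labels would then be passed to a classifier such as Random Forest or SVM. Oversampling should be applied to the training split only, so that duplicated rows do not leak into the evaluation data.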