Compare Machine Learning Algorithms With Astronomical Dataset
DOI: https://doi.org/10.62517/jbdc.202401407
Author(s)
Puiwan Wang
Affiliation(s)
Fettes College, Edinburgh EH4 1QX, UK
Abstract
Stellar classification is one of the most important tasks in astronomy, with the goal of classifying celestial objects based on their spectral properties in order to infer their physical characteristics. This paper classifies stars, galaxies, and quasars using four of the most popular machine learning algorithms, namely Logistic Regression, Support Vector Machine (SVM), Random Forest, and Naïve Bayes, on the Sloan Digital Sky Survey Data Release 17 (SDSS DR17) dataset. The study followed a rigorous methodology comprising several preprocessing steps: data cleaning, normalization, outlier handling, and feature engineering to improve model performance. During training, each algorithm optimized its internal parameters to minimize prediction error and maximize reliability. The data were split into training and test sets, and feature normalization was applied for model stability. The algorithms were evaluated using precision, recall, F1-score, and confusion matrices, which provided a detailed comparison of their respective strengths and weaknesses. Random Forest emerged as the most effective classifier, achieving 98% accuracy owing to its ability to capture complex patterns, while Logistic Regression and SVM delivered moderate performance with accuracies of 84% and 85%, respectively. Naïve Bayes, though computationally efficient, struggled with the dataset's complexity, achieving only 67% accuracy. These results emphasize how important algorithm selection is, given dataset characteristics and classification objectives. The study also highlights the role of feature engineering and comprehensive evaluation in enhancing predictive reliability. By demonstrating an effective application of machine learning to stellar classification, this work contributes to the development of automated analysis in astronomy and opens perspectives for further improvement through enhanced feature selection, ensemble techniques, and larger datasets.
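The pipeline described in the abstract (train/test split, feature normalization, Random Forest classification, and precision/recall/F1 evaluation) can be sketched as follows. This is a minimal illustration, not the authors' actual code: the real SDSS DR17 photometric features and labels are replaced here by synthetic data, and the feature/class structure is invented purely so the example runs end to end.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Synthetic stand-in for SDSS photometric features (in the real dataset these
# would be magnitudes such as u, g, r, i, z plus redshift).
rng = np.random.default_rng(0)
n_samples = 1500
X = rng.normal(size=(n_samples, 6))
# Illustrative labels 0/1/2 (e.g. star/galaxy/quasar) derived from a simple
# rule so the classifier has a learnable pattern.
y = (X[:, 0] + 0.5 * X[:, 5] > 0).astype(int) + (X[:, 1] > 1).astype(int)

# Train/test split, stratified so all three classes appear in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Feature normalization for model stability, as described in the abstract;
# the scaler is fit on the training set only to avoid data leakage.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Random Forest, the best-performing model reported in the paper.
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print(f"accuracy: {accuracy_score(y_test, pred):.3f}")
print(classification_report(y_test, pred))
```

The same split, scaling, and report steps apply unchanged to the other three classifiers (LogisticRegression, SVC, GaussianNB), which is what makes the per-metric comparison in the paper straightforward.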
Keywords
Stellar Classification, Machine Learning Models, Feature Engineering, Predictive Accuracy, Sloan Digital Sky Survey (SDSS)
References
[1]Stellar classification, based on spectral absorption and emission patterns, reveals fundamental properties like temperature and chemical composition. (Source: Carroll, Bradley W., and Dale A. Ostlie. An Introduction to Modern Astrophysics, 2nd ed., Pearson, 2006.)
[2]Historical cataloguing, along with landmark discoveries like Andromeda's distance from the Milky Way, has shaped our understanding of the universe’s scale. (Source: Hubble, Edwin P., "A relation between distance and radial velocity among extra-galactic nebulae," Proceedings of the National Academy of Sciences, 1929.)
[3]Advances in telescope technology have allowed astronomers to classify distant celestial objects, providing insights into cosmic structure and evolution. (Source: Tyson, Neil DeGrasse.)
[4]www.kaggle.com. (n.d.). Stellar Classification Dataset - SDSS17. [online] Available at: https://www.kaggle.com/datasets/fedesoriano/stellar-classification-dataset-sdss17/data.
[5]The concept of model generalization and the ability to predict on unseen data is a foundational aspect of supervised learning. See: Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016, pp. 98-102.
[6]Optimization techniques, like gradient descent, are used to adjust model parameters to minimize prediction error iteratively. For details, refer to: Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016, pp. 199-201.
[7]Ensuring a model is generalized and robust is essential for reliable predictions in real-world applications. Refer to: Domingos, Pedro. "A few useful things to know about machine learning." Communications of the ACM, vol. 55, no. 10, 2012, pp. 78-87.
[8]Proper dataset splitting is essential for model evaluation and preventing overfitting. See: Kohavi, Ron. "A study of cross-validation and bootstrap for accuracy estimation and model selection." International Joint Conference on Artificial Intelligence, 1995, pp. 1137-1145.
[9]Feature engineering and dimensionality reduction methods like PCA (Principal Component Analysis) can enhance model performance, particularly for high-dimensional datasets. For a detailed look at PCA and feature selection, refer to: Jolliffe, I. T. Principal Component Analysis. Springer Series in Statistics, 2002.
[10]Cross-validation, specifically k-fold, is widely used to mitigate the variance due to specific data splits, enhancing the robustness and generalization of model evaluations. See: Kohavi, Ron. "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection." IJCAI, 1995, pp. 1137-1143.
[11]Tuning parameters such as n_estimators and tree depth in Random Forests helps to prevent overfitting and improves model robustness. For further reading, see: Breiman, Leo. "Random forests." Machine Learning, vol. 45, no. 1, 2001, pp. 5-32.
[12]Incorporating cross-validation and hyperparameter tuning into feature engineering can significantly strengthen the reliability of machine learning models. See: Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012, pp. 250-255.
[13]Choosing the appropriate classification algorithm is crucial for achieving optimal performance based on the dataset characteristics. For a comprehensive review of classification algorithms, see: Hodge, V. J., and J. H. Austin. "A survey of outlier detection methodologies." Artificial Intelligence Review, vol. 22, no. 2, 2004, pp. 85-126.
[14]Improving feature selection can significantly enhance model performance, especially in complex datasets. See: Guyon, Isabelle, and André Elisseeff. "An introduction to variable and feature selection." Journal of Machine Learning Research, vol. 3, 2003, pp. 1157-1182.