Comparison of ML algorithm performances in evaluating human generated text summaries using Google Vertex AI
Date
2024
Publisher
Faculty of Science, University of Kelaniya, Sri Lanka
Abstract
Text summary evaluation is a fundamental yet time-consuming task, and although existing methodologies have sought to streamline it, more effective and efficient approaches to assessing summary quality are still needed. Machine learning (ML) algorithms, including ensemble methods, hold promise for automating summary evaluation, yet their comparative performance when using Vertex AI embeddings remains underexplored. This study analyzes and compares the effectiveness of these algorithms in predicting text summary quality, providing insights for advancing natural language processing (NLP) summary evaluation. The dataset, comprising summaries written by students in grades 3-12, was provided by CommonLit, and its validity was inspected through human evaluation before use. Embeddings from the "textembedding-gecko@001" model in Vertex AI were used to calculate the cosine similarity between each summary and its source text, and this score, together with the Flesch-Kincaid Grade, the Automated Readability Index, and surface metrics such as word count and text length, served as input features for the machine learning models. All experiments were conducted on a Vertex AI virtual machine (E2 configuration: 2 vCPUs, 16 GB memory) under consistent computing conditions. Through rigorous experimentation, the accuracy and efficiency of these models in predicting the content and wording scores of summaries were compared. Results indicate that while accuracy in predicting content scores is relatively high, accuracy for wording scores remains a challenge due to the subjective nature of language. Notably, the gradient boosting regressor (GBR) and random forest regressor (RFR) emerge as the most accurate models, with mean squared error (MSE) values of 0.2533 and 0.2766 for content scores, while the ridge and linear regressors demonstrate superior efficiency, with average prediction times of 1.0299E-07 s and 1.1580E-07 s respectively for the same score. The GBR model was also checked against human evaluation on a small set of examples, which confirmed an accuracy consistent with its MSE scores. The superior performance of GBR and RFR can be attributed to their ensemble nature, leveraging multiple decision trees to make more accurate predictions. Conversely, the linear and ridge regressors are more economical in computational resources and processing time, making them well suited to scenarios where efficiency is paramount; the simplicity of linear models and their ability to generalize to new data contribute to this efficiency, despite slightly lower accuracy than the ensemble methods. However, the inner workings of these models remain obscured by their black-box nature, highlighting the need for further research to elucidate the underlying mechanisms and potential avenues for improvement. In conclusion, this study contributes valuable insights into the challenges and opportunities in text summary evaluation, paving the way for future advancements in the field.
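The abstract describes deriving input features from Vertex AI embeddings (cosine similarity between each summary and its source) together with readability indices and surface statistics. The sketch below illustrates one possible way to build those features; it is not the authors' code. The helper name build_features is hypothetical, the textstat package is assumed for the Flesch-Kincaid Grade and Automated Readability Index, and vertexai.init() is assumed to have been called with a valid project and location.

```python
# Illustrative feature-extraction sketch, assuming vertexai.init() has been
# called and the textstat package is installed. Function names are hypothetical.
import numpy as np
import textstat
from vertexai.language_models import TextEmbeddingModel

# The abstract names the "textembedding-gecko@001" embedding model.
embedding_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def build_features(source_text: str, summary: str) -> dict:
    """Return one row of input features for predicting summary quality."""
    source_vec, summary_vec = (
        np.array(e.values)
        for e in embedding_model.get_embeddings([source_text, summary])
    )
    return {
        "cosine_similarity": cosine_similarity(source_vec, summary_vec),
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(summary),
        "automated_readability_index": textstat.automated_readability_index(summary),
        "word_count": len(summary.split()),
        "text_length": len(summary),
    }
```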
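The abstract also compares regressors on both accuracy (MSE) and prediction time. A minimal sketch of such a comparison with scikit-learn is shown below; the train/test split, default hyperparameters, and timing loop are assumptions for illustration and do not reproduce the paper's exact experimental settings.

```python
# Illustrative model-comparison sketch: MSE and average per-sample prediction
# time for the four regressor families named in the abstract.
import time

from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def compare_models(X, y):
    """Fit each regressor and report MSE and average per-sample prediction time."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    models = {
        "gradient_boosting": GradientBoostingRegressor(random_state=42),
        "random_forest": RandomForestRegressor(random_state=42),
        "ridge": Ridge(),
        "linear": LinearRegression(),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        start = time.perf_counter()
        predictions = model.predict(X_test)
        elapsed = time.perf_counter() - start
        results[name] = {
            "mse": mean_squared_error(y_test, predictions),
            "avg_prediction_time_s": elapsed / len(X_test),
        }
    return results
```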
Keywords
Text summary evaluation, Algorithmic efficiency, Vertex AI, Embeddings
Citation
Jayawardene B. P. D.; Hewapathirana I. U. (2024), Comparison of ML algorithm performances in evaluating human generated text summaries using Google Vertex AI, Proceedings of the International Conference on Applied and Pure Sciences (ICAPS 2024-Kelaniya), Volume 4, Faculty of Science, University of Kelaniya, Sri Lanka. Page 132