Predicting Sustainable Development Goals Using Course Descriptions - from LLMs to Conventional Foundation Models

We present our work on predicting United Nations sustainable development goals (SDGs) for university courses. We use an LLM, PaLM 2, to generate training data from noisy human-authored course descriptions. We use this data to train several smaller language models to predict SDGs for university courses. This work contributes to better university-level adoption of SDGs. The best-performing model in our experiments was BART, with an F1-score of 0.786.


I INTRODUCTION
The United Nations (UN) has established a list of 17 sustainable development goals (SDGs)1. These goals are becoming increasingly important for understanding the societal, humanitarian, and environmental impact of companies in the EU, given that certain large companies must include rigorous sustainability reporting as part of their annual reporting to the authorities2.
Because of this growing importance, many universities and educational institutions have started to adopt the UN SDGs as part of their academic curricula. This raises the important question of how an educational institution can know, at a higher level, which SDGs are being taught and where in its different degree programs.
Adopting local models adjusted to their own course descriptions helps universities comply with the GDPR3 and preserve data privacy. Additionally, by tailoring these models to the unique linguistic and curricular quirks of the institution, prediction accuracy can be increased while maintaining the security and confidentiality of sensitive data.
In this paper, we collect and clean a noisy course description dataset and use an LLM to generate SDG labels for each course. We manually check and fix the LLM-generated data used for testing. Furthermore, we fine-tune several smaller foundation models to predict SDGs from course descriptions. Fine-tuning a smaller model makes SDG prediction faster and more cost-efficient.

II RELATED WORK
Sustainable development has been studied in the field of NLP from many different points of view, such as studying fairness in NLP [Hessenthaler et al., 2022], studying poverty and societal sustainability in interviews [van Boven et al., 2022], argumentation mining [Fergadis et al., 2021], and community profiling [Conforti et al., 2020], among others. Our work differs from these in that we aim to cover all UN sustainable development goals and apply them in a pedagogical context.
Perhaps the most similar prior work to ours is that of Amel-Zadeh et al. [2021]. They used more traditional methods such as word2vec [Mikolov et al., 2013] and doc2vec [Le and Mikolov, 2014] to assess how well companies align with the UN SDGs. They use a dictionary of SDG-related terms to assess the overlap of each SDG with a given company, and then train a logistic classifier, an SVM, and a fully connected neural network on the embeddings. They found that a combination of doc2vec and an SVM gave the best results.
In terms of the pedagogical context of our research, there is plenty of prior research on incorporating SDGs into teaching [Collazo Expósito and Granados Sánchez, 2020, Rajabifard et al., 2021, Kwee, 2021]. This prior research is non-computational, and to the best of our knowledge, there is no prior work on the topic from the NLP standpoint.

III DATA
For our work, we gathered course information from Metropolia University of Applied Sciences via their API4. The retrieved data comprised 51,386 Finnish and English courses from the years 2004 to 2023.

Data Preprocessing
The dataset presented significant challenges in terms of variability and noise, attributed to the subjective nature of course descriptions provided by individual instructors. The length of these descriptions varied widely, and in some cases, the course objectives were either missing or contained redundant or extraneous information.
The study focused on courses offered between 2021 and 2023 to capture recent curricular trends. We imposed a character limit of 500 to 2,000 for the combined length of the course description and objectives to maintain a balance of detail and brevity; courses outside this range were excluded. Moreover, the study was confined to courses conducted in English to maintain consistency in language processing. At this point, the data comprised 8,708 courses across 103 unique disciplines.
Figure 1 depicts the distribution of English courses for the top 15 degrees after the initial cleaning step, illustrating the diverse curricular offerings within the analyzed period. Notably, the 'Information and Communication Technology' discipline shows a significantly higher volume of courses, underscoring the sector's expansion and its pivotal role in contemporary education. This distribution also highlights the institution's curricular focus areas, guiding the subsequent, more qualitative stages of the analysis.
Our standardization process involved several steps, specifically the removal of entries with missing course descriptions or objectives and language detection using the spaCy NLP library [Honnibal et al.]. The dataset, in its finalized state, consisted of 2,125 English courses, each defined by three key elements: name, description, and objective.
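The preprocessing filters above can be sketched in plain Python over course records. This is a minimal illustration, not the authors' code: the field names are assumptions, and the `is_english` predicate stands in for the spaCy-based language check.

```python
def filter_courses(courses, min_chars=500, max_chars=2000,
                   years=range(2021, 2024), is_english=lambda text: True):
    """Apply the preprocessing filters to a list of course dicts.

    `is_english` is a placeholder for the language check (the paper
    uses spaCy); here it defaults to a pass-through.
    """
    kept = []
    for c in courses:
        # Drop entries with a missing description or objective.
        if not c.get("description") or not c.get("objective"):
            continue
        # Keep only recent courses (2021-2023).
        if c.get("year") not in years:
            continue
        # Combined length must fall inside the 500-2,000 character window.
        combined = len(c["description"]) + len(c["objective"])
        if not (min_chars <= combined <= max_chars):
            continue
        # Keep only courses conducted in English.
        if not is_english(c["description"]):
            continue
        kept.append(c)
    return kept
```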

Generating SDGs
We use PaLM 2 [Anil et al., 2023] via the Vertex AI API5 to generate the training data for the SDG prediction models. In particular, we use text-bison-32k from the Model Garden6. PaLM 2 is an LLM that takes a prompt and produces an output based on it, in a similar fashion to ChatGPT [OpenAI, 2023].
In search of the most effective prompt, we employed the prompting IDE Prompterator [Sučik et al., 2023]. To ensure the quality of the model's outputs, we took a small sample of data for batch processing and manually reviewed the model's responses in Prompterator.
This evaluation helped us confirm the appropriateness of the SDG predictions for subsequent training. Moreover, batch processing was instrumental in handling the dataset efficiently, allowing for the dynamic integration of each course's metadata into the prompt template. The responses collected from the model included the SDG goals deemed most relevant by the LLM, as shown in Table 1.
Our final prompt to the model, shown in Table 1, was appended with each course's description. We selected a lower temperature setting (0.2) to limit the output variability of the model and promote accuracy. We found empirically that a token limit of 500 was adequate for the model to produce thorough answers without being overly verbose.
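A sketch of the prompt assembly and response parsing around the LLM call. The function names are illustrative, and the commented-out Vertex AI call is an approximation of typical SDK usage rather than the authors' exact code:

```python
def build_prompt(name, content, objective):
    """Assemble the Table 1 prompt around one course's fields."""
    return (
        "Your goal is to identify UN SDG Goals relevant to students. "
        f"Given a course {name}, the student learns: {content} and {objective}. "
        "Answer the question: What are the top few most relevant sustainable "
        "development goals to this course? Your task is to return only the "
        "numbers of the top few goals separated by commas. "
        "Also, never use the goal number 4."
    )

def parse_sdg_response(text):
    """Parse the model's comma-separated answer (e.g. '3, 5, 8') into ints,
    dropping anything that is not a valid SDG number or is the excluded Goal 4."""
    goals = []
    for token in text.split(","):
        token = token.strip()
        if token.isdigit() and 1 <= int(token) <= 17 and int(token) != 4:
            goals.append(int(token))
    return goals

# The PaLM 2 call itself would go through the Vertex AI SDK, roughly:
#   from vertexai.language_models import TextGenerationModel
#   model = TextGenerationModel.from_pretrained("text-bison-32k")
#   answer = model.predict(build_prompt(...),
#                          temperature=0.2, max_output_tokens=500).text
```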

Data Preparation for Training
After the SDG predictions were generated by the LLM, the dataset underwent a thorough cleaning process. First, the labels generated by the language model were extracted, stripping away the prompts included in the input. Subsequently, we refined the labels to represent SDG numbers only, with the particular exclusion of Goal 4: Quality Education. This goal was excluded because it was over-represented in the data: virtually every university course contributes to quality education.
For compatibility with multi-label classification models, we encoded the list of SDGs relevant to each course into a binary format. The dataset for model training thus comprised two components: the input, encompassing the course name, description, and objectives, and the output, a binary vector denoting the pertinent SDGs.
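The binary encoding can be sketched as follows. The 17-slot layout (index i-1 for goal i, with Goal 4 always zero) follows the example output in Table 2; it is otherwise an assumption about the exact indexing:

```python
NUM_GOALS = 17  # one slot per SDG; Goal 4 is excluded upstream and stays 0

def encode_sdgs(goals):
    """Encode a list of SDG numbers into a binary vector,
    matching the Table 2 output format (index i-1 holds goal i)."""
    vec = [0] * NUM_GOALS
    for g in goals:
        if 1 <= g <= NUM_GOALS and g != 4:
            vec[g - 1] = 1
    return vec
```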
An evaluation of the SDG distribution within the training data was conducted to ascertain dataset quality and representation balance, the results of which are depicted in Figure 2.
The percentage distribution of the model-generated SDG predictions is shown in Figure 2. This figure omits Goal 4 (Quality Education) by design and reveals that certain goals, such as 2 (Zero Hunger), 14 (Life Below Water), and 15 (Life on Land), are less frequently associated with the course descriptions at Metropolia. This skewness reflects the varied emphasis of SDGs in the actual course content.
After the quality of the dataset was confirmed, it was split 70:15:15 into training, validation, and test subsets. This allocation helps prevent overfitting during training and enables a thorough assessment of model performance on unseen data.
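A minimal sketch of the 70:15:15 split using a seeded shuffle; the seed and the exact splitting procedure are assumptions:

```python
import random

def split_dataset(items, seed=42, train=0.70, val=0.15):
    """Shuffle and split into 70:15:15 train/validation/test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n = len(items)
    n_train = int(n * train)
    n_val = int(n * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```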

IV SDG PREDICTION MODELS
We fine-tune several models for the multi-label classification task using the Transformers Python library [Wolf et al., 2020]. We selected BERT [Devlin et al., 2019], mBERT [Devlin et al., 2019], RoBERTa [Liu et al., 2019], XLM-RoBERTa [Conneau et al., 2020], and BART [Lewis et al., 2019] due to their strong results in multi-label classification tasks. XLM-RoBERTa and mBERT, in particular, were chosen to explore the capabilities of multilingual models for potential future use.
The models receive course information as input and are trained to predict the corresponding three most relevant SDG goals as output, aligning with the format of our training data. An illustrative example of the input data and the model's expected output is presented in Table 2. The binary output represents the relevance of specific SDG goals to the given course information, with the example indicating relevance to goals 3, 5, and 8.
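The exact decision rule that turns per-goal scores into the binary output vector is not spelled out above. Two common options, sketched in plain Python under the assumption of a sigmoid-per-label head (as in Transformers' multi-label classification setting), are a fixed threshold and a top-3 cut matching the "three most relevant goals" format:

```python
import math

def sigmoid_threshold(logits, threshold=0.5):
    """Independent sigmoid per label with a fixed decision threshold,
    the standard rule for multi-label classification heads."""
    return [1 if 1 / (1 + math.exp(-z)) >= threshold else 0 for z in logits]

def top_k_vector(logits, k=3):
    """Alternative rule: keep only the k highest-scoring goals
    (matching the three-most-relevant-SDGs output format)."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return [1 if i in top else 0 for i in range(len(logits))]
```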

Input: "Clinical Practice, the student learns: Clinical Practice in nursing environment. Students can apply the theoretical and clinical competence required by the clinical practice environment to the nursing care of clients/patients, can maintain and promote the health of clients/patients and their significant others in a client-oriented way in nursing care, follow the ethical guidelines and principles of nursing, work responsibly as members of work groups and the work community, and can assess their professional competence and develop it further."
Output: [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Training and evaluation of the models were carried out on Puhti, a Finnish research supercomputer provided by CSC - IT Center for Science, which provided the necessary computational resources. Using a V100 GPU's large memory and parallel processing power, the models were trained on a single node to handle our dataset effectively.

V RESULTS
Understanding the efficacy of a given algorithm in the context of multi-label SDG classification depends on the evaluation of model performance. The models' performance was evaluated using precision, recall, and F1-score, which offer a thorough understanding of the models' capabilities, especially in light of the inherent class imbalance in our dataset. The performance metrics for each model are shown in Table 3. BERT's precision of 0.765 and F1-score of 0.781 reflect its proficiency in categorizing instances correctly, demonstrating a reliable balance between precision and recall.
In contrast, BART outperforms the other models with the highest F1-score, suggesting superior efficacy due to its advanced pretraining methodology. The nuanced performance variations across the models underscore the importance of selecting a model tailored to the specific NLP task's requirements.
The per-goal F1 scores in Figure 3 reveal that model performance fluctuates across the SDGs, with BART and mBERT often outperforming the others, particularly on SDGs 7, 8, and 9. The lower F1 scores for SDGs 14, 15, and 17 suggest that data imbalances pose challenges, affecting the models' ability to generalize effectively in these areas. Such patterns indicate the need for enhanced data strategies to address the imbalances and optimize model performance.
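The micro-averaged scores reported in Table 3 pool true positives, false positives, and false negatives across all labels before computing the metrics. A minimal sketch of that computation over binary SDG vectors:

```python
def micro_scores(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 over multi-label binary
    vectors: TP/FP/FN are counted globally across all labels and examples."""
    tp = fp = fn = 0
    for t_vec, p_vec in zip(y_true, y_pred):
        for t, p in zip(t_vec, p_vec):
            tp += t and p            # predicted 1, truly 1
            fp += (not t) and p      # predicted 1, truly 0
            fn += t and (not p)      # predicted 0, truly 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```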

VI CONCLUSIONS
In this paper, we introduced a novel approach to predicting UN SDGs for university courses, employing the PaLM 2 large language model to generate training data from course descriptions. Using various smaller language models, we successfully trained models to predict SDGs for university courses. Notably, the best-performing model in our experiments was BART, achieving an F1-score of 0.786.
This research contributes to advancing the integration of SDGs at the university level, providing a valuable methodology for enhancing the adaptation of sustainable development principles in higher education.The findings open avenues for further research and implementation of similar approaches to foster sustainable practices in academic institutions worldwide.

Figure 1: Distribution of courses per degree after the initial cleaning step.

Figure 2: Distribution of SDG mentions within the training dataset.

Figure 3: F1 scores by SDG for each model.

Table 1: Final prompt specification used for SDG goal generation for each course: Your goal is to identify UN SDG Goals relevant to students. Given a course name, the student learns: course content and course objective. Answer the question: What are the top few most relevant sustainable development goals to this course? Your task is to return only the numbers of the top few goals separated by commas. Also, never use the goal number 4.

Table 2: Example of course information input and the expected binary output for SDG prediction.

Table 3: Model performance based on micro-averaged scores.