Abstract
Introduction
In recent years, information has proven to be a significant asset. Many organizations are ready to invest substantial funds and resources in procuring particular types of data. Many global technology enterprises such as Google and Facebook depend on the revenue derived from their ability to display targeted advertisements to prospective viewers. Their ability to manage 1 and process large quantities of data is one of the fundamental strengths of these companies. Two of the most important and closely related technology fields that have emerged in recent decades are data mining and business intelligence (BI). Today, businesses that want to maximize their profits need to use the technology in these fields. Applying BI in an organization means building a new way of accessing historical transaction data that provides fully aggregated, multidimensional, visualized information as an aid for decision makers. Data mining takes a similar approach but performs deeper analysis and processing of the organization’s historical data, which can extend to predicting, for example, potential customers for a certain service. With well-designed algorithms that study, 2 analyze, and examine the data interactions, data mining can provide accurate forecasts of an organization’s performance, depending on the type of information stored. Currently, many data mining algorithms are optimized for artificial intelligence 3 and machine learning developments. Educational organizations are also striving to enhance their services and improve their students’ performance, which data mining can support. Students have always been the focal point of all teaching and learning institutions. In the interest of better education and superior student performance, the Ministry of Education in Oman has continuously developed and utilized technology within the education process.
There are multiple approaches to enhancing education in general, for instance studying successful educational systems in other countries or implementing international standards. However, studying historical data presents a considerably easier, lower cost, and more dependable approach. Reviewing statistics has often proven efficient in large businesses. In addition, the data analysis field has grown vastly in recent years. Data mining algorithms are among the most frequently used methods for studying data to reveal information important for decision makers.
This article focuses on a data mining study of general education diploma students’ records. Students at this level receive particular attention and concern from family and community because this is the final year of school and the gateway to higher education.
This study offers several contributions. First, it identifies and discusses many of the significant factors that influence students’ performance at grade 12 in Omani schools. Second, it implements various predictive and descriptive data mining algorithms, and the results show high predictive ability of the models. Third, it provides groundwork for future educational data mining (EDM) studies, specifically in Oman for the Ministry of Education.
Literature review
Data mining is “the process of finding patterns from a large amount of data by applying some techniques.” 4 Data mining applies rigorous, detailed analysis to large volumes of data. Its reliance on arithmetic calculation makes it robust, and it reveals information patterns precisely from raw data. Data mining results could add considerable value to educational organizations. “As a result, it would assist the educators in providing an effective teaching approach.” 5 “Big organizations use it primarily for finding new ways to increase their profits and to minimize cost.” 4 Data mining techniques include regression, classification, clustering, association rules, and time series analysis, as shown in Figure 1. Each approach contains a large number of algorithms and models that have been studied and modified over time by different researchers. Success rates differ depending on the data and the circumstances of the problem.

Data mining categories.
Educational data mining
EDM is an emerging field that combines the power of data mining technologies with the education sector. EDM focuses on studying data from educational systems such as learning management systems. Prediction of a student’s performance and behavior is achieved using prediction algorithms that include classification, regression, and density estimation. 6 Clustering groups students with similar characteristics, such as interaction patterns and learning curves. Learning analytics is a field similar to EDM: it also studies and analyzes educational data, and both fields aim to improve educational architectures overall. 6
The EDM process consists of five steps, as illustrated in Figure 2. The first step is to collect the targeted raw educational data, which can be obtained from learning management systems, registration systems, and educational surveys. The second step is to preprocess the data to match the input format for the desired EDM method; this could include removing empty and null data and creating cubes if using online analytical processing for multidimensional data. The third step is to implement the EDM methods on the prepared data. In the fourth step, educational business questions should be answered depending on the results of the third step. The last step is to modify the education process accordingly, or to defer this step until further investigation has provided more accurate results. 6

Outline of the EDM process (Adapted from Calvet et al. 6 ). EDM: educational data mining.
Classification
Classification is one of the most widely applied approaches to data mining. Algorithms in this category aim to create classification models using the training data so that each predicted result falls into one of the classes estimated by the classification algorithm. Each algorithm has its own learning technique to detect data that are related to other data in the data set, which helps it build an accurate model that can predict unknown values. Popular classifiers are support vector machines, decision trees, neural networks, naive Bayes, and rule-based classifiers. 7 For each of these classifiers, there are a number of algorithms and versions that have been modified and enhanced over time. Researchers often apply multiple algorithms and compare results and accuracy because different algorithms work better in specific scenarios.
Parneet Kaur et al. 8 applied multiple classifiers to a data set of 152 school students, including multilayer perceptron, naive Bayes classifiers, sequential minimal optimization, the J48 decision tree algorithm, and the reduced error pruning decision tree. The study concluded that multilayer perceptron showed the best performance, with 75% prediction accuracy. Anuradha and Velmurugan 7 targeted the results of undergraduates in their final year; the data set included students’ results for their semesters in college in addition to personal and precollege data collected using surveys. The study applied the J48 decision tree algorithm, naive Bayes classifiers, the k-nearest neighbors algorithm, and two rule-learner classification algorithms, OneR and JRip. The results show that all of these methods performed well, with an accuracy above 60%, and most of the classifiers produced high accuracy rates for students with average performance. The exception was JRip, which predicted more accurately for students with distinctive performance. 7 Their work used many input attributes of college student data, and its results could be very beneficial for higher education managers wishing to enhance their education process.
Mustafa Agaoglu 9 targeted instructors’ performance using the course evaluation questionnaires usually completed by students at the end of each semester. He applied four different classification techniques and seven classification models, including decision tree algorithms (C5.0 and classification and regression trees), support vector machines, artificial neural networks (ANN; quick with two hidden layers (ANN-Q2H), quick with three hidden layers (ANN-Q3H), and multiple method (ANN-M)), and discriminant analysis. Table 1 presents the summarized results of this study. Although this study applied many of the classification algorithms with high accuracy rates, its data set was retrieved only from university students’ evaluations of the courses and the instructors. It could be beneficial for universities; nonetheless, its attributes are limited to the questions included in the questionnaires, and it excludes other significant values like the instructor’s experience, age, health condition, academic degree, and number of working hours.
Comparison of results of the seven classification algorithms. 9
ANN: artificial neural networks; DA: discriminant analysis; CART: classification and regression trees; SVM: support vector machines.
Regression modeling
Linear regression is one of the most used and most easily understood algorithms. It represents the data as a linear graph and primarily shows the relationship between different variables. Usually, studies select a dependent variable (the variable to be predicted) on which they focus their analysis, and then examine how other variables are related to it. This provides an indication of a strong, weak, positive, negative, or absent relationship. Commonly, linear regression fits a straight line to the data, but in some cases the data do not follow a straight line, which is read as a nonlinear relationship.
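As a minimal sketch of the straight-line fit described above, the following computes the least-squares slope and intercept for one independent variable. The attendance and mark values are invented for illustration and are not taken from the study's data.

```python
from statistics import mean


def fit_simple_linear(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data: attendance percentage vs. final mark (illustrative only).
attendance = [60, 70, 80, 90, 100]
final_mark = [50, 60, 70, 80, 90]

slope, intercept = fit_simple_linear(attendance, final_mark)
print(slope, intercept)  # 1.0 -10.0
```

A positive slope corresponds to a positive relationship between the two variables, a negative slope to a negative one, and a slope near zero to no linear relationship.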
Linear regression can be categorized into two types: simple linear regression and multiple linear regression. Simple linear regression is where one independent variable is tested against the dependent variable. It provides a direct analysis of a one-to-one relationship. More commonly, however, multiple independent variables have varying effects on the dependent variable. In this case, a multiple linear regression approach is implemented. This approach measures how the combination of independent variables predicts the dependent variable. In contrast to linear regression, logistic regression is applied when the predicted variable is not continuous, for example, binary answers like yes/no, male/female, or married/bachelor. Since it is difficult to fit a straight line to a binary variable, the probability of the outcome is modeled instead.
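The idea of modeling a probability instead of a straight line can be sketched as follows. This is an illustrative single-variable logistic regression fitted by plain batch gradient descent; the learning rate, epoch count, and the small pass/dropout data set are assumptions made for the example, not values from any of the cited studies.

```python
import math


def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, labels, lr=0.1, epochs=5000):
    """Fit p(y = 1 | x) = sigmoid(w * x + b) by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, labels):
            error = sigmoid(w * x + b) - y  # predicted probability minus true label
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Hypothetical data: graded activities completed, and pass (1) or dropout (0).
activities = [0, 1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 0, 1, 0, 1, 1, 1, 1]

w, b = fit_logistic(activities, passed)
p_low = sigmoid(w * 1 + b)   # estimated pass probability after 1 activity
p_high = sigmoid(w * 7 + b)  # estimated pass probability after 7 activities
print(round(p_low, 2), round(p_high, 2))
```

The output is a probability between 0 and 1 for each student, which can then be thresholded (for example at 0.5) to obtain a binary prediction.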
Burgos and his colleagues 10 present an interesting application of the logistic regression algorithm in modeling students’ performance to prevent dropouts. Their data set consisted of 124 students registered in five courses with 12 graded activities to be completed within 20 weeks. The study aimed to predict dropout in the early stages of the courses in order to apply an intervention plan to improve student performance and prevent students from dropping out. Because the predicted variable was binary (dropout or pass), logistic regression was applied instead of linear regression. The generated prediction models were applied to the students enrolled in the following academic year, over weeks 4, 7, and 10, since week 4 was a due date for activity 1 and week 10 was the courses’ halfway point. Students who were identified as potential dropouts based on the logistic regression model were contacted and informed by instructors as per the intervention plan, which resulted in a dropout rate of only 11% instead of the 26% seen the year before, showing about 14% improvement. 7 This study reflects the promising potential of the EDM field and how it enhances education processes. Many EDM studies do not reach the level where they solve real educational issues, even though that is the ultimate objective of such studies.
Depren et al. 11 conducted a study that involved a large data set and critically compared classification algorithms. In their research about the performance of EDM methods, the authors stated, “logistic regression outperforms other algorithms in terms of measuring classification performance.” The data set used by Serpil Kılıç Depren et al. 11 was obtained from the Trends in International Mathematics and Science Study records of 6250 eighth grade students in Turkey. It involved 11 factors related to student information and mathematics, and a binary dependent variable indicating mathematics achievement. The study tested multiple classification methods: two decision tree algorithms (random forest and J48), a Bayesian network algorithm (naive Bayes), an ANN algorithm (multilayer perceptron), and the logistic regression algorithm. All algorithms achieved a correct-classification ratio above 74%, with logistic regression showing the highest performance at 78.6%.
Clustering
Clustering is similar to classification but is an unsupervised approach to grouping similar elements of data together. The clustering algorithm needs to discover the clusters without any prior idea about the labels or the classes of information. 12 The algorithm measures the distance from each object to the cluster centroids to decide on similarity between objects. 13 Clustering is more of an exploratory or descriptive approach that inspects related data. Although it can be applied in many fields related to EDM, there are some notable research areas where clustering can group study patterns and student behaviors, particularly in online learning services. 13 There are many different clustering algorithms, such as K-means, hierarchical clustering, density-based spatial clustering of applications with noise, and self-organizing feature maps (an unsupervised neural network technique). “K-means’ clustering algorithm is one of the most classic and widely used clustering algorithms.” 14 K-means clustering takes a value K as the number of required clusters and then partitions the data set into K clusters. 13 It can easily cluster large data sets. 12 For such large data sets, however, it is difficult to guess how many clusters (K) should be given to the K-means clustering algorithm. The x-means algorithm is a modified K-means method that can automatically estimate the ideal number of clusters (K) that should be applied, based on the Bayesian information criterion. 13
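The centroid-distance idea behind K-means can be sketched in a few lines. The version below is a plain implementation of Lloyd's algorithm, a common way to run K-means; seeding the centroids with the first K points is a simplification chosen here to keep the run deterministic, and the two-dimensional marks are hypothetical.

```python
import math


def k_means(points, k, iterations=100):
    """Lloyd's algorithm; the first k points seed the centroids (a simplification)."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the algorithm has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignment


# Hypothetical (year 10 total, year 11 total) marks for six students.
marks = [(55, 60), (90, 95), (58, 62), (92, 91), (52, 57), (88, 94)]
centroids, assignment = k_means(marks, k=2)
print(assignment)  # [0, 1, 0, 1, 0, 1]
```

The choice of K is left to the caller, which is exactly the difficulty the text notes for large data sets.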
A study published in 2010 by Alex J Bowers 15 applied hierarchical clustering analysis to a data set of 188 students’ marks over 12 years in two US schools in different districts. The result of the process was combined in a novel, comprehensive cluster-gram that included a cluster tree of the hierarchical clustering, a heat map of students’ subject marks, and categorical information including dropout rates, rates of taking the ACT 1 exam, gender, and district.
Singh et al. 16 applied the K-means clustering algorithm in research that aimed to understand and enhance student performance at the university level. The authors used a data set of 99 complete student records and applied K-means with a different number of clusters each time. They then used a silhouette measure (a measurement of how similar an object is to its own cluster compared with other clusters) to select the most suitable number of clusters, which was three. 16 The results showed the number of students distributed across the three clusters, which could help university management obtain a better idea about student performance. Although this study was simple and concise, it could be improved by applying other data mining algorithms or by enlarging the data set to include more students and/or more attributes.
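The silhouette measure mentioned above can be illustrated with a short sketch. It uses the standard definition, in which each point's score is (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The points and labelings are invented; this is not the data or code used by Singh et al.

```python
import math


def mean_silhouette(points, labels):
    """Average silhouette coefficient over all points (needs at least two clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself does not change the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical one-dimensional points with a sensible and a poor labeling.
points = [(1.0,), (2.0,), (10.0,), (11.0,)]
good = mean_silhouette(points, [0, 0, 1, 1])
poor = mean_silhouette(points, [0, 1, 0, 1])
print(good > poor)  # True
```

Running a clustering algorithm for several values of K and keeping the K with the highest mean silhouette is one way to make the selection described in the text.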
Methodology
The secondary data collection method is applied in this research. This approach suits data mining methods that target large data sets. This research uses transactional educational data already collected and stored over the last 3 years. Using this type of data is highly beneficial but requires the organization’s permission and instructions for use, because these data were originally gathered only for targeted applications. The data required for the research are scattered across the databases and stored in a way that requires many techniques to reshape and connect them in order to provide more meaning for the research. Larger data sets can provide rich and varied information that reflects real-world statistics and interactions. This is essential for organizational improvement and for decision makers, as it can help in spotting irregularities and extracting key performance indicators.
The research methodology in this article is a simple, flexible data mining framework. It creates a mining structure view from raw data extracted from a database. It then applies data mining methods to reveal relevant information. Often, a great deal of feedback occurs during the execution of data mining. Therefore, each step of the framework (Figure 3) can return to the previous step to enhance the results and ensure better performance of the data mining methods, which helps provide reliable information.

Research methodology framework.
Data planning
This project focuses on taking advantage of the vast data already collected and used by the Ministry of Education in Oman for other education purposes. The study is limited to those variables that can be found within the Educational Portal data sets and are thought to be closely related to students’ performance at grade 12. The variables considered in the study are:
Grade 12 student performance (main variable): the student’s grade in the last year of school. This is the sum of the marks from two semesters of work and exams, combined across all subjects. It is the overall percentage of a student’s performance in their 12th year.
Student performance from the previous 2 years: percentage of all marks from all subjects in each year as a separate variable.
Student performance in subjects from all 3 years: the cumulative mark for each subject that the student took in the last 3 years of school.
Student attendance: calculated as the percentage of attendance in year 12.
Age: the age of the student, calculated as the difference between the student’s birthdate and the start of the academic year in which their 12th grade courses began.
School location: geographic location of the school.
School performance: average 12th grade student performance in each school.
Teacher performance: the average performance of all teachers for every 12th grade student.
Teacher attendance: an average of attendance for all teachers who teach students at the 12th grade level.
Class performance: this can include the average student performance in a class and the number of students.
Gender
Nationality
Classification
The decision tree method splits data into subgroups called nodes and leaf nodes. This method is fast and easy to understand, although it is considered to be among the slow-learning algorithms. It is often combined with complex data mining methods to reduce the number of variables by selecting the best variables. Decision trees reflect how strongly related each node is to the parent node. The resulting tree shows the strongest variables at the upper level, followed by the next level, until the tree reaches the leaf nodes. Variables can be used multiple times in the child nodes. Nodes usually represent a part or class of one of the input variables. Decision trees are known to deal with classes or discrete variables, particularly in the output (predicted) variable (e.g. marks of the student must be in classes such as A, B, C, D, F, or 0, 1, 2). For this research, the students’ final year result at school is the criterion variable. Although this result is usually a collection of marks for each subject, it was converted into classes to be accepted by the decision tree method. See Figure 4 for an example of the basic decision tree structure.

Classification using decision tree example.
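How a decision tree decides which variable belongs at the upper level can be illustrated with entropy-based information gain, one common splitting criterion. This is a sketch of that general idea only; the attribute names and records are hypothetical, and the tool used in this study may score splits differently.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on a discrete attribute (rows are dicts)."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


# Hypothetical records: a banded year 11 total and gender, with the year 12 class as label.
rows = [
    {"grade11_band": "high", "gender": "F"},
    {"grade11_band": "high", "gender": "M"},
    {"grade11_band": "low", "gender": "F"},
    {"grade11_band": "low", "gender": "M"},
]
labels = ["A", "A", "C", "C"]

gains = {name: information_gain(rows, labels, name) for name in ("grade11_band", "gender")}
best = max(gains, key=gains.get)
print(best)  # grade11_band
```

The attribute with the highest gain would be placed at the root, and the same calculation would be repeated within each subgroup to grow the lower levels.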
Experiments
The study was implemented using Transact-SQL for data manipulation and SQL Server Analysis Services (SSAS) for data processing. SSAS can form cubes for online analytical processing and tabular multidimensional analytics. 17 It has an easy-to-use UI called SQL Server Data Tools that can be integrated with Visual Studio. This helps create data sources, views, dimensions, and data mining structures with multiple data mining models, and it supports multiple data mining methods including classification, regression, clustering, association rules, time series, and others.
Data extraction
The data available are in the format of views constructed from joins of multiple tables. Some of these views do not have primary or foreign keys because they contain actual transactional data. For example, the results of one student in grade 10 would have 11 rows, each representing a subject result. Some views are separated by year, but some combine all the data with an attribute that shows the year ID. Schools in Oman typically start in September and end in June. See Table 2 for details about the time dimension used within the data. Later in the study, year ID was mostly used to differentiate between the school years (Year ID contains all grades from 1 to 12).
Time dimension details applied in the study.
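Because each subject result is stored as its own row, the data must be reshaped into one record per student before mining. The study did this in Transact-SQL; the following Python sketch only illustrates the reshaping idea, and the column layout `(student_id, grade, subject, mark)` and all values are hypothetical.

```python
from collections import defaultdict


def pivot_results(rows):
    """Turn (student_id, grade, subject, mark) rows into one dict per student and grade."""
    wide = defaultdict(dict)
    for student_id, grade, subject, mark in rows:
        wide[(student_id, grade)][subject] = mark
    return dict(wide)


# Hypothetical transactional rows: one row per subject result.
rows = [
    (1001, 10, "Arabic language", 82),
    (1001, 10, "Applied mathematics", 74),
    (1001, 11, "Arabic language", 85),
    (1002, 10, "Arabic language", 91),
]

wide = pivot_results(rows)
# Average mark per student and grade, as a stand-in for a yearly total result.
totals = {key: sum(marks.values()) / len(marks) for key, marks in wide.items()}
print(totals[(1001, 10)])  # 78.0
```

Each key in `wide` then corresponds to a single input case, with subject marks as separate attributes.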
In this study, the data were organized to best fit all of the algorithms’ requirements. The data were also prepared in three stages, where each stage had a different selection of attributes. The first stage included all possible available attributes that an algorithm would accept in the data format. This stage helped highlight and explore the data available relating to students’ performance in grade 12. The second stage included the attributes of grade 10 and grade 11 only, with leaves of teachers and teacher visits combined for each year. This stage was very beneficial for predicting student performance before students even began year 12 and for checking how accurately student performance can be predicted. The third stage included data from all 3 years, but this was aggregated and considered only the total results of grade 10 and grade 11. This stage helped resolve the issue of students selecting different elective subjects.
Stage 1
For the classification process, three decision tree score methods were used: Bayesian Dirichlet equivalent with uniform prior, Bayesian with K2 prior, and entropy. 17 Bayesian Dirichlet equivalent is the default decision tree method. Both Bayesian Dirichlet equivalent and Bayesian with K2 prior produced very similar decision trees, while the entropy method produced a very large decision tree structure. Since the decision tree method prefers discrete predicted values, the total results of students in year 12 were classed as A, B, C, D, and F, as shown in Figure 5, representing 90–100, 80–89, 65–79, 50–64, and 0–49, respectively. The graph in Figure 6 represents the resulting decision tree, with density highlighting the class “A” students of the year 12 result.

Decision tree histogram legend.

Stage 1 decision tree.
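The conversion of year 12 totals into discrete classes can be sketched as a simple banding function. The band boundaries follow the ranges given above; how fractional marks between bands (for example 89.5) were handled is not stated, so assigning them by lower bound is an assumption of this sketch.

```python
def to_class(total_percent):
    """Map a year 12 total (0-100) to the classes A, B, C, D, F used in the study."""
    if not 0 <= total_percent <= 100:
        raise ValueError("total must be between 0 and 100")
    # Assumption: a fractional mark falls into the band whose lower bound it reaches.
    if total_percent >= 90:
        return "A"
    if total_percent >= 80:
        return "B"
    if total_percent >= 65:
        return "C"
    if total_percent >= 50:
        return "D"
    return "F"


print([to_class(m) for m in (95, 84, 70, 50, 12)])  # ['A', 'B', 'C', 'D', 'F']
```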
The decision tree method identified the five attributes that most affected a student’s total results in grade 12. These are (from highest to lowest): (1) total result of grade 11, (2) results of Islamic culture study in year 12, (3) results of applied mathematics in year 12, (4) results of social studies in year 12, and (5) results of Arabic language study in year 12. The decision tree showed an accuracy of up to 0.78 for the total population, but a much higher accuracy for each individual class of total results for grade 12. The accuracy scores of the decision tree methods for the individual classes A, B, C, D, and F are 0.98, 0.96, 0.95, 0.96, and 0.99, respectively. See Table 3 for clarification of the variables appearing in the resulting decision trees.
Variables interpretations.
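The article does not state how the per-class accuracy scores were computed. One possible reading, offered here purely as an assumption, is a one-vs-rest accuracy in which each class is scored on whether "in this class or not" was predicted correctly. Under that reading, per-class scores can never be lower than the overall accuracy, as the sketch below shows with invented labels.

```python
def overall_accuracy(actual, predicted):
    """Share of cases whose predicted class equals the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def one_vs_rest_accuracy(actual, predicted, cls):
    """Share of cases for which 'belongs to cls or not' was predicted correctly."""
    return sum((a == cls) == (p == cls) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical actual and predicted classes for ten students (illustrative only).
actual = ["A", "B", "B", "C", "C", "C", "D", "D", "F", "F"]
predicted = ["A", "B", "C", "C", "C", "B", "D", "C", "F", "F"]

overall = overall_accuracy(actual, predicted)
per_class = {c: one_vs_rest_accuracy(actual, predicted, c) for c in "ABCDF"}
print(overall, per_class["D"])  # 0.7 0.9
```

A misclassified case counts against only the two classes involved, while it counts as a correct "not this class" decision for every other class, which is why the one-vs-rest figures sit above the overall figure.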
Stage 2
As mentioned previously, the data for the second stage only include attributes from years 10 and 11. This was done to eliminate year 12 attributes affecting the results and to produce a model that could predict year 12 results using only the data of the previous 2 years. The graph in Figure 7 indicates the decision tree for this stage with density highlighting class A of the total results of year 12.

Stage 2 decision tree.
The accuracy decreased to some extent from the previous stage, to 0.69 for the total population. This is because this stage has fewer attributes than the previous stage and does not include year 12 subjects. However, the prediction accuracy by total result class is still surprisingly high. The accuracy scores of the decision tree methods for the individual classes A, B, C, D, and F are 0.98, 0.93, 0.90, 0.90, and 0.91, respectively.
Stage 3
In this stage, only some selected attributes were used. These mostly represent the sum or average of many attributes. This stage includes the number of students in class and the average results of classmates for each year. Age, gender, town, and city are also included. This stage does not include individual subject results, as it considers only the total results of each year as continuous input. See Figure 8 for the resulting decision tree, highlighting students with year 12 results of class “A.”

Stage 3 decision tree.
The total result of year 11 remains the most influential factor on the total result of year 12. The second most influential factor is the total result of year 10, then the number of classmates in year 12, and finally the average class result of year 12.
The accuracy is higher in this stage than in stage 2, with a score of 0.71 for total population prediction. The accuracy scores of the decision tree methods for the individual classes A, B, C, D, and F are 0.97, 0.91, 0.92, 0.93, and 0.95, respectively (see Figure 9).

Decision tree lift chart for the mining structure (class (A) prediction accuracy).
In sum, the decision tree algorithms applied in this study produced different outcomes for each stage. In the first stage, which included all the possible attributes, the decision tree identified the five most important variables. These variables are the total result of year 11 and four of the compulsory subjects of year 12. The prediction accuracy of the individual classes is above 0.94, which is much higher than the total population accuracy of 0.78. In the second stage, the subjects of year 12 were excluded from the attributes. The most important attributes are the total results of years 11 and 10, the result of Islamic studies in year 11, and the town of origin. This confirms the importance of the total results of years 10 and 11 for year 12 results. The prediction accuracy of individual classes is at least 0.90 and the total is 0.69, which is lower than expected. Finally, the third stage eliminated all subject classes and included only aggregations from all 3 years. The total results of years 11 and 10 were again the most important variables, followed by the number of students in class and the average total result of classmates in year 12. The accuracy of the prediction for the total population is 0.71. The ranking of factors suggests that a lower number of students in class and a higher average class performance would greatly affect a student’s performance. The accuracy of predicting classes of students achieving A, B, C, D, or F remained above 0.90, which is almost identical to previous stages. See Table 4 for the most influential factors overall for students’ performance.
Students’ performance most influential factors.
Discussion
Predicting students’ performance in the final year of school could help decision makers plan ahead and have more time to prepare for decisions. Many organizations in the country, such as higher education and training institutions, wait for the school graduates’ results before offering them suitable openings. Predictions with acceptable accuracy have proven benefits in many fields, for example, in climate and weather conditions, and such predictions need to be utilized more in the education sector.
Education stakeholders could also benefit from how descriptive data mining techniques can summarize and show relationships between data. Bearing these relationships in mind could save time and effort in understanding the students’ data and how student performance is reflected. The outcomes of this research suggest many possible modifications for the education process. For example, since the average result of classmates has a significant impact on the results of individual students, low achievers’ performance could be raised by mixing low achievers with high achievers. Also, as per the study results, the total result from year 11 is a strong positive indicator for year 12 performance. Therefore, schools, parents, and students themselves should begin to work together from year 11 to improve student performance. In reality, only subject results from year 12 are used to express the performance of general education diploma students; however, this study shows, using data mining, that results from previous years are strongly related to this performance. Although this study involved about 6000 students and 3 years of their study records, the results are not conclusive enough for direct decision changes at ministry level. Higher-level decisions carry greater responsibility, which calls for extensive testing and approval of outcomes and the inclusion of all possible inputs. The results of this study should be combined with other studies that use primary data and different research methods to ensure accuracy and long-term effective enhancement strategies.
Conclusion
This research article started by introducing the data mining topic and discussing its benefits for education in Oman, particularly for students’ performance in the general education diploma. The second section reviewed the literature available from previous similar studies. Several studies have discussed EDM and applied data mining algorithms. Those studies have informed this research regarding the knowledge and skills needed for data mining application. However, most of these studies either have small input data volumes or use higher education students’ data. In the third section of this study, the methodology of the research was explained and the framework was presented. The fourth section discussed the tools used in the study, applied the classification algorithm, and presented detailed illustrations of the decision tree results. Lastly, the article discussed the findings of the research and their possible implications for the education sector in the Sultanate of Oman.
Data mining undoubtedly has a great deal of potential when applied appropriately; however, there are many challenges when it is applied to a large volume of data. It requires proper tools and the expertise of knowing how to use data mining properly, as well as the knowledge of how data are stored and related. Enhancing students’ performance is the key to better education processes and effective educational decision-making.
