The Impact of Software Team Project Measurements on Students' Performance in Software Engineering Education

It is essential for software engineering instructors to monitor students' performance in their course projects. Identifying the key measures of a software engineering project supports a better assessment of students' performance, helps resolve the difficulties of low-expectation teams, and consequently improves the overall learning outcomes. Several studies have attempted to identify the important measures of software projects, but they captured only the early phases of the project period. This paper introduces a hybrid approach combining classification and feature selection techniques, which aims to cover all phases of software development by investigating all product and process measures of the software project. Experiments were conducted using five classifiers and two feature selection techniques. The results identify the significant process and product measures for software engineering team projects, which primarily improve the assessment of students' performance. The proposed assessment model outperforms the previous models in prediction performance.


Introduction
Software engineering (SE) is an important course for studying the overall software development life cycle and addressing poor software quality (Guéhéneuc & Khomh 2019;Möller 2016;Standish Group 2009). Software engineering teams must have specific skills to practice and learn the software development process. Many causes of software project failure are linked to failures in the teamwork aspects of software engineering, such as lack of experience, schedule overruns, and globally distributed student teams (Cuthbertson & Sauer 2003;Sauer et al. 2007;Daughtrey 2014;Reel 2009;Charette 2005;Duhigg 2016). Many factors affect the learning process of software engineering teams. Measuring the activities of student teams during the classes is therefore very important for assessing their performance and eventually achieving better learning outcomes. The software engineering process measures capture team activities during the adoption of good software development practices, such as the quality and timely completion of non-software deliverables, including website design, meeting participation, and documentation. The software engineering product measures, in contrast, capture issues with the software deliverables (i.e., team outputs), such as user interface, performance, architecture, database design, code quality, and the presentation of the project's final delivery to the instructors, as described in the SETAP project (Lichman 2013). Previous studies captured only the early design and implementation phases of software development. This study focuses on identifying the most critical measures for software engineering projects by assessing the team activities and learning capabilities of the students who participate in such projects.
Our approach lets students anticipate their performance in the software engineering course and focus more on their weaker activities during the software project. The proposed method effectively improves the prediction of grades for software engineering students' projects. The paper makes the following main contributions:
• Investigating the effectiveness of applying a hybrid assessment model to all the time intervals of the software project to improve the performance prediction for software engineering teams.
• Finding the essential product and process measures for each phase to identify low-expectation teams in software engineering education.

Related Work
Several approaches have applied various techniques in the educational environment to assess individual student performance by exploring personal and quantitative aspects of teaching classes, such as grades, dropping frequency, teaching effectiveness, and e-learning techniques (Kotsiantis 2012;Lykourentzou, et al. 2009;Castro, et al. 2007;Baker & Yacef 2009;Baker 2002;Macfadyen & Dawson 2010;Delen 2010;Jovanovic et al. 2012;Hu et al., 2014;Guo et al. 2015). The studies by (Petkovic et al. 2014;Petkovic et al. 2012;Petkovic et al. 2018) used the Random Forest classifier, as recommended by (Gomes, et al., 2017), to predict and assess software engineering teamwork rather than individual students, collecting objective and quantitative data about team activity measures through a joint project among San Francisco State University (SFSU), Fulda University (Fulda), and Florida Atlantic University (FAU). They created a machine learning database that contains
Figure 1. The proposed assessment model

Classification Phase
This phase uses five machine learning classifiers (Bayesian Network, Decision Tree, Random Forest, Bagging, and Boosting) to explore the effectiveness of all software process and product measures on students' learning process for each particular time interval. We briefly describe the five classifiers below:
• Bayesian networks are directed acyclic graphs that represent the joint probability distribution over a set of random variables. Each vertex in the graph represents a random variable, and the edges represent conditional dependencies between variables. The Bayesian network classifier returns the class label that maximizes the posterior probability, as described by (Han et al. 2011).
• Decision Trees construct a flowchart-like structure in which each internal node represents a test on an attribute and each leaf node holds a class label. For a testing instance, a path is traced from the root to a leaf node to obtain the class prediction, as explained by (Thai-Nghe et al. 2011).
• Random Forest is an ensemble classification technique that generates many decision trees from bootstrap samples of the training data. Each tree predicts a class label for each input vector, and the output of the classifier is the class that receives the majority of votes from all the decision trees in the forest, as in the study (Guo et al. 2015).
• Bagging is an ensemble method that uses several training sets generated by random draws with replacement. Each data set is used to train a model, and the outputs of all models are combined to produce a single class label, as used by (Friedman & Popescu 2003).
• Boosting learns the predictors sequentially. The first predictive model learns from the whole data set, while each subsequent model learns from a training set based on the performance of the previous one. The weights of the misclassified examples are increased, so they have a higher probability of being included in the next predictor's training set, as concluded by (Friedman & Popescu 2003).
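The bagging idea described above (bootstrap training sets, one model per set, majority vote) can be illustrated with a minimal, dependency-free sketch. It is not the implementation used in the study: the base learner here is a one-level decision tree ("stump"), and the two features (weekly meeting hours, late issue count) and all values are hypothetical toy data.

```python
import random
from collections import Counter


def train_stump(rows, labels):
    """Fit a depth-1 decision tree: one feature, one threshold, two leaf labels."""
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:
        # No valid split (all rows identical): predict the majority label.
        majority = Counter(labels).most_common(1)[0][0]
        return (0, rows[0][0], majority, majority)
    return best[1:]


def predict_stump(stump, row):
    f, threshold, left_label, right_label = stump
    return left_label if row[f] <= threshold else right_label


def bagging_fit(rows, labels, n_models=25, seed=0):
    """Train one stump per bootstrap sample (random draw with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx]))
    return models


def bagging_predict(models, row):
    """Combine the individual predictions by majority vote."""
    votes = Counter(predict_stump(m, row) for m in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: [weekly meeting hours, late issue count] per team.
rows = [[2.0, 9], [3.0, 8], [2.5, 7], [9.0, 1],
        [8.0, 2], [10.0, 0], [7.5, 1], [9.5, 2]]
labels = ["F", "F", "F", "A", "A", "A", "A", "A"]

models = bagging_fit(rows, labels)
low_team = bagging_predict(models, [1.0, 10])
high_team = bagging_predict(models, [11.0, 0])
```

Random Forest follows the same scheme with full decision trees and a random subset of features at each split, whereas Boosting replaces the uniform bootstrap draw with one that favors previously misclassified examples.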

Feature Selection Phase
Journal of Education and Practice www.iiste.org ISSN 2222-1735 (Paper) ISSN 2222-288X (Online) Vol.11, No.31, 2020

This phase employs two nature-inspired meta-heuristic techniques, evolutionary search and PSO (Particle Swarm Optimization) search, as feature selection techniques to detect the most important measures. These techniques search the space of solutions (the population) for the optimal solution, which represents the best set of features for assessing students' performance and thereby helps improve that performance in software engineering classes. The two techniques are described below:
• Evolutionary Search is an iterative process that refines the population, generation by generation, toward the fittest solutions using three main steps, as in (De La Iglesia 2003;Namous et al. 2020). (1) A set of random solutions is initialized in the form of individual chromosomes, each representing the set of parameters of the problem. Each chromosome is an array of binary digits in which each position indicates whether a feature is selected (one) or not (zero). The strength of a chromosome is obtained by applying an ML classifier as an evaluator on the features marked with ones in that chromosome, with an ML evaluation metric used as the fitness value. (2) The two chromosomes with the highest fitness values are selected as the elite of the population (the fittest parents) to create new chromosomes (children). (3) The children are created by one of two nature-derived operations, crossover or mutation, chosen with a random probability. Crossover exchanges parts between the two parent chromosomes, while mutation flips bits at random positions in a chromosome to generate a new child. The evolutionary search then estimates the fitness values of the children and uses them to replace the least relevant solutions in the population.
• PSO Search is used for optimizing constrained and unconstrained tasks, as described by (Eberhart & Kennedy 1995). It relies on a set of particles that iteratively change their positions according to a stochastic process. In each iteration, the fittest particle among all particles is identified, and all particles move toward its location. Each particle also takes into account the best position that it has itself found during the search.
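The three evolutionary search steps described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the configuration used in the study: the population size, number of generations, and crossover probability are arbitrary, and `toy_fitness` with its `RELEVANT` index set is a hypothetical stand-in for the classifier-based fitness (for example, the Recall of a classifier trained on the selected measures).

```python
import random


def evolutionary_search(n_features, fitness, pop_size=20, generations=40,
                        crossover_prob=0.5, seed=0):
    """Search for a binary feature mask (1 = selected) that maximizes `fitness`."""
    rng = random.Random(seed)
    # Step 1: random initial population of binary chromosomes.
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    n_children = pop_size // 2
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        # Step 2: the two fittest chromosomes act as parents (elitism).
        parent_a, parent_b = population[0], population[1]
        children = []
        for _ in range(n_children):
            # Step 3: create each child by either crossover or mutation.
            if rng.random() < crossover_prob:
                cut = rng.randrange(1, n_features)
                child = parent_a[:cut] + parent_b[cut:]
            else:
                child = list(rng.choice((parent_a, parent_b)))
                child[rng.randrange(n_features)] ^= 1  # flip one random bit
            children.append(child)
        # Children replace the least fit half; the parents always survive.
        population[-n_children:] = children
    return max(population, key=fitness)


# Hypothetical stand-in fitness: reward three "relevant" feature indices and
# lightly penalize the number of selected features.
RELEVANT = {1, 4, 7}


def toy_fitness(chromosome):
    selected = {i for i, bit in enumerate(chromosome) if bit}
    return len(selected & RELEVANT) - 0.1 * len(selected)


best = evolutionary_search(n_features=10, fitness=toy_fitness)
selected_features = [i for i, bit in enumerate(best) if bit]
```

Because the two parents are never replaced, the best fitness in the population cannot decrease from one generation to the next. In practice the fitness call dominates the cost, since each evaluation trains and validates a classifier on the candidate subset.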

Dataset
The dataset includes over 115 process and product measures, obtained by measuring the core activities of student teams throughout their final project in joint software engineering classes held at San Francisco State University, Fulda University, and Florida Atlantic University. The software process measures relate to non-software issues such as team participation, students' feedback, delivery of documentation, instructor intervention, cooperation concerns, and following proper software engineering practices. The software product measures capture software issues such as functionality, architectural design, code, database, and the final product demo. The dataset covers 74 student teams with about 380 students. The data are organized into 11 time intervals: the first five intervals correspond to the requirement gathering (T1), design (T2), development (T3), testing (T4), and delivery (T5) phases of the Software Development Life Cycle (SDLC), while the remaining time intervals (T6-T11) are generated by aggregating various phases from the main five time intervals (T1-T5). The semester is divided into five formally managed milestones, M1 through M5. The product and process components are each assigned a final grade of A (high expectation) or F (low expectation), and each software team project is assessed at the 11 time intervals. This study focuses on predicting the class labeled F rather than the class labeled A.

Experiment Results
To evaluate the predictive models, we use Recall, which represents the percentage of positive tuples that the classifier labels as positive. The assessment of students' performance varies across the time intervals of SE activities. These activities come as individual tasks and as grouped tasks, the latter formed by merging the values of the evaluation metrics. Figure 2 shows the results of the five machine learning techniques on the product measures. The x-axis represents the eleven time intervals, comprising individual tasks (T1-T5) and grouped tasks (T6-T11), while the y-axis represents the Recall value for class F (low expectation). For the product component, over the block of SDLC phases comprising the first five tasks, the Bayesian Network (BN) model outperforms the other ML techniques, with a Recall of 0.844 at task five and better average Recall values over the other tasks. The conditional probabilities in BN effectively tie the measures (or features) to the target, owing to the low divergence in the data distribution. The entropy-based techniques show plausible results, affected by the random distribution of the data, with Recall values comparable to those of the Bagging technique, which uses a decision tree as its base classifier. For the other time intervals (the grouped tasks), in contrast, the entropy-based techniques achieve high average Recall. This is because the random skewness of the data distribution, with its large variance, increases the entropy values and helps the entropy-based techniques to discretize the target label.

Figure 3 shows the Recall for the F class label for the different classifiers in all intervals (T1 to T11) for the process measures. The Decision Tree classifier gives the best Recall in seven time intervals, while Bagging gives the best Recall in the remaining four intervals. The highest Recall is 0.68, obtained using Bagging for the T2 time interval. A likely reason is that T2 represents the detailed specifications phase, in which teams are expected to communicate and collaborate extensively to complete the phase. Bayesian Networks give the worst results in six time intervals, while Random Forest gives the worst results in four time intervals. We also note that none of the classifiers obtains good results in predicting the F class label for time intervals T4 and T5, where T4 represents the beta launch milestone and T5 represents the final delivery milestone.

The previous studies (Petkovic et al. 2014;Petkovic et al. 2012;Petkovic et al. 2018) did not present classification results for the remaining time intervals: they focused only on T2 and T3, used only the Random Forest classifier, and implemented stratified sampling in cross-validation, since the class labeled F is the minority while the class labeled A is the majority in the dataset. Also, the study (Naseer et al. 2020) investigated only the software product measures for the first five intervals (T1 to T5) using several different classifiers. In contrast, our proposed assessment model is a comprehensive approach that covers all the time intervals from T1 to T11 for both software product and process measures, and it combines classification and feature selection techniques. Table 1 summarizes the comparison between the previous models and the proposed assessment model.

The results reported in the prior studies (Petkovic et al. 2014;Petkovic et al. 2012;Petkovic et al. 2018) presented the important measures only for the time intervals T2 and T3 and do not cover the key measures for all 11 time intervals of the software project period. For software process measures at T2, the best Recall in the study (Petkovic et al. 2014) is 66.7%, whereas it is 68% in our approach. Moreover, for software product measures at T3, the best Recall in their approach is 60% using the Random Forest classifier, whereas it is 68.8% in ours. These findings indicate that our assessment model improves the prediction of the final grades for the low-expectation SE teams. Furthermore, our results comprehensively report all process and product measures for all 11 time intervals of the software project period. In other words, the previous studies captured only the early design and implementation phases of the SDLC, while our approach captures all the phases of software development as well as combinations of these phases. This leads to a better understanding of the overall software development.
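The Recall for class F that underlies these comparisons is the number of F teams correctly predicted as F divided by the total number of actual F teams. A minimal sketch of that computation follows; the function name and the label lists are hypothetical toy values, not data from the study.

```python
def recall(y_true, y_pred, positive="F"):
    """Fraction of actual `positive` instances that were predicted as `positive`."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred)
                   if t == positive and p == positive)
    false_neg = sum(1 for t, p in zip(y_true, y_pred)
                    if t == positive and p != positive)
    if true_pos + false_neg == 0:
        raise ValueError("y_true contains no instances of the positive class")
    return true_pos / (true_pos + false_neg)


# Hypothetical toy labels: four actual F teams, three of them detected.
y_true = ["F", "F", "F", "F", "A", "A", "A", "A"]
y_pred = ["F", "F", "F", "A", "A", "F", "A", "A"]

recall_f = recall(y_true, y_pred)  # 3 / (3 + 1) = 0.75
```

Note that the A team wrongly predicted as F does not affect Recall; it would lower Precision instead, which is why Recall suits the goal of missing as few low-expectation teams as possible.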
Besides, the study (Naseer et al. 2020) investigated only the software product measures for the first five intervals (T1 to T5) using several different classifiers. The best Recall values reported for their assessment model are 3.1%, 15.6%, 53.1%, 78.1%, and 75% for the time intervals T1, T2, T3, T4, and T5, respectively. The best Recall values in our findings are 65.6%, 46.9%, 68.8%, 59.4%, and 84.4% for the same intervals, as specified in Table 1. Our proposed assessment model therefore outperforms their Recall values in every interval except T4, where their reported Recall is higher. Compared with the previous studies, these findings indicate that our assessment model improves the prediction of the final grades for the low-expectation SE teams.

In addition, our study aims to propose an effective assessment model for predicting low-expectation teams (F grade); it applies several machine learning classifiers to all process and product measures for each software team. Based on the best resulting Recall, evolutionary and PSO search are applied to identify the most relevant measures for process and product individually. Knowing these process and product measures allows software engineering instructors to devote more effort and attention to overcoming the ambiguity and difficulties of low-expectation teams. Our approach also lets students anticipate their performance in the software engineering course early and focus more on their weaker activities during the software project. Our study aims to improve learning in software projects, which leads to a better software engineering education process. With respect to the feature selection phase, it is essential to select the most important process and product measures to help instructors assess the final grades of low-expectation teams.
Figure 4 and Figure 5 show the experimental results on the measures selected by evolutionary and PSO search, compared with the entire set of measures, for the software product and process, respectively. The two techniques derived relevant measures with remarkable results compared with the full set of measures in the product component, especially in tasks T6 to T11, with a measure reduction of approximately 95.7%. This indicates that some measures have a high impact on assessing performance, as shown in Figure 4. These measures are mainly linked to software product attributes, such as collecting all important activities that students handle in the code repository for each software team.

The results show that using only a small number of measures (1 to 27) can give results almost similar to those obtained using all 85 measures. It is noteworthy that the two feature selection techniques capture the relevant features with good results compared with all measures, with an average feature reduction of 87.6%. Since our goal is to predict how well the team applied the best software engineering practices, we look more closely at the important measures identified by the feature selection techniques in the first time interval. The selected measures are related to software process attributes such as the total number of meeting hours and the total number of in-person meeting hours. The results of this phase are described in Table 2 and Table 3. Regarding the software product and process measures, the most important measures are related to tool logs, class data, and weekly time card surveys. The time intervals T4, T6, and T8 share the product measures semesterId and helpHoursAverage, which reflects the milestones that these time intervals have in common during the measurement of software engineering team activities.

(Petkovic et al. 2018) used GINI feature ranking to show the top 10 ranked process and product measures for time intervals T2 and T3, respectively. The process measures of T2 were: lateIssueCount, issueCount, standardDeviationHelpHoursTotalByWeek, averageHelpHoursTotalByWeek, standardDeviationHelpHoursAverageByWeek, codingDeliverablesHoursAverage, helpHoursStandardDeviation, and standardDeviationMeetingHoursAverageByWeek. Furthermore, the study (Naseer et al. 2020) used the J48 decision tree in sequential phases of assessment from T1 to T5 to select the software product features for classification; they indicated a few product measures, as listed in their study. The proposed model detects the mutual measures after applying evolutionary and PSO search, as summarized in Table 2 and Table 3.
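The measure-reduction percentages quoted earlier in this section express the share of measures that feature selection removes. As a small worked example, the sketch below computes the reduction at the two ends of the reported range (1 and 27 selected measures out of 85); the function name is hypothetical, and the per-interval counts behind the 87.6% average are not reproduced here.

```python
def feature_reduction(n_selected, n_total):
    """Percentage of measures removed when n_selected of n_total are kept."""
    if n_total <= 0 or not 0 <= n_selected <= n_total:
        raise ValueError("expected 0 <= n_selected <= n_total and n_total > 0")
    return 100.0 * (1 - n_selected / n_total)


largest_subset = feature_reduction(27, 85)   # about 68.2% of measures removed
smallest_subset = feature_reduction(1, 85)   # about 98.8% of measures removed
```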
The common measures selected by our employed feature selection techniques and the feature selection technique employed by (Naseer et al. 2020)
Our feature selection techniques detected many of the essential measures selected by previous work. The proposed model also detects many other important features that were not detected by the previous studies (Petkovic, et al. 2018;Naseer et al. 2020). The literature has shown that evolutionary and PSO search can select significant features. Therefore, the chosen measures can guide instructors in monitoring low-expectation teams who are at risk and enable students to recognize the important activities in their projects that will affect their final grade.

Discussions
Performance prediction is of major importance in software engineering education and can lead to better academic learning outcomes for students and instructors. Software engineering education concentrates on monitoring low-expectation teams to improve their skills and activities during the learning process. Many studies have applied different machine learning techniques to educational datasets collected over successive semesters in order to improve academic practices. Software engineering education involves project-based learning tasks: SE students work in teams over several time intervals during the project period to eventually pass the course, and assessments at different phases help to monitor the performance of each team in each time interval. Predicting low-expectation teams, in turn, relies on machine learning techniques. The SETAP dataset is a rich educational dataset for software engineering education that enables SE instructors to observe the performance of SE teams over 11 phases of assessment. A hybrid assessment model of classification and feature selection techniques was employed to cover all the assessment phases and improve the performance prediction. The results show that it is possible to predict the performance of teams at all stages of evaluation, with significant accuracy in predicting low-expectation teams, which enables SE instructors to identify these teams' problems and learning difficulties.
To find the best prediction, the proposed model was applied at the first level of assessment and then sequentially up to the eleventh level, for both product and process measures. For the time intervals also explored in the previous studies (Petkovic et al. 2018;Naseer et al. 2020), the results show significant improvements in prediction compared with the previous assessment models.
The performance prediction of low-expectation teams is much better for software product measures than for software process measures. This difference might be related to the large number of measures of both types: there are 115 product measures and 84 process measures. The software process measures relate to non-software issues such as team participation, students' feedback, delivery of documentation, instructor intervention, cooperation concerns, and following proper software engineering practices.
The software product measures capture software issues such as functionality, architectural design, code, database, and the final product demo. The assessment models of (Petkovic et al. 2018;Naseer et al. 2020) are used as baselines to check the efficiency of the proposed model in predicting low-expectation teams. The proposed model outperformed other methods at assessment levels two, three, four, and five. At level five, the proposed model achieved its best Recall in predicting the low-expectation teams. Identifying the learning difficulties of low-expectation teams is an important concern for software engineering education. The proposed assessment model helps to achieve the learning objectives by improving the activities of software engineering teams. Prediction based on software product and process measures can help instructors detect low-expectation teams and allows instructors and students to anticipate the final grades. The proposed model improves software project assessment and supports the success of software engineering courses in software engineering education.

Threats to validity
The main challenges for this study are: 1) the dataset is small, since it contains only 74 student teams; 2) the dataset is unbalanced, since occurrences of the class labeled F are few compared with the large number of records of the class labeled A, while the focus of our study is to predict the class labeled F (low-expectation) rather than the class labeled A (high-expectation); and 3) this study does not use any type of sampling in cross-validation when estimating Recall in the presence of unbalanced training data. We believe that our performance prediction results would improve if sampling were used to increase the occurrences of the class labeled F in the dataset.
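One simple form of the sampling mentioned above is random oversampling, in which minority-class records are duplicated until the two classes are balanced. The sketch below is only an illustration of that idea, assuming two classes; the function name and toy records are hypothetical, and the study itself did not apply this step.

```python
import random


def oversample_minority(rows, labels, minority="F", seed=0):
    """Duplicate randomly chosen minority-class rows until both classes match in size."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    if not minority_idx:
        raise ValueError("no records of the minority class to sample from")
    majority_count = len(labels) - len(minority_idx)
    n_extra = max(0, majority_count - len(minority_idx))
    picks = [rng.choice(minority_idx) for _ in range(n_extra)]
    new_rows = list(rows) + [rows[i] for i in picks]
    new_labels = list(labels) + [labels[i] for i in picks]
    return new_rows, new_labels


# Hypothetical toy training fold: two F teams and four A teams.
train_rows = [[2.0], [3.0], [9.0], [8.0], [10.0], [7.5]]
train_labels = ["F", "F", "A", "A", "A", "A"]

balanced_rows, balanced_labels = oversample_minority(train_rows, train_labels)
```

Oversampling should be applied to the training folds only, after the cross-validation split, so that duplicated F records never appear in both the training and the test fold and inflate the estimated Recall.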

Conclusion
This paper proposes a hybrid assessment model for evaluating students' performance in software engineering projects. Unlike the previous approaches, which studied only the design and implementation phases, this study captures all the phases of software development and detects the essential measures for each phase. It employs several machine learning classifiers and feature selection techniques to cover all the time intervals of the software project period. Experiments are conducted on a rich academic dataset collected through joint software engineering classes among three universities. The Recall measure is used to show the accuracy of predicting low-expectation teams across 11 sequential time intervals, based on a predefined criterion determined by the assigned instructors. The results reveal the significant process and product measures for software engineering team projects, which mainly affect the assessment of students' performance. The proposed model helps software engineering instructors assess students by highlighting the essential software measures for each specific time interval of the project. It also supports students' learning during software projects by letting them concentrate on the software measures that are vital to their final grades. Future directions include applying deep learning techniques in the software engineering education environment.