Massification in Universities: Are Assessment Tools Still Reliable? A Reflection from Sokoine University of Agriculture, Tanzania

Almost every country in the world, including Tanzania, has experienced a tremendous increase in the number of university students. The increasing number of students has greatly affected instructors' workload and the general practice of student assessment and evaluation. This study aimed at determining the reliability of the assessment tools at Sokoine University of Agriculture. A retrospective record review was done on undergraduate education students who sat for the EDP 100 examination in the 2014/2015, 2015/2016 and 2017/2018 academic years; the course was selected through random procedures. A total of 214 scripts were drawn from each cohort by systematic random sampling. The results revealed a drop in the internal consistency of the scores obtained from the EDP 100 course across the three cohorts. Although the majority of the EDP 100 questions were moderately difficult, their discrimination power was poor. However, the variation in difficulty and discrimination indices across the three cohorts was not statistically significant (p > 0.05 for MCQ and MIQ), except for the discrimination index for MIQ, which varied significantly (p < 0.05). It is therefore recommended that similar studies be done to determine both the validity and the reliability of the assessment tools for other subjects at the University.


Introduction
Almost every country in the world has experienced a tremendous increase in the number of university students. While global university enrolment rose from 13.8% in 1990 to 29% in 2010, Sub-Saharan Africa experienced a doubling of gross enrolment ratios from 3% in 1990 to 7% in 2010 (Hornsby & Osman, 2014). In Tanzania, the situation has become more evident in the recent past (Kapinga & Amani, 2016). According to Memba & Feng (2016), student enrolment in Tanzanian universities increased from 98,915 in the 2008/2009 academic year to 354,430 in 2015/2016. Sokoine University of Agriculture, one of the public universities in Tanzania, was established on 1st July 1984 (Sokoine University of Agriculture, 2007). Since its establishment, the university has experienced a massive increase in the number of students, just like other universities in the country. For example, the number of students roughly tripled, from 2729 in the 2008/2009 academic year to 8296 in 2016/2017. Following this increase in the number of university students, the instructor-student ratio has been greatly affected, leading to ineffective provision of quality teaching and student assessment (Ntim, 2016). Large classes in education institutions greatly affect the interaction between instructors and students. An increase in the number of students leads to poor communication between instructors and their students and undermines the general practice of designing and using appropriate assessment tools (Alomari & Akour, 2014). Large classes hinder instructors from organizing quizzes and regular class tests, resulting in inefficient assessment of the teaching and learning process (Yelkpieri, Namale, Esia-Donkoh & Ofosu-Dwamena, 2012). The increase in the number of students in education institutions has thus changed the normal way of conducting student assessment in universities.
Regardless of the increasing numbers, universities wish to maintain the quality of the programmes they offer. One means of maintaining the quality of training is effective evaluation of the teaching and learning process, and effective evaluation requires valid and reliable assessment tools. Checking the internal consistency of the assessment tools used in Tanzanian universities is therefore an important aspect of effective assessment.

Statement of the Problem
The increasing number of university students, which has not been matched by the recruitment of instructors, has greatly affected instructors' workload. With large classes, tutorials and practical sessions, which were considered important elements of learning, have been replaced by examination papers or reports (Mohamedbhai, 2008). Examinations are held more frequently, and lecturers often repeat the same examination papers for different groups of students (Mohamedbhai, 2008). Furthermore, the nature of examination questions has changed greatly, as most lecturers prefer multiple choice and short answer questions, which are easier to mark and save time (Chan, 2010). These objective questions are not necessarily bad, as research shows that they can cover a wider range of the content taught than essay questions. Such questions can also measure higher cognitive levels of learning when carefully constructed (Scully, 2017). Nevertheless, given the instructors' workload caused by increased student enrolment, one cannot be certain that teaching and assessment receive the attention required to be effective. For this reason, the study set out to determine the consistency of the tools used by instructors in assessing student learning outcomes.

Objectives of the Study
The main objective of this study was to explore the reliability of university examinations across years as the number of students increases. Specifically, the study intended to:
i. Examine the internal consistency of the Introduction to Educational Psychology (EDP 100) university examinations across three years at Sokoine University of Agriculture.
ii. Assess the difficulty and discrimination indices of the Introduction to Educational Psychology (EDP 100) university examination items across three years at Sokoine University of Agriculture.
iii. Determine whether difficulty and discrimination indices vary significantly across years.

Research Questions
i. What are the average values of the internal consistency of the EDP 100 university examinations at Sokoine University of Agriculture for a period of three years?
ii. What are the average values of the difficulty and discrimination indices of the examination items used in EDP 100 across three years at Sokoine University of Agriculture?
iii. Do difficulty and discrimination indices vary significantly across years?

Reliability of an Assessment Tool
Assessment of learning outcomes is crucial in any education institution because of its diagnostic role and its role in improving teaching and student learning (Tremblay, Lalancette & Roseveare, 2012). Assessments are acknowledged as the most powerful educational tools for promoting effective student learning and as a means by which instructors can help their students learn (Rahman & Majumder, 2014). Learning outcomes in classes, and in education institutions in general, can be assessed through various types of test items and techniques, such as multiple choice, short answer, true or false and essay questions, portfolios, tutorials, practicals, observation, checklists, anecdotal records, assignments and projects (Miller, Linn & Gronlund, 2009; Omari, 2006). For these assessment tools to be valid, they must also be reliable so as to bring out the desired outcomes. The reliability of an assessment tool is concerned with its ability to measure the desired learning outcomes consistently (Tavakol & Dennick, 2011). It is the consistency of a measurement (Miller, Linn & Gronlund, 2009). A reliable assessment tool should ensure that test scores are stable and free from measurement error (Ghazali, 2016). Reliability exists in several forms, such as test-retest, inter-rater, equivalent forms and internal consistency (Ursachi, Horodnic & Zait, 2015; Oliveira et al., 2016). Test-retest reliability checks what happens to an instrument over time, on the assumption that there are no substantial changes in the construct being measured between two occasions, whereas inter-rater reliability concerns the consistency with which different investigators obtain the same results using the same tool (Ursachi, Horodnic & Zait, 2015).
Equivalent-forms reliability involves the concurrent administration of two parallel or alternate forms of the assessment tool to the same students and the computation of the correlation coefficient between them (Ajayi, 2013). Internal consistency reliability evaluates the consistency of results across the factors within a test (Hajjar, 2018). It indicates whether the items on a test that are intended to measure the same construct produce consistent scores (Tang, Cui & Babenko, 2014). Furthermore, internal consistency allows individual scale items to be examined for deviation from particular factors (Harms & Biocca, 2004). The reliability of an assessment tool is affected by various factors. Zhu and Han (2011) observed three factors that affect the reliability of a test. Firstly, changes in the candidates and the testing process. These arise either from a change in the true score due to a change in the candidates' language ability, or from misleading test results that do not reflect a candidate's real language level. Secondly, features of the test itself, such as the length and difficulty of the test paper. A longer paper generally shows higher reliability than a shorter one, because the more content a paper contains, the wider the range it covers. It follows that the more representative the content of the paper, the higher its reliability. The difficulty and discrimination of the test also affect its reliability, since questions that are either very difficult or very easy influence the reliability of the test. Thirdly, the method of marking the test paper: mistakes made during marking tend to lower the reliability of the test. Objective questions do not require any subjective judgment and can therefore achieve high reliability.
This is in contrast to subjective questions, which require the marker's subjective judgment and hence affect the reliability of the paper. Kinyua and Okunya (2014) added that improper use of Bloom's taxonomy in test construction, ambiguous test items and poorly written questions prompt students to guess, which in turn tends to lower the reliability of the test paper. Furthermore, the reliability of items may be affected by wording that provides insufficient information, by terms that lead to misunderstanding, and by biases in composing the question items (Ercan, Yazici & Sigirli, 2007).
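The effect of test length on reliability noted above can be illustrated with the general Spearman-Brown prophecy formula. This is a standard psychometric result rather than part of the discussion by Zhu and Han (2011), and it assumes that the added items are comparable to the existing ones. In the sketch below, r is the reliability of the original test, k is the factor by which its length is multiplied, and the starting value r = 0.60 is purely hypothetical.

```latex
r_{kk} = \frac{k\,r}{1 + (k - 1)\,r},
\qquad
k = 2,\; r = 0.60:\quad
r_{kk} = \frac{2(0.60)}{1 + 0.60} = \frac{1.20}{1.60} = 0.75
```

Under these assumptions, doubling the length of the paper raises the estimated reliability from 0.60 to 0.75. The case k = 2 is the same correction that is applied when a split-half correlation is stepped up to a full-length reliability estimate.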

Difficulty and Discrimination Indices of the Test Item
Educators perform what is called item analysis after administering an examination to students (Khoshalm & Rashid, 2016). Item analysis examines students' responses to individual test items in order to assess the quality of those items and of the test as a whole (Khoshalm & Rashid, 2016). Item analysis focuses on identifying problematic items. According to Varma (2014), as cited by Adegoke (2014), poorly written items, pictures, graphs and diagrams, or a lack of clear information may lead to the absence of a correct response on a test item; items containing default distracters and bias for or against ethnic groups are further reasons for problematic item responses that must be resolved. Difficulty and discrimination indices are among the parameters in item analysis that ensure the standard of examination items (Pande, Pande & Parate, 2013). According to Aron (2006), as cited in Johari et al. (2011), the difficulty index serves four purposes. Firstly, it identifies concepts that need to be taught again, once it is discovered that students cannot answer particular questions. Secondly, it identifies and reports the strengths and weaknesses of parts of the curriculum, that is, those that students can and cannot master. Thirdly, it gives feedback to students on their strengths and weaknesses in the topics assessed. Finally, it identifies questions that are content biased, such as content that may have been highlighted during teaching sessions. Thus, the difficulty index is crucial for all educators, regardless of their level. On the other hand, the discrimination index compares the number of people with high test scores who answered an item correctly with the number of people with low scores who answered the same item correctly. This index is considered a basic indicator of item quality (Cornachione, 2005). The difficulty and discrimination indices of a test item are affected by several factors.
According to Olatunji (2009), Oyejide (1991) and Mehrens and Lehmann (1973), as cited in Ngung'u (2015), three factors affect the difficulty and discrimination indices of an item. Firstly, the number of options provided for an item, which has either a positive or a negative effect on both indices. Secondly, the students' level of understanding of a particular concept, which might lead to inappropriate responses. Thirdly, teachers' training in item development, which is necessary for producing well-formulated questions. In support of these factors, Sung, Lin and Hung (2015) pointed out that phonetic discrimination, the number of plausible distracters, heterogeneity of sentence patterns in the options, the necessity for inference, lexical overlap, content familiarity and redundancy of necessary information influence the difficulty and discrimination levels of test items.
Item analysis studies have long been conducted worldwide to check the reliability of assessment tools in education institutions. In Malaysia, a retrospective study examined the competency assessment of undergraduate medical students who had taken end-of-posting examinations after completing a paediatric rotation. The study involved two cohorts, and the results showed that the difficulty and discrimination levels of multiple choice and long case questions varied (Taib & Yusoff, 2014). Similarly, only about 50% of the test items for research in teaching beginners among music students in public universities in Malaysia were reported to have moderate difficulty and discrimination indices, based on Kuder-Richardson 20 and 21 (Sabri, 2013). A study by Mukherjee and Lahiri (2015) in a medical college in Kolkata, Bengal, India, reported that multiple choice questions were accepted for further assessment when they attained a p-value of 20% to 90% and a discrimination index of ≥ 0.3. On the other hand, the analysis of test items for a history achievement test for standard 11 students in India led to the rejection of some of the items (Gowdhaman & Nachimuthu, 2013). Furthermore, the internal consistency and reliability of the Networked Minds measure of social presence were studied in Nepal. The results obtained using Cronbach's alpha indicated that the subscale factors were consistent (Harms & Biocca, 2004). According to Boopathiraj and Chellamani (2013), the analysis of researcher-made test items is of great importance, as some of the items made by postgraduate students in Tamil Nadu were found to be defective in both difficulty and discrimination levels. In Indonesia, the difficulty level of a subjective English test was analysed during the 2013 mid-semester at SMA Negeri 1 Pendole, where the answer sheets of tenth grade students were used for data collection. The findings showed that most of the test items were moderately easy and that the test made by the teachers could be classified as a good test (Lebagi & Darmawan, 2014).
A study by Adegoke (2014) on the role of item analysis in detecting and improving faulty physics objective test items was conducted in Ibadan, Nigeria, with a sample of 900 senior secondary school students. The results showed that some of the items were extremely easy and that some failed to discriminate between higher and lower achievers. In South Africa, a quality assurance study of the tools used in assessment was conducted. The study investigated the difficulty and discrimination ability of examinations in an undergraduate pharmacy programme at the Medunsa campus of the University of Limpopo. The difficulty and discrimination indices were calculated for each True/False and constructed response question from a total of 15 summative examinations at the first- and fourth-year levels. The results showed that most of the items had acceptable difficulty and discrimination indices, though some had weaknesses. The authors recommended that more educators carry out item analysis of the tests they write and communicate their findings (Fourie, Summers & Zweygarth, 2010). In Kenya, a study on item analysis investigated the factors affecting the reliability, difficulty and discrimination indices of science test items in commercial papers for class eight students (Ngung'u, 2015). In Tanzania, studies have reported that secondary school teachers have little capacity in assessment practices, including computing both difficulty and discrimination indices for the test items they construct (Byabato & Kisamo, 2014). On the other hand, HakiElimu (2012) reported on the reliability of the examinations administered by the National Examinations Council of Tanzania to secondary school students.
In light of the literature consulted, it can be argued that most item analysis of examinations has concentrated on lower levels of education. Furthermore, in South Africa, analysis practices in higher education have been observed more for formative assessment and less for summative assessment. In Tanzania, evidence from the National Examinations Council of Tanzania indicates that secondary school teachers have little capacity in assessment practices, and even less is known about higher learning institutions. Thus, this study aimed at determining the reliability of the assessment tools in Tanzanian universities.

Methodology
The target population comprised first year undergraduate education students at Solomon Mahlangu College of Science and Education, Sokoine University of Agriculture (SUA). The college has five departments, from which the department of education was randomly selected. The researchers then randomly selected one course from the sampled department, namely Introduction to Educational Psychology (EDP 100). A retrospective record review was done on undergraduate education students who sat for the EDP 100 examination in the 2014/2015, 2015/2016 and 2017/2018 academic years. From each academic year, 214 scripts were drawn by systematic random sampling, so that a total of 642 scripts were obtained and included in the study.

Introduction to Educational Psychology University Examination Format and Assessment Components
Each university examination contained multiple choice questions (MCQ), matching item questions (MIQ) and essay type questions (ETQ) as assessment tools for summative evaluation, developed by the department of education. Before the end-of-semester university examination was administered, students were required to participate fully in the lectures and to take all the required continuous assessment tests, seminars and assignments as part of formative assessment. The MCQ, MIQ and ETQ were constructed by the course instructors and later moderated by the department of education. The number of MCQ and MIQ items ranged from 17 to 30, of which the first 17 or 20 questions were included in the study. There were two or three ETQ per examination, all of which were included in the study.

Statistical Analysis
Three statistical measures were computed, namely the internal consistency reliability coefficient, the difficulty index and the discrimination index. The calculations were done in an Excel spreadsheet.
Determination of internal consistency reliability coefficient: This was measured by the split-half method as proposed by Boyle (2017). According to Webb, Shavelson and Haertel (2006), the split-half method divides the test items into two halves, from which a split-half reliability coefficient is derived. The correlation between the two halves, that is, the odd score (X) and the even score (Y), was estimated by the Pearson product moment correlation formula shown in equation 1 (Mukaka, 2012).
$$ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}} \qquad (1) $$
Whereby: r = correlation coefficient of a half-length test, X = odd score, Y = even score, $\bar{X}$ = mean of X scores, $\bar{Y}$ = mean of Y scores.
Furthermore, the reliability of the full-length test was expressed using the Spearman-Brown formula shown in equation 2, as adopted from Webb et al. (2006).
$$ r_{tt} = \frac{2r}{1 + r} \qquad (2) $$
Whereby: $r_{tt}$ = reliability coefficient of a full-length test, r = correlation coefficient of a half-length test.
Journal of Education and Practice www.iiste.org ISSN 2222-1735 (Paper) ISSN 2222-288X (Online) Vol.12, No.23, 2021
Difficulty index (P): This is known as the p-value or easiness value and describes the percentage of students who correctly answered the item; it ranges from 0 to 100% or from 0 to 1 (Hingorjo & Jaleel, 2012). According to Bichi (2015), the difficulty index is denoted as P and for objective questions is symbolically given as;
$$ P = \frac{R}{N} $$
Whereby: P = difficulty index, R = number of examinees who answered the item correctly, N = total number of examinees who sat for the test. An extension of this formula was given by Boopathiraj and Chellamani (2013), who stated that;
$$ P = \frac{R_u + R_l}{N_u + N_l} $$
Whereby: P = difficulty index, $R_u$ = number of students in the upper group who responded correctly, $R_l$ = number of students in the lower group who responded correctly, $N_u$ = number of students in the upper group, $N_l$ = number of students in the lower group. On the other hand, the difficulty index for subjective questions was computed according to the formula given by Nitko (2004). The difficulty index determines the difficulty level of examination questions by classifying them as easy, moderate or hard (Johari et al., 2011). It further compares the difficulty that a group of students has in answering the same examination question (Johari et al., 2011).
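The split-half method and the Spearman-Brown formula described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the study itself reports doing the calculations in Excel, the function names and the data layout (one list of item scores per examinee) are hypothetical, and the scores are made up.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


def split_half_reliability(item_scores):
    """Return (half-length r, full-length reliability) for an odd/even split.

    item_scores is a hypothetical layout: one list of item scores per examinee.
    """
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    r_full = 2 * r_half / (1 + r_half)  # Spearman-Brown step-up to full length
    return r_half, r_full


# Made-up 0/1 scores for five examinees on six items (illustration only).
scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
r_half, r_full = split_half_reliability(scores)
print(f"half-length r = {r_half:.2f}, full-length reliability = {r_full:.2f}")
```

For a positive half-length correlation, the stepped-up coefficient is always at least as large as the half-length correlation, which reflects the fact that the full test is twice as long as either half.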
Test items are classified as easy, moderately difficult or difficult if their difficulty indices are > 0.70, 0.31 to 0.70, or ≤ 0.30, respectively (Bichi, 2015).
Discrimination index (D): This refers to the ability of an item to distinguish between high- and low-scoring learners (Koçdar, Karadag & Sahin, 2016). According to Fourie et al. (2010), the discrimination index is denoted as D and is mathematically expressed as;
$$ D = \frac{R_u}{N_u} - \frac{R_l}{N_l} $$
Whereby: D = discrimination index, $R_u$ = number of students in the upper group who responded correctly, $R_l$ = number of students in the lower group who responded correctly, $N_u$ = number of students in the upper group, $N_l$ = number of students in the lower group. The procedure followed was adapted from Boopathiraj and Chellamani (2013) as follows;
i. The sample test papers were obtained from the administered university examinations, from which 214 examination papers per cohort were drawn through systematic random sampling.
ii. The upper 27% and lower 27% of examinees were identified by ranking the examinees from highest to lowest on the total test score, which gave 58 examinees in each of the upper and lower groups for all three cohorts.
iii. The calculations for each item were done using the relevant formula shown above.
The index ranges from -1.00 to +1.00, and items are classified as satisfactory, acceptable, marginal or poor if their discrimination indices are ≥ 0.40, 0.30 to 0.39, 0.20 to 0.29, or < 0.20, respectively (Bichi, 2015; Hingorjo & Jaleel, 2012). This index reflects the degree to which an item and the test as a whole measure a unitary ability; thus, values of the coefficient tend to be lower for tests measuring a wide range of content areas than for more homogeneous tests (Quaigrain & Arhin, 2017). Furthermore, higher performing students are expected to select the correct answer for each item more often than lower performing students (Hingorjo & Jaleel, 2012). A positive discrimination index (0.00 to +1.00) means that higher achievers answered a specific item correctly more often than lower achievers, and a negative discrimination index (-1.00 to 0.00) means the reverse (Hingorjo & Jaleel, 2012).
Table 1 shows the correlation coefficient r, expressing the reliability of the Introduction to Educational Psychology (EDP 100) university examinations administered to all first year students pursuing a Bachelor of Science with Education degree at Solomon Mahlangu College of Science and Education, Sokoine University of Agriculture. The results showed that the internal consistency reliabilities of the three examinations ranged from 0.49 to 0.74. A typical classroom test displays an internal consistency reliability of between 0.60 and 0.80, implying that, on average, between 20% and 40% of the variation in students' scores is a result of measurement error (Blerkom, 2009). The results in Table 1 showed that the examinations administered during the 2014/2015 and 2015/2016 academic years had internal consistency values that fall within these limits and hence are considered to have acceptable reliability.
On the other hand, the reliability coefficient of the examination administered during the 2017/2018 academic year showed low internal consistency compared with the proposed limits. Furthermore, the results in Table 1 indicated a decrease in reliability across the three years. Sokoine University of Agriculture has experienced a substantial increase in the number of students in almost all the programmes offered, including bachelor degrees in education, since its establishment. The observed decrease in reliability can be attributed to this increase in the number of students. Associating these results with the increase in the number of students is supported by the observation that class size has an effect on test reliability (Alomari & Akour, 2014). Blerkom (2009) specifies that teachers' lack of time and their incompetence in classroom test construction tend to lower the reliability of a test. Therefore, it can be established that the increase in the number of university students overloads instructors, who consequently run short of time for proper test construction.
Thus, from the table and following the classification by Bichi (2015), MCQ 4 and 6 were easy, MCQ 1, 2, 3, 7, 8, 9, 11, 12, 13, 14, 16, 17, 18 and 20 were moderately difficult, and MCQ 5, 10, 15 and 19 were difficult items in 2014/2015. MCQ 3, 9 and 20 were easy, MCQ 1, 2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18 and 19 were moderately difficult, and MCQ 4 was a difficult item during the 2015/2016 academic year. MCQ 1, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15 and 16 were moderately difficult, MCQ 2, 3 and 6 were difficult, and there were no easy items during the 2017/2018 academic year. This indicated that most of the items administered across the three years were moderately difficult, though some of the items needed improvement.
The discrimination indices for the 2014/2015 cohort indicated that MCQ 1, 3 and 4 had satisfactory discrimination indices, MCQ 2, 7, 8, 16, 17 and 20 had acceptable discrimination indices, MCQ 9, 12 and 13 had marginal discrimination indices, and MCQ 4, 5, 6, 10, 11, 14, 15, 18 and 19 had poor discrimination indices. For the 2015/2016 cohort, MCQ 1, 2, 6, 10 and 16; MCQ 9, 12 and 17; and MCQ 3, 4, 7, 8, 12, 14, 18, 19 and 20 had acceptable, marginal and poor discrimination indices, respectively, with no satisfactory items. Furthermore, the discrimination indices for MCQ 9; MCQ 8; MCQ 1, 4, 15 and 16; and MCQ 2, 3, 5, 6, 7, 10, 11, 12, 13, 14, 17, 18, 19 and 20 in the 2017/2018 cohort were satisfactory, acceptable, marginal and poor, respectively. These results show that 53.33% of the items administered across the three years had poor discrimination indices and failed to discriminate between higher and lower achievers. Mehrens and Lehmann (1991) pointed out two reasons for poor discrimination indices: firstly, items that are too difficult or too easy have lower discrimination power; and secondly, the purpose of an item in relation to the total test influences the magnitude of its discrimination power. The results of this study, however, showed mostly moderately difficult items, with fewer easy or difficult items. Furthermore, negative discrimination indices were observed in MCQ 6, 15 and 19 of the 2014/2015 cohort and MCQ 17 of the 2017/2018 cohort. There are several reasons why items may have negative discrimination indices. Quaigrain and Arhin (2017) argued that a wrong or ambiguous key in framing a question contributes to negative discrimination power. Furthermore, Matlock-Hetzel (1997) pointed out that items with negative discrimination indices are useless and tend to lower the validity of the test.
Therefore, whenever items with negative discrimination indices are observed in a test, they should be examined to determine why a negative value was obtained (Quaigrain & Arhin, 2017).
Table 3 shows the difficulty and discrimination indices of the matching item questions (MIQ) for the three selected cohorts. In the 2014/2015 cohort, 17 items were administered, while 20 items were administered in each of the 2015/2016 and 2017/2018 cohorts. The results indicated that MIQ 4; MIQ 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12 and 16; and MIQ 13, 14, 15 and 17 were easy, moderately difficult and difficult items, respectively, during the 2014/2015 cohort. MIQ 2, 5, 6, 7, 8, 9, 10, 11, 14, 15, 17, 18, 19 and 20 were moderately difficult, while MIQ 1, 3, 4, 10, 12, 13 and 16 were difficult items during the 2015/2016 cohort, with no easy items. MIQ 12, 17 and 18 were easy items, MIQ 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 13, 14, 15, 16 and 20 were moderately difficult items, while MIQ 7 and 19 were difficult items during the 2017/2018 cohort. On the other hand, the discrimination indices for MIQ 1, 2, 3, 4, 8, 11, 12, 14 and 16; MIQ 10; MIQ 5, 6 and 9; and MIQ 7, 13, 15 and 17 were satisfactory, acceptable, marginal and poor, respectively, during the 2014/2015 cohort. MIQ 7, 14 and 20; MIQ 5, 9 and 17; MIQ 2, 13, 18 and 19; and MIQ 1, 3, 4, 6, 8, 10, 11, 12, 15 and 16 had satisfactory, acceptable, marginal and poor discrimination indices, respectively, during the 2015/2016 cohort, and MIQ 3, 9 and 20; MIQ 1, 2, 5, 6, 8, 12, 15 and 16; MIQ 10, 11 and 16; and MIQ 4, 7, 17, 18 and 19 had satisfactory, acceptable, marginal and poor discrimination indices, respectively, during the 2017/2018 cohort.
Table 4 shows the difficulty and discrimination indices of the essay type questions (ETQ) for the three cohorts. The results indicated that both ETQ 1 and 2 in the 2014/2015 cohort had moderate difficulty indices, while in the 2015/2016 cohort ETQ 1 was easy and ETQ 2 and 3 were moderately difficult.
ETQ 1 and 2 in the 2017/2018 cohort were moderately difficult and easy items, respectively. Both ETQ in 2014/2015 had acceptable discrimination indices; in the 2015/2016 cohort, ETQ 1 had marginal discrimination power, while ETQ 2 and 3 had poor discrimination power. On the other hand, in the 2017/2018 cohort ETQ 1 had high and satisfactory discrimination power, while ETQ 2 had a poor discrimination index.
Table 4: Difficulty and Discrimination Indices of Essay Items
Table 5 compares the means of the difficulty and discrimination indices for the university examinations administered to the three cohorts. Although the difficulty indices were on average moderate across the three years, the discrimination indices for MCQ were on average poor, and marginal, poor and satisfactory items were found in all the examinations administered to the three cohorts. The variation was not statistically significant for the two types of questions (p > 0.05 for MCQ and MIQ), with the exception of the discrimination index for MIQ, which showed significant variation across the three years (p < 0.05), as shown in Table 5.
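The item-level results reported above follow the upper and lower 27% group procedure and the classification thresholds set out in the Statistical Analysis section. A minimal Python sketch of that procedure is given below for objectively scored (0/1) items. It rests on several assumptions: the function names and data layout are hypothetical, the discrimination index is taken as the difference between the proportions of correct answers in the upper and lower groups, the poor category is taken as any value below 0.20, and the ten response rows are made up rather than drawn from the EDP 100 scripts.

```python
def group_size(n_examinees, fraction=0.27):
    """Number of examinees in each of the upper and lower groups."""
    return round(fraction * n_examinees)


def item_analysis(responses, fraction=0.27):
    """Return a (difficulty, discrimination) pair for every item.

    responses is a hypothetical layout: one list of 0/1 item scores per examinee.
    """
    n_group = group_size(len(responses), fraction)
    if n_group == 0:
        raise ValueError("too few examinees to form upper and lower groups")
    ranked = sorted(responses, key=sum, reverse=True)  # highest total first
    upper, lower = ranked[:n_group], ranked[-n_group:]
    indices = []
    for i in range(len(responses[0])):
        r_u = sum(row[i] for row in upper)  # correct answers in upper group
        r_l = sum(row[i] for row in lower)  # correct answers in lower group
        p = (r_u + r_l) / (2 * n_group)     # difficulty index
        d = r_u / n_group - r_l / n_group   # discrimination index (assumed form)
        indices.append((p, d))
    return indices


def classify_difficulty(p):
    if p > 0.70:
        return "easy"
    if p > 0.30:
        return "moderate"
    return "difficult"


def classify_discrimination(d):
    if d >= 0.40:
        return "satisfactory"
    if d >= 0.30:
        return "acceptable"
    if d >= 0.20:
        return "marginal"
    return "poor"


# Made-up responses of ten examinees to three items (illustration only).
responses = [
    [1, 1, 1], [1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 1, 0],
    [1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 0],
]
indices = item_analysis(responses)
for number, (p, d) in enumerate(indices, start=1):
    print(number, round(p, 2), classify_difficulty(p),
          round(d, 2), classify_discrimination(d))
```

With 214 scripts per cohort, `group_size(214)` gives 58, which matches the group size reported in the procedure. Ties in total score at the group boundary are resolved here simply by the original order of the scripts, which is a simplification.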

Conclusion
This study revealed that, although the majority of the EDP 100 questions were moderately difficult, their discrimination power was poor. However, the variation in difficulty and discrimination indices across the three cohorts was not statistically significant, with the exception of the discrimination index for MIQ, which varied significantly across years. There was also a drop in internal consistency across the three cohorts. This could be partly associated with the increase in the number of university students, which increases instructors' workload and may limit the time instructors have to concentrate on effective test construction. Therefore, to tackle these challenges, instructors need to be conversant with item analysis and apply it frequently during formative assessment. Such analysis will help identify strong and weak test items as early as possible, before they are included in summative assessment tools. It is also recommended that similar studies be done to determine both the validity and the reliability of the assessment tools for other subjects at the University.