The Development and Validation of a Science Achievement Test

Academic achievement is often regarded as a determinant of academic success. One of the most common ways to assess student achievement is through a well-constructed achievement test. The main objective of this study was to develop and validate an achievement test in science for senior secondary school students of Grade 10. The Science Achievement Test (SAT) was developed by an excellent science teacher based on the Malaysian Science Curriculum Specification. The SAT consists of 50 multiple-choice questions covering all eight chapters of Grade 10 science. The SAT was validated by experts and analysed for its difficulty index (p), discrimination index (d) and internal consistency reliability. Data were obtained from a purposive sample of 50 students in a pilot study carried out in a secondary school in Sarawak, Malaysia. Based on the difficulty index, there are 12 easy items, 33 moderate items and 5 difficult items. Seven items were found to be poor and needed to be modified or removed due to their weak discriminating power. The reliability of the SAT based on KR-20 and the split-half method showed coefficients of 0.862 and 0.851 respectively. The study indicates that the SAT is a valid and reliable tool for measuring students' achievement in science.

assessment (Baig et al., 2014). Moreover, MCQs are appropriate for measuring knowledge and comprehension, and can be designed to measure application and analysis (Abdel-Hameed et al., 2005). MCQs are being used increasingly due to their higher reliability, validity and ease of scoring (Case & Swanson, 2003; Tarrant & Ware, 2012).
According to Kamaruzaman (2003), item analysis needs to be done to determine whether a constructed item is good or weak. Good and weak items can be identified by their Difficulty Index (F) values. Meanwhile, the Discrimination Index (D) is an index used to compare high-performing and low-performing students (Shafizan, 2013; Cohen et al., 2011; Kamaruzaman, 2003). According to Hopkins et al. (1990), if the mean D value of a test is high, then the test has high reliability.

Method
The Science Achievement Test (SAT) is a paper-and-pencil test designed to measure student achievement levels. The test consists of 50 multiple-choice questions. The SAT was developed by an excellent teacher with more than 10 years of experience teaching science. The content validity of the SAT was then established by two experts in science subjects, who reviewed its content and format. The SAT was drafted based on the Form Four Science Curriculum Specification (Curriculum Development Section, 2012) and a Test Specification Table (TST) constructed according to Bloom's Taxonomy (Bloom, 1956).
Content analysis is another very important phase in the construction of an achievement test (Sharma & Sarita, 2018). The content of the Form Four science syllabus is based on five main themes, namely 'Introduction to Science', 'Maintenance and Continuity of Life', 'Substance and Nature', 'Energy in Life' and 'Technology and Industrial Development in Society'. These five themes are further subdivided into eight topics, namely 'Scientific Investigation', 'Body Coordination', 'Generation and Variation', 'Substance and Material', 'Energy and Chemical Change', 'Nuclear Energy', 'Light, Colour and Vision' and 'Chemicals in the Industry'. The test was formulated based on the latest 'Malaysian Certificate of Education' format shown in Table 3.8. The distribution of SAT items across the Form Four science topics is shown in Table 3.9. A pilot study of the SAT was conducted with 50 Form Five students at one of the national schools in Limbang district. The duration of the test was one hour and fifteen minutes. The students' results were used to determine the test's reliability and validity through item analysis: the difficulty index, the discriminant index and the Kuder-Richardson Formula 20.
There are differing opinions on what counts as an acceptable difficulty index value. For example, Macintosh and Morrison (1969) consider good F values to be between 0.4 and 0.6, while Hanna and Dettmer (2004) consider F values in the range of 0.3 to 0.6 to be good. The value of the difficulty index (F) is calculated by the formula:

F = (Ru + Rl) / (Nu + Nl)

Note:
F = difficulty index
Ru = the number of students in the upper group who respond correctly
Rl = the number of students in the lower group who respond correctly
Nu = the total number of students in the upper group
Nl = the total number of students in the lower group

In this study, the difficulty indices were analysed using the Henning (1987) guidelines, which classify items by their F values as easy, moderate or high (difficult). [Table: Henning (1987) difficulty index categories]

Besides, a good item should be able to separate, or discriminate between, high scorers and low scorers on the test as a whole. Thus, the discriminant index for each item was also calculated. The value of D is obtained by the formula:

D = (Ru − Rl) / (N / 2)

Note:
D = discriminant index
Nu = the total number of students in the upper group
Nl = the total number of students in the lower group
N = total number of students in the upper and lower groups

The discriminant indices were analysed by referring to Ebel's (1979) suggestion:

0.40 and above: very good items
0.30 to 0.39: reasonably good items, possibly subject to improvement
0.20 to 0.29: marginal items, usually subject to improvement
0.19 and below: poor items, to be rejected or improved by revision

From Ebel (1979, p. 267)

In addition, the reliability of the SAT was examined using the Kuder-Richardson Formula 20. The Kuder-Richardson Formula 20 (KR-20), first published by Kuder and Richardson in 1937, is a measure of internal consistency (reliability) for measures with dichotomous choices. The Kuder-Richardson formula has two versions (KR-20 and KR-21), intended for achievement and psychological test items respectively. In this study, KR-20 was used since the SAT comprises items of different levels of difficulty.
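As a minimal sketch, the difficulty and discriminant indices can be computed directly from the upper- and lower-group responses. The Python below assumes hypothetical 0/1 response vectors (1 = correct) and equal group sizes; it is an illustration of the two formulas, not the study's actual analysis:

```python
def difficulty_index(upper, lower):
    """F = (Ru + Rl) / (Nu + Nl): proportion correct across both groups."""
    ru, rl = sum(upper), sum(lower)
    return (ru + rl) / (len(upper) + len(lower))

def discrimination_index(upper, lower):
    """D = (Ru - Rl) / (N / 2), where N is the combined group size."""
    ru, rl = sum(upper), sum(lower)
    n = len(upper) + len(lower)
    return (ru - rl) / (n / 2)

# Hypothetical single item, 10 students per group
upper = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]  # 8 of 10 correct in upper group
lower = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # 3 of 10 correct in lower group

print(difficulty_index(upper, lower))      # 0.55 -> moderate difficulty
print(discrimination_index(upper, lower))  # 0.5  -> very good (Ebel, 1979)
```

In practice the upper and lower groups are usually formed from the top and bottom scorers (often 27% each) on the total test before the per-item counts are taken.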
The formula for estimating reliability is:

KR-20 = (K / (K − 1)) × (1 − ΣPQ / SD²)

Note:
K = number of items
SD² = variance of the scores on the test (the square of the standard deviation)
P = proportion of examinees who responded correctly to an item
Q = proportion of examinees who responded incorrectly (Q = 1 − P); PQ is summed over all items

Besides, the internal consistency reliability was also tested using the split-half method. The usual split-half procedure divides the items into two equivalent halves by placing all odd-numbered items in one half and all even-numbered items in the other, and then computes the correlation coefficient between the scores on the two halves.

Results and Discussion
The researchers conducted item analysis to determine the Difficulty Index (F) and Discrimination Index (D) based on the results of the pilot test. This method is used to ensure that each selected item actually meets the requirements for difficulty and reliability, and is free of unnecessary information and irrelevant material (Cohen et al., 2011; Linn, 1993).
The table below shows the difficulty index (F) of each item according to Henning (1987).
Generally, items of moderate difficulty are preferred to those which are much easier or much harder (Boopathiraj & Chellamani, 2013). However, Vincent and Lajium (2014) consider good items to show an F value of between 0.30 and 0.80. Some items had F values of less than 0.30, meaning they were too difficult for students. Items 8, 20, 41 and 48 were too difficult and needed to be modified or removed. Only a single item was considered too easy: item 32, with a difficulty index of 0.79, close to 0.80.
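The screening step described above can be expressed as a small filter over the per-item F values. In this sketch the acceptable band follows Vincent and Lajium (2014); apart from item 32's 0.79 mentioned above, the numeric F values are illustrative stand-ins, not the pilot-study results:

```python
ACCEPT_LOW, ACCEPT_HIGH = 0.30, 0.80  # Vincent & Lajium (2014) band

def flag_items(f_values):
    """Return {item_number: reason} for items outside the acceptable F range."""
    flags = {}
    for item, f in f_values.items():
        if f < ACCEPT_LOW:
            flags[item] = "too difficult"
        elif f > ACCEPT_HIGH:
            flags[item] = "too easy"
    return flags

# Illustrative values for the items discussed in the text
f_values = {8: 0.22, 20: 0.26, 32: 0.79, 41: 0.18, 48: 0.28}
print(flag_items(f_values))
# items 8, 20, 41 and 48 are flagged as too difficult;
# item 32 (0.79) sits just inside the 0.30-0.80 band
```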

Conclusion
The SAT is an achievement test developed to assess students' level and performance in science. The SAT showed high internal consistency reliability, with a KR-20 coefficient of 0.862. A second test of reliability, the split-half method, yielded a coefficient of 0.851. Hence, the SAT is a valid and reliable achievement test. According to the difficulty indices, there are 12 easy items, 33 moderate items and 5 difficult items. The analysis of discriminant indices showed that 7 items needed to be removed or modified. The good items may be stored in a question bank for future reference. For researchers who aim to study achievement in science, especially in the Malaysian context, the SAT could be a good reference. The methods may also serve those who want to develop achievement tests on particular topics or for different educational levels.
Besides analysing the difficulty and discriminant indices, future researchers are advised to carry out distractor analysis as well. Distractor analysis shows how effectively the distractors function by drawing test takers away from the correct answer (Crocker & Algina, 1986). Distractors selected by students because of their misconceptions can inform the instructor about which skills need to be improved in order to eliminate those misconceptions (Gierl et al., 2017). This can be a very informative analysis for test developers and can increase test quality.
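A basic form of distractor analysis simply tallies how often each option is chosen for an item; options that attract almost no examinees are not functioning as distractors. The sketch below uses hypothetical responses for a single item whose key is 'B':

```python
from collections import Counter

def distractor_counts(responses):
    """responses: list of chosen options for one item, e.g. ['A', 'C', ...]."""
    return Counter(responses)

# Hypothetical responses from 20 examinees; the correct answer is 'B'
responses = list("BABCABBDABCBBABDBCBB")
counts = distractor_counts(responses)
print(counts)  # Counter({'B': 11, 'A': 4, 'C': 3, 'D': 2})

# Option 'D' attracts only two examinees; if the low-scoring group also
# avoids it, it is a weak distractor and a candidate for replacement.
```

A fuller analysis would compute these counts separately for the upper and lower scoring groups, since a good distractor should be chosen more often by the lower group.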
To conclude, developing a quality achievement test is a challenging and complicated process. Hence, test developers, especially educators, should acquire the skills needed to analyse an achievement test. A proper method of analysing the difficulty indices, discriminant indices as well as the