Restarting and repeated assessment

Exact has two forms of equivalent difficulty, Form A and Form B, which allow for retesting or repeated assessment if desired. The two forms can be alternated over time in order to record progress, e.g. in response to intervention given to students with literacy difficulties (referred to here as ‘continuous assessment’). However, the current version of Exact was designed primarily for identifying students who require access arrangements in examinations – i.e. for identifying significant weaknesses in literacy skills – not specifically for continuous assessment, which is focused on measuring improvement in literacy skills as a result of educational input.

Cautions regarding retesting and repeated assessment

When embarking on retesting or continuous assessment, it is particularly important to remember that when students are assessed on any psychometric test (whether administered conventionally or by computer) and the test (or a parallel form of it) is given again some time later, it must not be expected that the scores of all students will either stay the same or increase. Inevitably some will show a decline in scores, but this should not be taken to indicate that those students have necessarily declined in the relevant ability. The reasons for this include not only the unsystematic and unpredictable variations to which all human performance is naturally subject, but also certain systematic factors that can dramatically influence test results: rate of working, practice effects and regression to the mean. If misinterpretation of results is to be avoided when Exact is used repeatedly, it is important that administrators understand these factors and are fully aware of their possible impact on results.

Rate of working

Since Exact was designed primarily for access arrangements assessments, all the assessments have strict time constraints (i.e. they are ‘timed’ tests). The reason for this has been explained previously. In all timed tests the rate at which the student works is an inherent factor in determining the results. The type of time constraint differs across the tests, and rate of working matters more in some tests than in others. In the tests of word recognition and spelling, a given time is allowed for each item: a fixed time limit (5 seconds per item) in the word recognition test and a variable time limit (geared to word length) in the spelling test. In the reading comprehension, handwriting to dictation and typing to dictation tests, an overall time limit is imposed rather than a time limit per item. In reading comprehension, 10 minutes is allowed for the whole test, and students are required to attempt as many items as they can within that time (but not necessarily to attempt all the items). Both handwriting to dictation and typing to dictation have an overall time limit of 7 minutes.

The word recognition and spelling tests may be regarded as simple tests because:

  • they comprise a large number of items
  • items are independent of each other
  • the student either knows or does not know the answer
  • as soon as one item has been completed the student is immediately presented with the next.

This means that the task is automatically paced by the time constraints placed on each item. If an item is not completed within the time limit, the program automatically advances to the next item, and so on. Although variation in speed of working between different students is likely to affect the results, variation in the speed at which an individual student works on different testing occasions is unlikely to affect the results very much. Consequently, time is a less important factor in these tests, and differences in scores on these tests from one occasion to the next are principally a result of changes in student ability rather than speed of working.

The dictation tests employ a different type of time constraint but the nature of the tasks means that each item (i.e. a phrase heard by the student) can be regarded as essentially independent of the other items (i.e. previous and subsequent phrases in the passage). As soon as the student signals they have finished writing or typing one phrase, the computer automatically gives the next phrase, and so on until the time limit is reached. The task is self-paced and, in principle, differences in scores on these tests from one occasion to the next could arise from the student not working as hard on one of the occasions as on the other. In practice, however, it turns out that time is not such an important factor because item independence coupled with task simplicity results in differences between testing occasions being principally attributable to changes in writing ability.

The reading comprehension test contrasts markedly with the other tests in the Exact suite. Text passages are presented, and several items relating to each passage have to be attempted within the time limit. To answer questions, the student may have to refer back to the text or consider answers to previous questions. Hence this test may be regarded as complex rather than simple, because items within a passage are not independent and it is not a case of either knowing or not knowing the answer but, rather, of being prepared to devote sustained mental effort at an optimum rate over the whole task in order to work out each answer. The ‘optimum rate’ is a speed consistent with the student’s word recognition and verbal comprehension ability. If students exceed their optimum speed, they will make more word recognition errors and be more likely to misunderstand sentences, which will result in a lower score. If they read more slowly than their optimum speed, they will have insufficient time to attempt all the passages and so will be denied the opportunity of answering more questions than less able readers, which will also result in a lower score. Either way, the student will appear to be a less able reader than is really the case.

It should be apparent that if students are tired, less well motivated or not in a positive mood, or if they perceive that the consequences of reduced effort will not matter very much, they will tend to work more slowly and be less inclined to put in the necessary cognitive effort. If this happens, their scores are unlikely to reflect their true ability.
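
To make the contrast concrete, the sketch below (purely illustrative, and not the Exact program’s actual code) compares an item-paced scheme, where each item carries its own time limit, with an overall-limit scheme, where working more slowly reduces the number of items the student ever attempts. The time limits and response times used are assumptions for illustration only.

```python
# Illustrative sketch only (not the Exact program's actual code): how the two
# timing schemes respond to a change in the student's rate of working.
# The time limits and response times below are assumed for illustration.

def item_paced_score(responses, time_limit_per_item=5.0):
    """Item-paced scheme (e.g. word recognition): every item is presented, each
    with its own time limit, so the pace is set by the program. Only accuracy
    within the per-item limit affects the score."""
    return sum(1 for correct, seconds in responses
               if correct and seconds <= time_limit_per_item)

def overall_limit_score(responses, overall_time_limit=600.0):
    """Overall-limit scheme (e.g. reading comprehension, 10 minutes): the
    student works at their own rate, so working more slowly means fewer items
    are ever attempted, even if accuracy is unchanged."""
    score, elapsed = 0, 0.0
    for correct, seconds in responses:
        elapsed += seconds
        if elapsed > overall_time_limit:
            break  # remaining items are never attempted
        score += int(correct)
    return score

# The same 40 correct answers, worked through at half the speed, earn a lower
# score once the overall time limit cuts off the later items.
normal_rate = [(True, 20.0)] * 40
half_speed  = [(True, 40.0)] * 40
print(overall_limit_score(normal_rate), overall_limit_score(half_speed))  # 30 15
```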

Practice effects

‘Practice effects’ are the positive or negative psychological impacts of previous assessment(s) on a student’s performance.11 Positive impacts, which include factors such as item familiarity and increased confidence as a result of previous experience with the tasks, tend to inflate scores on subsequent assessment occasions. Negative impacts, which include factors such as decreased motivation due to boredom with the tasks or overconfidence as a result of feedback from previous assessments, tend to deflate scores on subsequent assessment occasions. In general, the magnitude of practice effects is a function of how often students have been assessed and the time interval between assessments. Both positive and negative psychological impacts tend to increase as the time interval between assessments decreases. Furthermore, practice effects will not necessarily affect all students to the same extent. Some students may experience more negative effects, while others may experience more positive effects.

11 For further explanation of practice effects see: Kulik, J.A., Kulik, C-L.C. and Bangert, R.L. (1984) Effects of Practice on Aptitude and Achievement Test Scores. American Educational Research Journal, 21, 435-447.

Regression to the mean

All test scores, by their very nature, are variable. On any psychometric test the actual score obtained is an estimate of the student’s true score, which is likely to fall within a certain range of the actual score; this range is known as the ‘confidence interval’. This means that one can have a stated level of confidence (in this case, 90% confidence) that the student’s true score lies within plus or minus the confidence interval of the actual score. On another occasion on the same test, the same student is likely to score slightly differently, whether higher or lower than the previous score. The confidence interval for any test is determined by the test’s reliability – i.e. the extent to which it can be relied on to give the same result on another occasion.
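
As a concrete illustration of how reliability governs the width of a confidence interval, the sketch below applies the standard psychometric relationship SEM = SD × √(1 − reliability), with a 90% interval of ±1.645 SEM. The reliability coefficient and standard-score scale used (0.90, mean 100, SD 15) are assumptions for illustration; Exact’s own reliability figures, not these, should be used in practice.

```python
# Illustrative sketch only: how a test's reliability determines the width of
# the confidence interval around an observed score. The reliability (0.90) and
# the standard-score scale (mean 100, SD 15) are assumed for illustration.

import math

def confidence_interval_90(observed_score, reliability, sd=15.0):
    """Standard error of measurement: SEM = SD * sqrt(1 - reliability).
    A 90% confidence interval is the observed score +/- 1.645 * SEM."""
    sem = sd * math.sqrt(1.0 - reliability)
    return observed_score - 1.645 * sem, observed_score + 1.645 * sem

low, high = confidence_interval_90(observed_score=85, reliability=0.90)
print(f"90% confidence interval: {low:.1f} to {high:.1f}")  # about 77.2 to 92.8
```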

Furthermore, human beings do not perform at the same level on every occasion, and some assessment tasks are more influenced by this variability than others. Over time, a person’s skills may increase as a result of learning, practice and general experience. However, many other things also influence performance, such as mood, motivation, tiredness, instructions and perceived consequences. As explained above, simple tests, i.e. ones where the student either knows or does not know the answer (e.g. Exact word recognition and Exact spelling), are less subject to such influences than more complex tests, such as Exact reading comprehension, where it is not a case of either knowing or not knowing the answers but, rather, of being prepared to devote sustained mental effort at an optimum rate over the whole task in order to work out the answers.

Coupled with the general tendency of any measurement process to involve a degree of random error, these natural variations in test scores result in the statistical phenomenon known as ‘regression toward the mean’, whereby a score that is extreme (i.e. further away from average performance) on its first measurement will tend to be closer to the average on its second measurement, and if it is extreme on its second measurement, it will tend to have been closer to the average on its first.12

12 Upton, G. & Cook, I. (2006) Oxford Dictionary of Statistics, Oxford University Press.

Stigler, S.M. (1997) Regression toward the mean. Statistical Methods in Medical Research, 6, 103-114.

Tweney, R.D. (2013) Reflections on regression toward the mean. Theory and Psychology, 23, 271-274.
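
The size of this effect can be illustrated with the standard relationship between test-retest reliability and the expected retest score: assuming no real change in ability, the expected second score is the mean plus the reliability coefficient multiplied by the first score’s distance from the mean. The figures in the sketch below (standard-score mean of 100, reliability of 0.85) are assumptions for illustration only.

```python
# Illustrative sketch only: regression toward the mean. With test-retest
# reliability r, and assuming no real change in ability, the expected score on
# a second administration is  mean + r * (first_score - mean).
# The mean (100) and reliability (0.85) are assumed for illustration.

def expected_retest_score(first_score, reliability=0.85, mean=100.0):
    """The further the first score lies from the mean, the larger the expected
    movement back toward the mean on retest."""
    return mean + reliability * (first_score - mean)

print(expected_retest_score(70))   # 74.5 - a very low first score tends to rise
print(expected_retest_score(130))  # 125.5 - a very high first score tends to fall
```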

What interval should be allowed before retesting?

The previous three subsections make clear that, when carrying out repeated testing, variation in test performance is always to be expected, and gains cannot be counted on. Even when the best teaching has been provided, it is likely that a few students will exhibit apparent drops in performance from one test occasion to the next. This is due to various factors, including the impact of rate of working (more noticeable in complex as opposed to simple test formats), practice effects (more pronounced if the interval between testing occasions is short), and regression toward the mean (scores that are extreme on the first measurement will tend to be closer to the average on the second measurement). When interpreting results and conveying them to pupils, teachers or parents, or when using results to demonstrate ‘value added’, it is essential that administrators take these factors into account and avoid drawing naïve or simplistic conclusions from changes in scores from one testing occasion to the next.

Therefore, it is recommended that, in normal circumstances, the interval between successive assessments should preferably be one year or, at the very least, one term or semester. Even though there are two parallel forms, if the period between successive assessments is relatively short (i.e. a matter of weeks, or up to a school term or semester), practice effects could still arise and confound results. Research has shown that when retesting takes place after a long school holiday, performance is more likely to have declined.13

Occasionally, exceptional situations may arise where a teacher needs to re-administer one or more of the tests in Exact after a much shorter interval, e.g. if it is discovered that the student was unwell when first taking the tests, if a fire drill interrupted the assessment, or if the student was clearly not applying proper attention or effort to the tasks. In such cases, the results are unlikely to give a true indication of abilities and it is permissible to retest the student. Nevertheless, there should be a delay of at least two weeks before re-administering the test(s), and the alternative form should be used. The first result should be discarded and the second result should be regarded as the true result.

13 Sainsbury, M., Whetton, C., Mason, K. and Schagen, I. (1998) Fallback in attainment on transfer at age 11: evidence from the Summer Literacy Schools evaluation. Educational Research, 40, 73-81.

Davies, B. & Kerry, T. (1999) Improving student learning through calendar change. School Leadership and Management, 19(3), 359-371.