One of our most popular products is Key Stage 2 practice SATs. We set up 4 different windows across 4 half terms: Autumn 1, Autumn 2, Spring 1 and Spring 2, each with a different past paper. Schools enter their data and we provide a range of standardised grades, including percentile ranks and predicted performance indicators (i.e. BLW, WTS, EXS, HS).
There’s a lot more to Smartgrade — for example we provide question and topic level analysis so that teachers can interrogate where there are gaps in learning — but for the purpose of this blog, we’re going to focus on the overall grades. Specifically, we’ll be examining how accurate our predicted performance indicators prove to be. Last year we had over 200 schools participate, leading to cohorts of up to 10,000 students for individual assessments, so we know we’ve got enough data to run a solid standardisation. However, to analyse accuracy we also needed some actual SATs grades to compare with our predictions. Thankfully, a few of our partner MATs with over 50 schools and 3,000 Y6 students between them supported us in accessing actual SATs data for students for whom we also have KS2 practice SATs data, so we’re now able to answer the burning question: how accurate are our estimates?
Performance Indicator predictions
Looking at the performance indicator grades from our Spring 2 practice SATs assessments, we found that across Mathematics, Reading and GPS we correctly predicted the SATs outcome 76% of the time. In other words, if we told you that a student was on track for EXS in Spring 2, they had a 76% likelihood of achieving EXS in their actual SATs exams. Maths predictions were the most accurate, with an 82% accuracy rate. GPS was next at 79%, and Reading was 68%. It's also worth noting that students performed at the level of our prediction or better 89% of the time; or in other words, only 11% of students underperformed against their prediction.
The following chart shows the full breakdown, with 0 representing an exact match between the two grades, -1 indicating that we estimated a lower grade than the actual performance (e.g. we estimated EXS and they got HS), and +1 indicating that we were a grade above the final outcome (e.g. we estimated HS and they got EXS):
The key thing to note here is the broadly similar distribution of grades either side of 0. To be precise, we estimated a lower grade 11% of the time and a higher grade 13% of the time. In Mathematics the split was perfectly balanced: we were up 9% of the time and down 9% of the time. In terms of confidence in our methodology, the balance of grades on either side is in some ways even more important than the accuracy, in that it shows we were not significantly over- or under-estimating performance. Of course, it is impossible to get 100% accuracy, because exam performance varies based on a range of factors including how a student feels on the day, how closely the paper's content matches their knowledge, and so on. So what we care about is both accuracy and any potential skew in the prediction distribution.
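For anyone who wants to run the same check on their own data, here is a minimal Python sketch of how these headline figures can be computed from matched predicted and actual performance indicators. The example grades below are illustrative assumptions, not real Smartgrade or SATs results:

```python
from collections import Counter

# Performance indicators ordered from lowest to highest
GRADES = ["BLW", "WTS", "EXS", "HS"]
RANK = {g: i for i, g in enumerate(GRADES)}

# Hypothetical matched pairs of (predicted, actual) grades per student
predicted = ["EXS", "EXS", "HS", "WTS", "EXS", "HS"]
actual    = ["EXS", "HS",  "HS", "WTS", "EXS", "EXS"]

# Grade difference per student: 0 = exact match, -1 = we predicted one grade
# below the actual outcome, +1 = one grade above it, and so on.
diffs = [RANK[p] - RANK[a] for p, a in zip(predicted, actual)]
n = len(diffs)

accuracy = sum(d == 0 for d in diffs) / n        # exact-match accuracy
at_or_above = sum(d <= 0 for d in diffs) / n     # performed at prediction or better

print(f"Exact-match accuracy: {accuracy:.0%}")
print(f"Performed at prediction or better: {at_or_above:.0%}")
print("Grade difference distribution:", dict(sorted(Counter(diffs).items())))
```

Run over a real set of matched grades, this is enough to reproduce both the headline accuracy figures and the -1/0/+1 breakdown shown in the chart above.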
Scaled Score predictions
Another way of looking at things is to convert our predictions into predicted scaled scores and plot them against the actual scaled scores. Here's what we found when we did this for maths:
A few things stand out from looking at the converted scaled score data:
- Overall our predictions were pretty accurate — the line of perfect correlation pretty neatly bisects the results distribution. That said, the data does reveal that when analysed this way our predictions were slightly on the pessimistic side: the average actual scaled score was around 0.3 points higher than the average predicted score.
- Each SATs scaled score outcome could be achieved from a really wide range of predicted scaled scores, even this close to the final assessment. For example, an actual scaled score of 100 was achieved by students with a predicted scaled score of between 94 and 108! That isn't a complete surprise, since we know that assessment performance can vary in this way, but it is important to understand when analysing practice SATs data.
- Looking at the results probabilistically, we found that achieving a maths predicted scaled score of 100 in Spring 2 led to an actual SATs score of 100 or more 72% of the time. Clearly that’s not enough to say that a predicted scaled score of 100 means the student is securely at the expected standard of 100+; to get to a 95%+ chance of achieving 100+ you’d have needed 103+ in your practice SATs.
- Doing a similar analysis for the Higher Standard threshold of 110, we saw that a predicted scaled score of 110 in Spring 2 led to an actual SATs outcome of 110+ 62% of the time. However, to get a 95% chance of achieving 110+, you needed a predicted scaled score of 117+.
- The Pearson correlation coefficient (where 1 is a perfect linear correlation) between our two scaled score datasets for mathematics is 0.92, which is pretty good! As a reference point, when the Education Endowment Foundation published a paper looking at the correlation between commercial standardised tests and actual exam outcomes, it found correlations of between 0.7 and 0.8 for Mathematics and 0.6 and 0.7 for Reading. (For anyone curious, we sketch how this kind of correlation and threshold analysis can be run in the code example after this list.)
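Here is a minimal Python sketch of that correlation and threshold analysis, assuming you hold paired arrays of predicted and actual scaled scores for one subject. The synthetic data and the chance_of_threshold helper are illustrative assumptions rather than our production code:

```python
import numpy as np

# Purely illustrative synthetic data standing in for paired predicted/actual
# scaled scores; not real Smartgrade or SATs results.
rng = np.random.default_rng(0)
predicted = np.round(rng.normal(104, 6, size=2000)).clip(80, 120)
actual = np.round(predicted + rng.normal(0.3, 2.5, size=2000)).clip(80, 120)

# Pearson correlation coefficient between the two sets of scaled scores
r = np.corrcoef(predicted, actual)[0, 1]
print(f"Pearson r: {r:.2f}")

def chance_of_threshold(pred_score: float, threshold: float) -> float:
    """Estimate P(actual score >= threshold | predicted score == pred_score)."""
    mask = predicted == pred_score
    return float((actual[mask] >= threshold).mean())

# e.g. how often a given predicted score converted into the expected standard (100+)
for score in (100, 103):
    print(f"P(actual >= 100 | predicted == {score}): {chance_of_threshold(score, 100):.0%}")
```

Swapping the synthetic arrays for real matched scaled scores is all that is needed to reproduce the correlation and conversion figures quoted above.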
Of course, all this relates to our Spring 2 assessments, and we also do practice SATs at 3 other points in the year. So we’ve also interrogated our Autumn 2 practice SATs data, and we’ve found pretty similar results: at the Performance Indicator level, Maths was accurate 80% of the time (-2% vs Spring 2), GPS was accurate 74% of the time (-5% vs Spring 2), and Reading was accurate 69% of the time (+1% vs Spring 2).
We were particularly struck by the Reading results: it seems intuitively correct that Reading is the hardest to predict accurately, given that results are likely to vary based on a pupil's familiarity with the texts included in the exam (think, for example, of the notorious Bats in Texas extract in summer 2023). However, we weren't sure whether proximity to the exam would improve prediction accuracy, and for Reading at least, it seems like it doesn't.
Conclusions
Reflecting on what this all tells us, two big things stand out:
First, we think this shows we're providing a useful and accurate service! We're proud both of the accuracy of our predictions and of the relatively unskewed nature of their distributions. That said, we're always looking to get better, and so we're planning to make some tweaks to our standardisation algorithms in the coming months to further increase our prediction accuracy.
Second, we've decided that we're going to offer predicted scaled scores in the product. We'll do this for practice SATs initially, and we're doing further research before deciding whether to take a similar approach with our other partner assessments. We're doing this because we found the granularity of the predictions really useful. For example, we think it would help a teacher to know whether a child's Spring 2 maths predicted scaled score is 103 (at which point we can be somewhat confident in the child achieving the expected standard), or just 100 (at which point they are still not secure in their likelihood of hitting the expected standard).
Finally, we want to say a big thank you to the MATs that assisted us in our research. At Smartgrade we really care about the integrity of our service, and we are hugely grateful to our friends and partners who support us in self-reflecting and assessing our own performance so that we can continue to make the product better.
To find out more about our Practice SATs package, book a 30-minute personalised demo with one of our team.