PSLE Analysis Webapp
No data files are uploaded to any server. All analysis is done locally on this device.
Webapp developed by Gongshang Primary School
Within-cohort comparisons are more meaningful than inter-year comparisons.
Reliability
Reliability means consistency.
It tells us whether a student's result reflects their real learning, not luck or chance.
Consistency Across Tasks
Consistency comes from patterns across tasks, not repeating the same test.
We don't check reliability by running the same test twice. Instead, we look for consistent performance across similar kinds of questions or tasks.
If a child can answer questions on the same concept posed in different forms, the result is more trustworthy.
Why Short Tests Are Fragile
More well-chosen items give more stable results.[1]
Short tests are fragile.
A single question or tiny quiz cannot give a stable picture of learning.
A richer mix of well-designed items produces more reliable scores.
When evaluating curriculum-level learning, we prioritise longer, one-time assessments that sample learning broadly and consistently, i.e. we use single-sitting EOY results for longitudinal comparisons.
How bite-sized assessments are weighted to inform progression is a separate policy decision.
Marking Consistency
Clear marking reduces noise.[2]
Clear, specific marking guidelines make scoring more consistent.
When teachers share the same interpretation of the criteria, disagreement drops.
More consistent scoring means higher reliability.
In essence:
A reliable assessment gives similar results across similar questions or scorers.
Consistency builds trust that the score reflects the child's learning and that the score is not random.
Validity
Validity is about interpretation.[3]
A test is valid when the scores truly reflect the kind of learning we claim to measure.
PSLE Validity
For curriculum-level assessments like PSLE, this means:
a. Items must represent the key knowledge and skills in the syllabus;
b. Scoring must reward the intended thinking;
c. Results must be consistent enough across tasks to support the conclusions we draw.
The PSLE can be considered valid to the extent that exam items and school teaching align well with the syllabus.
Inter-Year Comparisons
Within-cohort comparisons are almost always more meaningful than inter-year comparisons.[4]
Inter-year comparison confounds too many variables.[5]
Across different years, (i) the curriculum, (ii) cohort composition, and (iii) exam difficulty all change. Across cohorts, (a) students, (b) teachers, (c) the syllabus and (d) the papers change too. There is no stable reference point, resulting in low validity.
It is more meaningful to interpret test outcomes within the same cohort than to compare PSLE results across years. Validity rests on whether the scores support the interpretation we claim, and year-to-year comparisons do not meet that standard.[6]
Assessment Leadership Implications
Inter-year comparison of results is not a valid method to assess teaching quality.[7] Such comparisons risk misinformed decisions, such as casting doubt on good teachers on the basis of "data-driven" analysis that is in fact invalid.
However, asking about teaching quality is an important leadership question.
If done well, we protect evidence-based leadership that plays to teachers' strengths. If we use the wrong tool, we misuse a measurement that is
(i) not valid for teacher appraisal,
(ii) not statistically significant,[8] and
(iii) a source of more noise than signal,
and we will end up with flawed personnel decisions.
Mathematical Notes
Purpose: These notes provide mathematical foundations for claims made in the Assessment Primer.
They are written for readers with mathematical training but not necessarily statistical background.
All formulas can be verified against standard psychometric texts or the cited AMEE Guide.
Reliability increases with test length; roughly speaking, the signal-to-noise ratio improves with the square root of the number of items (Spearman-Brown prophecy formula):
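In its standard form, with R_1 the reliability of the original test and R_n the reliability of a test n times as long:
R_n = (n × R_1) / (1 + (n − 1) × R_1)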
Worked example:
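As an illustration (assumed figures): suppose a 20-item test has reliability R_1 = 0.70. Doubling it to 40 comparable items (n = 2) gives R_2 = (2 × 0.70) / (1 + 0.70) ≈ 0.82; quadrupling it (n = 4) gives only about 0.90.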
Practical implication: Doubling test length doesn't double reliability—gains diminish. But very short tests (n<20) are especially unstable.
Source: Schuwirth & van der Vleuten (2011), pp. 11-13; Classical Test Theory
Total observed variance decomposes as:
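σ²_observed = σ²_true + σ²_error
where σ²_true is variance in genuine ability across students and σ²_error is variance from marking, item sampling, and other noise.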
Improving inter-rater reliability reduces σ²_error. When two markers assign different scores to the same work, that difference is pure noise.
Quantifying: If marker agreement (Intraclass Correlation Coefficient) is 0.70, then 30% of score variance is unexplained noise.
Source: Schuwirth & van der Vleuten (2011), pp. 10-11; Generalizability Theory
Modern validity theory (Kane, 2006) views validity as a chain of inferences:
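In Kane's framework the chain runs: scoring → generalization → extrapolation → interpretation/decision. The scoring rule must capture the observed performance, the sampled tasks must generalize to the wider domain, the domain must extrapolate to the real-world ability we care about, and the final interpretation or decision must follow from all of the above.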
Critical insight: Validity fails if any link is weak.
A perfectly reliable test (consistent scoring) can still be invalid if it measures the wrong thing.
Example: A spelling test might be perfectly reliable (consistent scores) but invalid for assessing "writing ability" if we claim the scores represent composition skill.
Source: Schuwirth & van der Vleuten (2011), pp. 7-10; Kane's argument-based validity
Expertise research shows performance is highly domain-specific: a student's ability on Topic A poorly predicts ability on Topic B, even within the same subject.
Implication for year-on-year comparison:
If Year 1 paper emphasizes fractions (60% of marks) and Year 2 emphasizes geometry (60% of marks), we're essentially comparing different constructs.
Even if both are labeled "P6 Math," the content sampling makes direct comparison invalid.
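As a point of reference, the 16-25% shared-variance figure quoted in the summary below corresponds to cross-topic correlations of roughly 0.4-0.5, since shared variance is the square of the correlation (r²).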
Source: Schuwirth & van der Vleuten (2011), pp. 3-6; expertise development theory
When comparing Year A to Year B, the observed difference is:
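Written out with one term per source of change (the term labels are illustrative):
Δ_observed = Δ_teacher + Δ_cohort + Δ_curriculum + Δ_difficulty + Δ_error
This is the single equation referred to below; the five terms on the right-hand side are the five unknowns.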
The mathematical problem: We cannot isolate Δ_teacher because:
- All variables change simultaneously
- We have only one observation per year (n=1 for each cohort)
- We have 5 unknowns but only 1 equation
This system is under-identified—there are infinitely many solutions.
Example: If the average score drops 5 points, is it because the cohort was weaker, the paper was harder, the syllabus shifted, the teaching changed, or the drop is simply measurement error?
All scenarios fit the data equally well. We cannot determine which is true.
This is not a statistical inference problem—it's a mathematical impossibility.
Source: Standard linear algebra; identification problem in econometrics
Every score contains measurement error. The SEM quantifies this:
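SEM = SD × √(1 − reliability)
and an approximate 95% confidence interval for a score is score ± 1.96 × SEM.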
Worked example:
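As an illustration, assume a score SD of 10 and reliability of 0.85 (values chosen to be consistent with the interval quoted below). Then SEM = 10 × √0.15 ≈ 3.9, so each 95% confidence interval is roughly ±7.6 points: Year A's mean of 72 spans [64.4, 79.6] and Year B's mean of 68 spans [60.4, 75.6].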
Interpretation: The confidence intervals overlap substantially in range [64.4, 75.6].
The 4-point difference (72 vs 68) could easily be measurement noise, not real change.
Statistical test:
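Under the same assumptions, with 95% confidence intervals of roughly ±7.6 points each, the two intervals stop overlapping only when the means differ by more than about 15 points.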
To claim Year A "outperformed" Year B with confidence, we'd need non-overlapping intervals—typically requiring differences of 15+ points, not 4.
Source: Schuwirth & van der Vleuten (2011), pp. 11-13; Classical Test Theory
Attributing score changes to teacher quality commits the "fundamental attribution error"—overweighting individual factors while underweighting situational factors.
Variance decomposition in student achievement shows:
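In broad terms (component labels are illustrative, and the teacher share follows the 10-15% figure used below):
σ²_achievement = σ²_student + σ²_teacher + σ²_school + σ²_error
with student-level factors typically the largest component and teacher effects around 10-15% of the total.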
Implication: Even if we could isolate teacher effects (we can't—see Note 5), they explain only 10-15% of variance.
Numerical example:
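For illustration, take a score SD of 10 (variance 100), as assumed in the SEM example above. A 10-15% teacher share corresponds to a teacher-attributable variance of 10-15 points², i.e. a teacher-attributable SD of only about 3-4 points.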
Better approaches:
- Value-added models: Track the same teacher across multiple cohorts with statistical controls for student intake ability
- Direct observation: Lesson observations, student work analysis, peer review
- Within-teacher comparison: Compare student growth within one teacher's class
Source: Hattie (2009) Visible Learning; educational effectiveness research literature
Class-level comparisons suffer from small sample size (n ≈ 30-40 students per class).
Power analysis: To detect a "true" 5-point difference at p<0.05 with 80% power:
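Assuming a score SD of about 10 (as in the SEM example above), a 5-point difference is an effect size of d = 0.5, which requires roughly 64 students per group for a two-sample comparison, i.e. close to double a typical class of 35.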
We're underpowered by half. With n=35, we can only reliably detect differences of ~7 points or more.
Minimum detectable difference with n=35:
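Under the same assumptions, the minimum detectable effect for two groups of 35 is d ≈ (1.96 + 0.84) × √(2/35) ≈ 0.67, i.e. roughly 7 points on a scale with SD 10.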
Practical consequence: Most year-on-year differences (typically 2-3 points) are statistically indistinguishable from zero.
Acting on them is acting on noise, not signal.
Source: Standard power analysis; Cohen (1988) Statistical Power Analysis for the Behavioral Sciences
Inter-year comparisons fail mathematically because:
- Under-identification: Confounding variables cannot be isolated (Note 5)—this is solving 1 equation with 5 unknowns
- Measurement error: Error is too large relative to signal (Note 6)—confidence intervals overlap completely
- Insufficient power: Sample sizes are too small for reliable inference (Note 8)—we can't detect differences smaller than 7 points
- Content sampling: Different papers test different constructs (Note 4)—only 16-25% shared variance
- Attribution error: Teacher effects are only 10-15% of total variance (Note 7)—we're overestimating impact by 8×
These are not "challenges to overcome"—they are mathematical impossibilities.
The right tools exist for assessing teaching quality: within-cohort value-added analysis, direct classroom observation, and curriculum alignment reviews.
Inter-year grade comparison is not one of them.
Load Previous State (Optional)
Import a previously exported JSON file to restore teacher assignments and student groupings.
Click or drop session JSON file here
Import teacher and student data from a previous session.
Step 2.1: Upload RES_106
Load the Subject Ranking (RES_106) Excel file from School Cockpit Plus. Data stays on this device.
Click or drop RES_106 Excel file here
Only .xlsx files from SC+ RES_106 report are supported.
How to download the RES_106 file
Log in to School Cockpit Plus at https://schoolcockput.moe.gov.sg (VPN needed if working outside school).
Under SC Applications, select CSR → Subject Analysis for Primary → Ranking.
Choose P6 level and choose all Subjects.
Choose Year I as the current P6 year, Year II as (n-1) and Year III as (n-2).
Choose Results Type I as Prelim, Type II/III as End-of-year Exam.
Unstacked Student Summary
Step 2.2: Upload PSLE
Load the PSLE Results Excel file from MOE systems. This contains AL bands for each subject.
Click or drop PSLE Excel file here
Only .xlsx files from MOE PSLE results are supported.
About PSLE Results File
The PSLE file contains AL bands (Achievement Levels) for each student:
- English Language - Standard (AL 1-8) or Foundation (A/B/C → 6/7/8)
- Mother Tongue - Standard or Foundation, plus Higher MTL grades (D/M/P)
- Mathematics - Standard or Foundation
- Science - Standard or Foundation
- PSLE Score - Sum of all AL bands
- Posting Group - Eligibility: 1 (best), 2, or 3
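For example, ALs of 2 (English), 3 (Mother Tongue), 1 (Mathematics) and 2 (Science) give a PSLE Score of 2 + 3 + 1 + 2 = 8; a lower total indicates stronger performance, since AL 1 is the best band.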
Merged Student Data (PSLE + RES_106)
Step 2.3: Subject Teachers (Optional)
Assign teachers to subjects for each class. This allows filtering by teacher in analysis modules. Skip if not needed.
Step 2.4: Students (Optional)
Configure ability grouping for regular classes and assign students to foundation/MTL groups.
Step 2.5: Student Past Data (Optional)
Add historical class and teacher data from P4 and P5.
Past Classes
Track which classes students were in during P4 and P5.
Click or drop filled past classes template here
Upload the completed Excel file with P4/P5 class assignments.
Past Teachers
Record which teachers taught each subject during P4 and P5.
Click or drop filled past teachers template here
Upload the completed Excel file with P4/P5 teacher assignments.
Item 1.6: Finalise
Finalize your setup and enable analysis modules.
Setup Complete
All required data has been loaded. Click "Finalize Setup" to enable the analysis modules (Item 3 and onwards).
Item 2: PSLE Data Analysis
View course eligibility statistics based on PSLE posting groups.
Loading PSLE data...
Item 3A: Subject Analysis by Class
View AL distribution for each subject by class (e.g., 6 COURAGE, 6 HARMONY).
Assessment
Subjects
Classes
Chart Mode
Select filters above to view results.
Item 3B: Subject Analysis by Pull-out Group
View AL distribution for each subject by pull-out groups (foundation, MTL groups).
Assessment
Subjects
Groups
Chart Mode
Select filters above to view results.
Item 4.1: Past PSLE Data
Upload historical PSLE data for trend analysis. Each file will be analyzed separately, in the same way as Item 3.
PSLE (Loading...)
Click or drop PSLE Excel file here
Upload PSLE data for the previous cohort (GradYear-1).
PSLE (Loading...)
Click or drop PSLE Excel file here
Upload PSLE data for two cohorts ago (GradYear-2).
About PSLE Results File
The PSLE file contains AL bands (Achievement Levels) for each student:
- English Language - Standard (AL 1-8) or Foundation (A/B/C → 6/7/8)
- Mother Tongue - Standard or Foundation, plus Higher MTL grades (D/M/P)
- Mathematics - Standard or Foundation
- Science - Standard or Foundation
- PSLE Score - Sum of all AL bands
- Posting Group - Eligibility: 1 (best), 2, or 3
PSLE Analysis: GradYear-1
PSLE Analysis: GradYear-2
Item 4.2: 3-Year Trend Analysis
Compare PSLE performance across three cohorts.
Complete Item 4.1 to view trend analysis.
Item 4.3: Correlation of ALs
View correlation matrices showing how PSLE and Prelim ALs relate for each subject.
Loading correlation analysis...
Item 4.4: Flow of Results
Visualize individual student performance trajectories across P4 EOY, P5 EOY, P6 Prelim, and PSLE.
Assessment
Subjects
Classes
Loading flow visualization...
Item 4.5: Flow of Cohort Standing
Visualize individual student cohort rank trajectories across P4 EOY, P5 EOY, P6 Prelim, and PSLE.
Assessment
Subjects
Classes
Visualization Enhancements
Loading rank flow visualization...
Item 4.6: P5 Predictability
Evaluate how well P5 EOY performance predicts P6 Prelim ranks and PSLE AL scores.
Classes
Loading predictability analysis...
Item 4.7: Correlations
Create pairplot matrices to explore correlations between any subject@assessment@scoretype combinations.
Classes
Horizontal Axis Variables
Vertical Axis Variables
Add variables to both axes to generate the correlation matrix.
Export State
Export your complete session data to save your work.
Save Your Progress
Export your complete session including all loaded data (RES_106, PSLE, teacher assignments, student groupings, and past data if provided).
You can import this file in "Load Previous State" next time to restore your entire configuration.