PSLE Analysis Webapp

No data files are uploaded. All analysis is done locally on-device.

Webapp developed by Gongshang Primary School


Within-cohort comparisons are more meaningful than inter-year comparisons.

This primer draws on principles described in Schuwirth & van der Vleuten (2011), General Overview of Theories Used in Assessment. Download AMEE Guide #57 (PDF)

Reliability

Reliability means consistency.

It tells us whether a student's result reflects their real learning, not luck or chance.

Consistency Across Tasks

Consistency comes from patterns across tasks, not repeating the same test.

We don't check reliability by running the same test twice. Instead, we look for consistent performance across similar kinds of questions or tasks.

If a child can answer questions on the same concept in different forms, the result is more trustworthy.

Why Short Tests Are Fragile

More well-chosen items give more stable results.[1]

Short tests are fragile.

A single question or tiny quiz cannot give a stable picture of learning.

A richer mix of well-designed items produces more reliable scores.

When evaluating curriculum-level learning, we prioritise longer, single-sitting assessments that sample learning broadly and consistently, i.e. we use single-sitting EOY results for longitudinal comparisons.


How bite-sized assessments are weighted to inform progression is a separate policy decision.

Marking Consistency

Clear marking reduces noise.[2]

Clear, specific marking guidelines make scoring more consistent.

When teachers share the same interpretation of the criteria, disagreement drops.

More consistent scoring means higher reliability.

In essence:
A reliable assessment gives similar results across similar questions or scorers.
Consistency builds trust that the score reflects the child's learning and is not random.

Validity

Validity is about interpretation.[3]

A test is valid when the scores truly reflect the kind of learning we claim to measure.

PSLE Validity

For curriculum-level assessments like PSLE, this means:

a. Items must represent the key knowledge and skills in the syllabus;
b. Scoring must reward the intended thinking;
c. Results must be consistent enough across tasks to support the conclusions we draw.

The PSLE can be considered valid to the extent that exam items and school teaching align well with the syllabus.

Inter-Year Comparisons

Within-cohort comparisons are almost always more meaningful than inter-year comparisons.[4]


Inter-year comparison confounds too many variables.[5]


Across different years, (i) the curriculum, (ii) cohort composition, and (iii) exam difficulty all change. Across cohorts, (a) students, (b) teachers, (c) the syllabus and (d) the papers change too. There is no stable reference point, resulting in low validity.


It is more meaningful to interpret test outcomes within the same cohort than to compare PSLE results across years. Validity rests on whether the scores support the interpretation we claim, and year-to-year comparisons do not meet that standard.[6]


Assessment Leadership Implications

Inter-year comparison of results is not a valid method for assessing teaching quality.[7] Such comparisons risk misinformed decisions, such as casting doubt on good teachers based on "data-driven" analysis that is in fact invalid.

However, asking about teaching quality is an important leadership question.

If done well, we protect evidence-based leadership that plays to teachers' strengths. If we use the wrong tool, we misuse a measurement that is

(i) not valid for teacher appraisal,
(ii) not statistically significant,[8] and
(iii) dominated by noise rather than signal,

and we will end up making flawed personnel decisions.

Mathematical Notes

Purpose: These notes provide mathematical foundations for claims made in the Assessment Primer.

They are written for readers with mathematical training but not necessarily statistical background.

All formulas can be verified against standard psychometric texts or the cited AMEE Guide.

[1] Test length and reliability

Reliability increases with test length, with diminishing returns, as predicted by the Spearman-Brown prophecy formula:

r_new = (n × r_old) / (1 + (n − 1) × r_old)

where:
  n = factor by which test length increases
  r_old = original reliability coefficient
  r_new = predicted new reliability

Worked example:

Original test: 10 items, reliability = 0.60
Double to 20 items (n = 2): r_new = (2 × 0.60) / (1 + (2 − 1) × 0.60) = 1.20 / 1.60 = 0.75
Triple to 30 items (n = 3): r_new = (3 × 0.60) / (1 + (3 − 1) × 0.60) = 1.80 / 2.20 = 0.82

Practical implication: Doubling test length doesn't double reliability—gains diminish. But very short tests (n<20) are especially unstable.
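
As a quick check, the sketch below (plain Python; the function name is just illustrative) evaluates the same formula and reproduces the worked example.

# Spearman-Brown prophecy formula: predicted reliability when a test is
# lengthened by a factor of n (assumes the added items are of comparable quality).
def spearman_brown(r_old, n):
    return (n * r_old) / (1 + (n - 1) * r_old)

# Reproduce the worked example: a 10-item test with reliability 0.60
print(round(spearman_brown(0.60, 2), 2))  # doubled to 20 items -> 0.75
print(round(spearman_brown(0.60, 3), 2))  # tripled to 30 items -> 0.82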

Source: Schuwirth & van der Vleuten (2011), pp. 11-13; Classical Test Theory

[2] Measurement error and scoring variance

Total observed variance decomposes as:

σ²_observed = σ²_true + σ²_error

where:
  σ²_true = variance due to actual differences in student ability
  σ²_error = variance from measurement inconsistency

Inter-rater reliability reduces σ²_error. When two markers assign different scores to the same work, that difference is pure noise.

Quantifying: if marker agreement (the intraclass correlation coefficient, ICC) is 0.70, then 30% of observed score variance is measurement noise.

ICC = σ²_true / (σ²_true + σ²_error)

If ICC = 0.70:
  0.70 = σ²_true / (σ²_true + σ²_error)
  Therefore: σ²_error = (σ²_true / 0.70) − σ²_true = 0.43 × σ²_true
  Error variance is 43% as large as true variance
  As a proportion of total variance: 0.43 / (1 + 0.43) ≈ 30%
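
A minimal Python sketch of the same arithmetic, assuming only the ICC value is known (the function name is illustrative):

# From an intraclass correlation (ICC), derive how large the error variance is
# relative to true variance, and what share of total observed variance it takes up.
def error_shares(icc):
    error_to_true = 1 / icc - 1                  # σ²_error / σ²_true
    error_to_total = error_to_true / (1 + error_to_true)
    return error_to_true, error_to_total

ratio, share = error_shares(0.70)
print(round(ratio, 2), round(share, 2))          # 0.43 0.3 -> 43% of true variance, 30% of total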

Source: Schuwirth & van der Vleuten (2011), pp. 10-11; Generalizability Theory

[3] Validity as inference quality

Modern validity theory (Kane, 2006) views validity as a chain of inferences:

Observation → Score → Universe Score → Domain → Construct

Each step requires justification:
  1. Scoring: Did we score what we observed correctly?
  2. Generalization: Would other similar tasks yield similar scores?
  3. Extrapolation: Do these tasks represent the domain we care about?
  4. Implication: Does the domain measure the construct we claim?

Critical insight: Validity fails if any link is weak.

A perfectly reliable test (consistent scoring) can still be invalid if it measures the wrong thing.

Example: A spelling test might be perfectly reliable (consistent scores) but invalid for assessing "writing ability" if we claim the scores represent composition skill.

Source: Schuwirth & van der Vleuten (2011), pp. 7-10; Kane's argument-based validity

[4] Domain specificity and content sampling

Expertise research shows performance is highly domain-specific: a student's ability on Topic A poorly predicts ability on Topic B, even within the same subject.

Correlation between different content domains ≈ 0.3 to 0.5

This means:
  r² = 0.3² to 0.5² = 0.09 to 0.25
  Only 9-25% of variance in one topic is explained by another
  75-91% of performance variance is topic-specific

Implication for year-on-year comparison:

If Year 1 paper emphasizes fractions (60% of marks) and Year 2 emphasizes geometry (60% of marks), we're essentially comparing different constructs.

Even if both are labeled "P6 Math," the content sampling makes direct comparison invalid.

Example:
  Year 1 paper: 60% fractions, 20% geometry, 20% word problems
  Year 2 paper: 20% fractions, 60% geometry, 20% word problems
  Shared variance ≈ 0.4 × r_within_math ≈ 0.4 × 0.4 = 0.16
  Only 16% of Year 1 performance predicts Year 2 performance
  84% is different content
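
A small Python sketch of the same back-of-envelope calculation; both inputs are assumptions carried over from the illustration above, not measured quantities.

# Rough shared-variance estimate between two papers with different topic weightings.
content_overlap = 0.4       # assumed proportion of overlapping content emphasis
r_within_subject = 0.4      # assumed correlation between different topics within one subject
shared_variance = content_overlap * r_within_subject
print(round(shared_variance, 2))   # 0.16 -> only ~16% of Year 1 performance predicts Year 2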

Source: Schuwirth & van der Vleuten (2011), pp. 3-6; expertise development theory

[5] Confounding variables in inter-year comparison

When comparing Year A to Year B, the observed difference is:

Δ_observed = Δ_teacher + Δ_student + Δ_curriculum + Δ_difficulty + ε

where:
  Δ_teacher = teacher quality change
  Δ_student = cohort ability change
  Δ_curriculum = syllabus/emphasis change
  Δ_difficulty = paper difficulty change
  ε = random error

The mathematical problem: We cannot isolate Δ_teacher because:

  • All variables change simultaneously
  • We have only one observation per year (n=1 for each cohort)
  • We have 5 unknowns but only 1 equation

This system is under-identified—there are infinitely many solutions.

Example: If average score drops 5 points, is it because:

Scenario A: Δ_teacher = −5, all others = 0
Scenario B: Δ_student = −5, all others = 0
Scenario C: Δ_difficulty = −5, all others = 0
Scenario D: Δ_teacher = −2, Δ_student = −2, Δ_difficulty = −1, ε = 0
Scenario E: Δ_teacher = +3, Δ_student = −8, ε = 0
...and infinitely many more scenarios

All scenarios fit the data equally well. We cannot determine which is true.

This is not a statistical inference problem—it's a mathematical impossibility.
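
A minimal Python sketch makes the under-identification concrete: the attributions below are very different, yet all reproduce the same observed 5-point change.

# Every scenario sums to the same observed drop; with one equation and five
# unknowns, the data cannot tell these attributions apart.
scenarios = {
    "A": {"teacher": -5, "student": 0,  "curriculum": 0, "difficulty": 0,  "error": 0},
    "B": {"teacher": 0,  "student": -5, "curriculum": 0, "difficulty": 0,  "error": 0},
    "D": {"teacher": -2, "student": -2, "curriculum": 0, "difficulty": -1, "error": 0},
    "E": {"teacher": 3,  "student": -8, "curriculum": 0, "difficulty": 0,  "error": 0},
}
for name, components in scenarios.items():
    print(name, sum(components.values()))   # each prints -5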

Source: Standard linear algebra; identification problem in econometrics

[6] Standard Error of Measurement (SEM)

Every score contains measurement error. The SEM quantifies this:

SEM = SD × √(1 − reliability)

95% confidence interval = score ± 1.96 × SEM

Worked example:

Year A: mean = 72, SD = 10, reliability = 0.85
  SEM = 10 × √(1 − 0.85) = 10 × √0.15 = 10 × 0.387 = 3.87
  95% CI = 72 ± (1.96 × 3.87) = 72 ± 7.6 = [64.4, 79.6]

Year B: mean = 68, SD = 10, reliability = 0.85
  95% CI = 68 ± 7.6 = [60.4, 75.6]

Interpretation: The confidence intervals overlap substantially in range [64.4, 75.6].

The 4-point difference (72 vs 68) could easily be measurement noise, not real change.

Statistical test:

To claim that Year A truly differs from Year B, non-overlapping CIs are required:
  Either the lower bound of A exceeds the upper bound of B: 64.4 > 75.6 → FALSE
  Or the lower bound of B exceeds the upper bound of A: 60.4 > 79.6 → FALSE

Conclusion: the difference is not statistically distinguishable from zero.

To claim Year A "outperformed" Year B with confidence, we'd need non-overlapping intervals—typically requiring differences of 15+ points, not 4.
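
The same interval check, sketched in Python with the values from the worked example above (the function name is illustrative):

import math

def score_interval(mean, sd, reliability, z=1.96):
    # Standard error of measurement and the resulting 95% confidence interval
    sem = sd * math.sqrt(1 - reliability)
    return mean - z * sem, mean + z * sem

a_low, a_high = score_interval(72, 10, 0.85)   # Year A -> about (64.4, 79.6)
b_low, b_high = score_interval(68, 10, 0.85)   # Year B -> about (60.4, 75.6)

# Distinguishable only if one interval sits entirely above the other
print(a_low > b_high or b_low > a_high)        # False: the 4-point gap is within measurement noise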

Source: Schuwirth & van der Vleuten (2011), pp. 11-13; Classical Test Theory

[7] Attribution error in inter-year comparison

Attributing score changes to teacher quality commits the "fundamental attribution error"—overweighting individual factors while underweighting situational factors.

Variance decomposition in student achievement shows:

σ²_total = σ²_student + σ²_teacher + σ²_school + σ²_error

Typical proportions (from education research):
  σ²_student ≈ 60-70% (prior ability, motivation, home support)
  σ²_teacher ≈ 10-15% (teacher quality)
  σ²_school ≈ 5-10% (school resources, peer effects)
  σ²_error ≈ 10-20% (measurement noise)

Implication: Even if we could isolate teacher effects (we can't—see Note 5), they explain only 10-15% of variance.

Numerical example:

Observed score variance = 100 points²

Variance explained by:
  Students: 65 points² (65%)
  Teachers: 12 points² (12%)
  School: 8 points² (8%)
  Error: 15 points² (15%)

If we attribute all 100 points² to teachers:
  Overestimation factor = 100 / 12 ≈ 8.3
  We are overstating teacher impact at more than 8 times its actual size
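
A small Python sketch of the overestimation arithmetic, using the illustrative variance shares above:

# Illustrative variance shares (as percentages of total observed variance).
variance = {"students": 65, "teachers": 12, "school": 8, "error": 15}

total = sum(variance.values())                 # 100
overstatement = total / variance["teachers"]   # ≈ 8.3
print(round(overstatement, 1))                 # attributing the whole change to teachers
                                               # overstates their role roughly 8-fold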

Better approaches:

  • Value-added models: Track the same teacher across multiple cohorts with statistical controls for student intake ability
  • Direct observation: Lesson observations, student work analysis, peer review
  • Within-teacher comparison: Compare student growth within one teacher's class

Source: Hattie (2009) Visible Learning; educational effectiveness research literature

[8] Statistical power in small samples

Class-level comparisons suffer from small sample size (n ≈ 30-40 students per class).

Standard error of the mean = SD / √n

For n = 35, SD = 10: standard error = 10 / √35 = 10 / 5.916 = 1.69

Power analysis: To detect a "true" 5-point difference at p<0.05 with 80% power:

Required sample size per group:

  n = (Z_α/2 + Z_β)² × 2σ² / Δ²

where:
  Z_α/2 = 1.96 (for α = 0.05, two-tailed)
  Z_β = 0.84 (for 80% power)
  σ = 10 (standard deviation)
  Δ = 5 (effect size to detect)

  n = (1.96 + 0.84)² × 2 × 10² / 5² = 7.84 × 200 / 25 = 62.7 ≈ 63 students per group

We're underpowered by half. With n=35, we can only reliably detect differences of ~7 points or more.

Minimum detectable difference with n=35:

Δ_min = (Z_α/2 + Z_β) × σ × √(2/n)
      = 2.8 × 10 × √(2/35)
      = 2.8 × 10 × 0.239
      = 6.69 points

With only n = 35, differences smaller than about 7 points cannot be reliably distinguished from chance.
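
Both calculations in a minimal Python sketch (the z-values are the standard ones quoted above; function names are illustrative):

import math

Z_ALPHA = 1.96   # two-tailed critical value for α = 0.05
Z_BETA = 0.84    # critical value for 80% power

def required_n_per_group(sd, delta):
    # Sample size per group needed to detect a true difference of `delta`
    return (Z_ALPHA + Z_BETA) ** 2 * 2 * sd ** 2 / delta ** 2

def minimum_detectable_difference(sd, n):
    # Smallest difference reliably detectable with n students per group
    return (Z_ALPHA + Z_BETA) * sd * math.sqrt(2 / n)

print(round(required_n_per_group(10, 5)))               # ~63 students per group needed
print(round(minimum_detectable_difference(10, 35), 1))  # ~6.7 points is the floor with n = 35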

Practical consequence: Most year-on-year differences (typically 2-3 points) are statistically indistinguishable from zero.

Acting on them is acting on noise, not signal.

Source: Standard power analysis; Cohen (1988) Statistical Power Analysis for the Behavioral Sciences

Summary for Leadership

Inter-year comparisons fail mathematically because:

  1. Under-identification: Confounding variables cannot be isolated (Note 5)—this is solving 1 equation with 5 unknowns
  2. Measurement error: Error is too large relative to signal (Note 6)—confidence intervals overlap completely
  3. Insufficient power: Sample sizes are too small for reliable inference (Note 8)—we can't detect differences smaller than 7 points
  4. Content sampling: Different papers test different constructs (Note 4)—only 16-25% shared variance
  5. Attribution error: Teacher effects are only 10-15% of total variance (Note 7)—we're overestimating impact by 8×

These are not "challenges to overcome"—they are mathematical impossibilities.

The right tools exist for assessing teaching quality: within-cohort value-added analysis, direct classroom observation, and curriculum alignment reviews.

Inter-year grade comparison is not one of them.

Load Previous State (Optional)

Import a previously exported JSON file to restore teacher assignments and student groupings.

Click or drop session JSON file here

Import teacher and student data from a previous session.

Step 2.1: Upload RES_106

Load the Subject Ranking (RES_106) Excel file from School Cockpit Plus. Data stays on this device.

Click or drop RES_106 Excel file here

Only .xlsx files from SC+ RES_106 report are supported.

No file loaded

How to download the RES_106 file

Log in to School Cockpit Plus at https://schoolcockput.moe.gov.sg (VPN needed if working outside school).

Under SC Applications, select CSR → Subject Analysis for Primary → Ranking.

Choose P6 level and choose all Subjects.

Choose Year I as current P6 year, Year II as (n-1) and Year III as (n-2).

Choose Results Type I as Prelim, Type II/III as End-of-year Exam.

Unstacked Student Summary

No RES_106 file loaded.

Step 2.2: Upload PSLE

Load the PSLE Results Excel file from MOE systems. This contains AL bands for each subject.

Click or drop PSLE Excel file here

Only .xlsx files from MOE PSLE results are supported.

No file loaded

About PSLE Results File

The PSLE file contains AL bands (Achievement Levels) for each student (a small scoring sketch follows this list):

  • English Language - Standard (AL 1-8) or Foundation (A/B/C → 6/7/8)
  • Mother Tongue - Standard or Foundation, plus Higher MTL grades (D/M/P)
  • Mathematics - Standard or Foundation
  • Science - Standard or Foundation
  • PSLE Score - Sum of all AL bands
  • Posting Group - Eligibility: 1 (best), 2, or 3
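
For illustration only, a minimal Python sketch of the mapping and scoring rule listed above; the field names and the sample record are invented, not the actual column headers in the MOE file.

# Foundation grades map to AL equivalents as listed above; Standard subjects
# already carry an AL from 1 to 8.
FOUNDATION_TO_AL = {"A": 6, "B": 7, "C": 8}

def subject_al(grade):
    # A Foundation grade (A/B/C) maps to its AL equivalent; a Standard AL passes through
    return FOUNDATION_TO_AL.get(grade, grade)

# Hypothetical student: Standard EL, MTL and Science; Foundation Mathematics
student = {"EL": 2, "MTL": 3, "MA": "B", "SC": 4}
psle_score = sum(subject_al(g) for g in student.values())
print(psle_score)   # 2 + 3 + 7 + 4 = 16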

Merged Student Data (PSLE + RES_106)

No PSLE file loaded.

Step 2.3: Subject Teachers (Optional)

Assign teachers to subjects for each class. This allows filtering by teacher in analysis modules. Skip if not needed.

Step 2.4: Students (Optional)

Configure ability grouping for regular classes and assign students to foundation/MTL groups.

Step 2.5: Student Past Data (Optional)

Add historical class and teacher data from P4 and P5.

Past Classes

Track which classes students were in during P4 and P5.

Click or drop filled past classes template here

Upload the completed Excel file with P4/P5 class assignments.

No file loaded

Past Teachers

Record which teachers taught each subject during P4 and P5.

Click or drop filled past teachers template here

Upload the completed Excel file with P4/P5 teacher assignments.

No file loaded

Item 1.6: Finalise

Finalize your setup and enable analysis modules.

Setup Complete

All required data has been loaded. Click "Finalize Setup" to enable the analysis modules (Item 3 and onwards).

Item 2: PSLE Data Analysis

View course eligibility statistics based on PSLE posting groups.

Loading PSLE data...

Item 3A: Subject Analysis by Class

View AL distribution for each subject by class (e.g., 6 COURAGE, 6 HARMONY).

Assessment

Subjects

Classes

Chart Mode

Select filters above to view results.

Item 3B: Subject Analysis by Pull-out Group

View AL distribution for each subject by pull-out groups (foundation, MTL groups).

Assessment

Subjects

Groups

Chart Mode

Select filters above to view results.

Item 4.1: Past PSLE Data

Upload historical PSLE data for trend analysis. Each file will be analyzed separately, in the same way as in Item 3.

PSLE (Loading...)

Click or drop PSLE Excel file here

Upload PSLE data for the previous cohort (GradYear-1).

No file loaded

PSLE (Loading...)

Click or drop PSLE Excel file here

Upload PSLE data for two cohorts ago (GradYear-2).

No file loaded

About PSLE Results File

The PSLE file contains AL bands (Achievement Levels) for each student:

  • English Language - Standard (AL 1-8) or Foundation (A/B/C → 6/7/8)
  • Mother Tongue - Standard or Foundation, plus Higher MTL grades (D/M/P)
  • Mathematics - Standard or Foundation
  • Science - Standard or Foundation
  • PSLE Score - Sum of all AL bands
  • Posting Group - Eligibility: 1 (best), 2, or 3

PSLE Analysis: GradYear-1

No PSLE file loaded.

PSLE Analysis: GradYear-2

No PSLE file loaded.

Item 4.2: 3-Year Trend Analysis

Compare PSLE performance across three cohorts.

Item 4.3: Correlation of ALs

View correlation matrices showing how PSLE and Prelim ALs relate for each subject.

Loading correlation analysis...

Item 4.4: Flow of Results

Visualize individual student performance trajectories across P4 EOY, P5 EOY, P6 Prelim, and PSLE.

Assessment

Subjects

Classes

Loading flow visualization...

Item 4.5: Flow of Cohort Standing

Visualize individual student cohort rank trajectories across P4 EOY, P5 EOY, P6 Prelim, and PSLE.

Assessment

Subjects

Classes

Visualization Enhancements

Loading rank flow visualization...

Item 4.6: P5 Predictability

Evaluate how well P5 EOY performance predicts P6 Prelim ranks and PSLE AL scores.

Classes

Loading predictability analysis...

Item 4.7: Correlations

Create pairplot matrices to explore correlations between any subject@assessment@scoretype combinations.

Classes

Horizontal Axis Variables

Vertical Axis Variables

Add variables to both axes to generate the correlation matrix.

Export State

Export your complete session data to save your work.

Save Your Progress

Export your complete session including all loaded data (RES_106, PSLE, teacher assignments, student groupings, and past data if provided).

You can import this file in "Load Previous State" next time to restore your entire configuration.