Collecting Evidence

Deciding on a sample size: How much data is enough?

The goal of assessing program learning outcomes is to inform potential improvements to the program, for the benefit of future students. So when we draw conclusions based on these assessments, we are generalizing from a sample of current students to the population of students who could potentially enroll in the program in the near future. Ultimately, your goal should be to gather a sample that will provide strong enough evidence to justify potential changes to your program – acknowledging that costlier changes require a bigger sample size and higher standard of evidence.

Practical sampling strategies

Here are some sampling strategies to consider, taking into account the size of your academic program. The sample sizes suggested here would be appropriate to inform moderately time-consuming programmatic changes. A more in-depth rationale behind the suggested sample sizes can be found at the bottom of this page.

Program context: Sampling strategy
Smaller programs (<10 graduates per year): Programs with small numbers of graduates each year are best served by sampling all student work from a given assignment over a four-year span, aiming for a sample of 30-40 pieces of student work.
Larger programs (10 or more graduates per year):

Simple random sampling: this strategy involves defining a “sampling frame” – i.e., all the student work from a given assignment that addresses an outcome over a given time frame – and then randomly choosing a sample from that body of student work. For example, a program might have a capstone research paper assignment that addresses a written communication outcome, with 60 senior papers available from the past two years. From these 60 papers, faculty might select a simple random sample of 30-40.
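
As a rough illustration of the capstone example above, a simple random sample could be drawn with Python's standard `random` module. This is a minimal sketch, not a prescribed procedure: the paper identifiers, the sample size of 35, and the seed value are all hypothetical placeholders.

```python
import random

def simple_random_sample(sampling_frame, sample_size, seed=None):
    """Draw a simple random sample, without replacement, from a sampling frame."""
    frame = list(sampling_frame)
    # If the frame is no larger than the target, review every piece of work.
    if sample_size >= len(frame):
        return frame
    # A fixed seed makes the draw reproducible, so the selection can be documented.
    rng = random.Random(seed)
    return rng.sample(frame, sample_size)

# Hypothetical sampling frame: 60 senior capstone papers from the past two years.
papers = [f"paper_{i:02d}" for i in range(1, 61)]

# Select 35 papers, within the 30-40 range suggested above.
selected = simple_random_sample(papers, 35, seed=2024)
print(len(selected), "papers selected")
```

Recording the seed alongside the list of selected papers lets a colleague reproduce the same draw later.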

Stratified random sampling: Larger programs that have access to data on students’ demographic characteristics and/or academic performance may want to take stratified random samples of student work. With this approach, step one is grouping students into categories – e.g., by race/ethnicity, or by tiers defined by GPA – and step two is drawing a random sample from each category, such that the samples add up to the desired total sample size. Depending on the goals of the assessment, these samples could be drawn in two different ways:

  • Proportional sampling: the amount of student work drawn from each category is proportional to the relative size of the category – e.g., if 40% of seniors in a program identify as male, then 40% of work included in the final sample will come from male students. The advantage of proportional sampling is that it reduces the variance in assessment results due to the categories used (such as race/ethnicity).
  • Oversampling: the number of samples drawn from some categories is disproportionately large compared to their size. Programs may want to oversample, for instance, when assessing the learning of students from minoritized groups who make up a relatively small proportion of the program’s population.
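
The two ways of drawing a stratified sample described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not a recommended implementation: the group labels, group sizes, the total of 30, and the oversampled allocation are all hypothetical, and the rounding rule (largest remainder) is just one reasonable way to make the per-category counts add up to the target.

```python
import random
from collections import defaultdict

def proportional_allocation(stratum_sizes, total):
    """Split `total` across strata in proportion to their size.

    Uses largest-remainder rounding so the counts sum exactly to `total`.
    """
    population = sum(stratum_sizes.values())
    exact = {k: total * n / population for k, n in stratum_sizes.items()}
    alloc = {k: int(v) for k, v in exact.items()}  # round down first
    leftover = total - sum(alloc.values())
    # Give the remaining slots to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc

def stratified_sample(records, stratum_of, allocation, seed=None):
    """Draw a random sample from each stratum according to `allocation`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[stratum_of(record)].append(record)
    sample = []
    for stratum, n in allocation.items():
        members = groups.get(stratum, [])
        # Never request more work than a stratum actually contains.
        sample.extend(rng.sample(members, min(n, len(members))))
    return sample

# Hypothetical frame of 60 papers, each tagged with a (made-up) category label.
records = (
    [(f"a_{i}", "group_a") for i in range(24)]
    + [(f"b_{i}", "group_b") for i in range(30)]
    + [(f"c_{i}", "group_c") for i in range(6)]
)
sizes = {"group_a": 24, "group_b": 30, "group_c": 6}

# Proportional sampling: each group's share of the sample matches its share of the frame.
proportional = proportional_allocation(sizes, 30)

# Oversampling: a hand-chosen allocation that includes all six papers from the small group.
oversampled = {"group_a": 10, "group_b": 14, "group_c": 6}

prop_sample = stratified_sample(records, lambda r: r[1], proportional, seed=1)
over_sample = stratified_sample(records, lambda r: r[1], oversampled, seed=1)
```

One caveat on the oversampled version: because small groups are deliberately over-represented, program-wide percentages computed from that sample would generally need to be weighted back to each group's actual share of the program.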

If your program decides to take a stratified random sampling approach, please reach out to our office and we can provide guidance on constructing these samples.

Rationale behind sample sizes: Margin of error

When considering how big this sample should be, consider the overall question that guides the inquiry, and how the results will be applied. If you are planning a programmatic overhaul, then you’ll probably want a larger sample with a lower margin of error. (The margin of error represents the degree of uncertainty in a quantitative estimate – such as the percentage of students meeting a program’s standards for an outcome – due to sampling from a larger population. A useful tutorial and calculator for the margin of error can be found here.) For all but the largest programs, in practice this would mean gathering work from all or most graduating seniors over the course of several years.

However, the larger the sample, the greater the commitment of time required from faculty. A smaller sample – e.g., 30 pieces of student work – is acceptable if the results will be used to inform minor curricular adjustments. For smaller programs that have fewer than 10 students per cohort, a good practice is to gather direct evidence from all seniors over a three- or four-year span. And regardless of the size of the program, quantitative statistics (e.g., the percentage of student projects scoring at least proficient on a dimension of an outcome, or an average score on such a dimension) are best used in combination with qualitative evidence regarding student experiences and teaching approaches in the program, which can illuminate the practices underlying student performance.
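
To make the trade-off between sample size and margin of error concrete, the sketch below estimates the margin of error for a percentage, such as the share of student work rated at least proficient. It rests on assumptions that are not stated on this page: the common normal approximation, a 95% confidence level (z = 1.96), and the most conservative case where half of the work meets the standard. The optional finite population correction applies only if you treat a fixed body of student work, rather than future students, as the population.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96, population_size=None):
    """Approximate margin of error for an estimated proportion.

    proportion=0.5 is the worst case (widest margin); z=1.96 corresponds
    to a 95% confidence level under the normal approximation.
    """
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    if population_size is not None and population_size > 1:
        # Finite population correction: the margin shrinks as the sample
        # covers more of a fixed, known body of student work.
        standard_error *= math.sqrt(
            (population_size - sample_size) / (population_size - 1)
        )
    return z * standard_error

# Compare a few sample sizes, reporting the margin in percentage points.
for n in (30, 40, 100):
    print(f"n = {n:3d}: about +/- {100 * margin_of_error(n):.0f} percentage points")
```

Under these assumptions, a sample of 30 gives a margin of roughly 18 percentage points, a sample of 40 roughly 15, and a sample of 100 roughly 10, which is one way to see why larger changes call for larger samples.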