Learning Data Insights market research reveals critical infrastructure gaps holding back responsible AI adoption in EdTech
A new market research report from Learning Data Insights finds that while demand for AI-powered education technology is surging, most tools available today still fall short of the quality standards required for responsible classroom use. The report, Not Ready Yet: AI Infrastructure for EdTech Market Research, is based on 15 interviews with 22 key stakeholders across Digital Learning Platform (DLP) providers and R&D teams in the Learning Engineering Virtual Institute, along with a review of 26 reports and documents on AI and education.
“It’s easy to get something from GenAI,” said one EdTech leader with direct experience working with state-of-the-art AI tools. “But when you’re responsible for what ends up in front of a student, the gap between ‘interesting’ and ‘professionally defensible’ is still huge.”
The research surfaces a recurring pattern across EdTech teams: product leaders feel pressure to launch AI features to stay competitive, even as their quality assurance teams repeatedly flag outputs as falling below the quality bar of their existing curriculum and assessment tools. Several teams described pausing or rolling back AI pilots after discovering that reviewing and correcting generated content took more time than producing it manually.
"We are seeing a rush into content generation because it’s easy, obvious, fits into spreadsheet calculations for existing business models, and can appear to be low-risk," noted one DLP executive. Others warned that evaluating AI-generated content can take longer than simply creating that content by hand.
The report identifies five areas where targeted infrastructure investment could unlock higher-quality, more equitable AI deployment in education:
Evaluation & quality assurance tools: Current benchmarks and automated assessment methods are underdeveloped, leaving teams reliant on slow, expensive manual review. Several respondents described QA as the primary bottleneck preventing student-facing deployment.
Privacy & security solutions: Data protection concerns are among the biggest blockers for AI deployment, especially when student data, including audio and video, is involved.
Contextualization & implementation frameworks: LLMs frequently lack knowledge of a student’s background, learning history, and curriculum context, dramatically reducing the relevance of their outputs.
Classroom-optimized Automated Speech Recognition (ASR): Current ASR systems are not designed for the acoustic and linguistic realities of K–12 classrooms, blocking a wide range of promising audio-based applications.
Training & AI literacy programs: The rapid pace of AI change is outrunning educators’ ability to develop the expertise needed to use these tools effectively and critically.
Interviewees consistently described a phased approach to AI adoption: teams start with internal uses such as quality assurance and editorial workflows, expand cautiously to teacher-facing tools, and reserve student-facing applications for last. This sequencing reflects the higher stakes of direct student interaction, where reliability, safety, and trust failures are harder to detect and far more costly. As a result, many providers view student-facing AI not as a near-term feature launch, but as a longer-term goal contingent on significant infrastructure improvements.
The research points to a clear division of labor. Philanthropic funders are well positioned to invest in public goods such as evaluation standards, equity-focused datasets, implementation guidance, and data annotation infrastructure that are essential for responsible AI use but unlikely to emerge through market forces alone. Frontier model providers, meanwhile, can contribute technical expertise, usage guidance, and insight into emerging capabilities. Interviewees stressed that neither sector can address the full range of infrastructure gaps independently.
The report underscores that AI implementation in education cannot be separated from equity. Without deliberate infrastructure choices, AI systems risk reinforcing existing disparities rather than improving learning outcomes, particularly for students from low-resourced families who are least represented in current data and development pipelines.
About the Report
Not Ready Yet was authored by Alexis Andres and John Whitmer of Learning Data Insights (LDI) and draws on interviews with EdTech leaders, product teams, and researchers across the sector. The research was conducted with the support of the Walton Family Foundation. The full report, executive summary, and presentation slides are available at https://osf.io/preprints/edarxiv/ngbkv_v1.
Media Contacts
Alexis Andres, Learning Data Insights
alexis@ld-insights.com
John Whitmer, Learning Data Insights
john@ld-insights.com
###
When members of our research team independently coded the same paper, we expected quick consensus on the basics. Instead, the coders disagreed on a simple question: how many AI models were tested. One said two, another said three, and two said six. All of those answers were defensible given how the paper was written.
This moment highlighted something we had not fully anticipated: even seemingly simple study features can be difficult to classify consistently.
The GenAI Evidence Hub is a structured review of more than 250 studies on generative AI for educational assessment. We are not just reading papers. We are building and stress-testing a shared framework for interpreting what those papers report and how their claims should be evaluated.
As we continuously refine our coding process, several patterns have emerged. Terminology is inconsistent, with “model” used to mean a base architecture, a fine-tuned variant, or a prompting configuration depending on the author. Key results are often buried, as when one paper reported “moderate agreement” for one of six features while glossing over the rest. And results vary widely, with the same model on the same task producing very different outcomes depending on the domain, dataset, or evaluation method.
If experienced researchers struggle to agree on basic study features, it becomes much harder for practitioners to assess vendor claims with confidence.
We are thirteen rounds into calibrating our coding framework and have revised our protocol three times. We are sharing all of it, including what did not work, because that process is part of the evidence.
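For readers wondering how calibration progress like this is typically measured: agreement between coders is commonly summarized with a chance-corrected statistic such as Cohen’s kappa, where conventional interpretation labels roughly 0.41–0.60 as “moderate” agreement and 0.61–0.80 as “substantial.” The sketch below uses entirely hypothetical codes to show the basic computation; it is an illustration, not our actual analysis code.

```python
from collections import Counter

def cohen_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders over the same items."""
    n = len(coder_a)
    # Observed agreement: fraction of items both coders labeled identically.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected chance agreement, from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes for "how many models were tested?" across eight papers.
coder_1 = [2, 3, 6, 2, 1, 4, 2, 3]
coder_2 = [2, 3, 2, 2, 1, 4, 6, 3]
print(f"kappa = {cohen_kappa(coder_1, coder_2):.2f}")  # kappa = 0.67
```

A calibration round, in this framing, is an attempt to push that number up by tightening definitions, not by coders simply deferring to one another.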
For researchers, our preregistration, coding instruments, and full methodology are available on OSF.
For everyone else, you can read the welcome blog and see what we are building.
Contact us with questions or paper suggestions.
The GenAI Evidence Hub for Educational Assessment is a project to systematically analyze more than 250 studies examining generative AI use for educational assessment. We will examine research in automated scoring, formative feedback, and item generation to understand what actually works, what doesn’t, and what the evidence shows. While there’s a lot of AI hype in the market, we’re also finding a treasure trove of research that can help technology developers evaluate the robustness of their solutions and help practitioners make informed decisions about what is working and what questions they should ask. We also hope to provide suggestions and resources that help researchers understand the new methods being used and build shared understanding of how to do research in this rapidly changing field.
The hub will go beyond summarizing findings to systematically coding the methodological details that matter for validity: which models were tested, under what conditions, against which baselines, and using which metrics. We will also document disagreements, false starts, and unanswered questions so the limits of the research can be seen and built upon.
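As one way to picture what that coding looks like in practice, here is a minimal sketch of a single study record. The field names and values are illustrative assumptions on our part, not the hub’s actual instrument (the real coding instruments are on OSF):

```python
from dataclasses import dataclass

@dataclass
class StudyRecord:
    """One coded study in a hypothetical evidence-hub schema (illustrative only)."""
    citation: str
    task: str                  # e.g. "automated scoring", "item generation"
    models_tested: list[str]   # distinct base models, not prompt variants
    conditions: str            # zero-shot, few-shot, fine-tuned, etc.
    baselines: list[str]       # e.g. human raters, classical ML systems
    metrics: dict[str, float]  # e.g. {"QWK": 0.72}

# A made-up example record; every value here is invented for illustration.
record = StudyRecord(
    citation="Doe et al. (2024)",
    task="automated scoring",
    models_tested=["GPT-4", "Llama-3-70B"],
    conditions="few-shot prompting",
    baselines=["human double-scoring"],
    metrics={"QWK": 0.72},
)
```

Pinning down even this small a schema forces the disagreements described above into the open: is a fine-tuned variant a new entry under models tested, or just a different condition?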
We are doing this work using open science practices because we believe that progress accelerates when researchers show their work in progress. Behind every clean conclusion are many ideas that did not pan out, and sharing those paths is part of how fields actually move forward.
This project is philanthropically supported, and all outputs will be open access.
📖 Read our welcome blog.
📬 Subscribe to The Hubdate for monthly progress updates.