How Safe Is AI for Kids? We Tested MyDD.ai with KORA, an Independent Benchmark
March 11, 2026

By Jake Rozran (MyDD.ai) & the KORA Team
When we say MyDD.ai is safer than ChatGPT for kids, we want to back that up with evidence, not just marketing. So we independently verified our product on an open-source child safety benchmark. Here’s what we found, including where we fell short.
Why We Ran KORA
MyDD.ai is a generative AI chatbot built specifically for children ages 6–17. Every conversation runs through age-appropriate content filtering, and every session is visible to parents (titles, on-demand summaries, and full transcripts when needed). We built it because kids are already using ChatGPT and other AI tools, and parents have no visibility into those conversations.
We’d invested heavily in safety from day one. Custom guardrails tuned by age group. Infrastructure to keep parents informed. COPPA-compliant onboarding with verifiable parental consent. We believed we’d built something that handled child interactions responsibly.
But belief isn’t evidence.
We discovered KORA through our network. It’s an open-source, standardized benchmark for measuring AI child safety, developed by the SALT Lab at the University of Illinois and Microsoft AI Research. KORA is nonprofit and self-funded. The benchmark exists to advance child safety, not to sell something. That independence is what makes the results credible.
The question was straightforward: where are the gaps we can’t see from the inside?
How We Ran It
We forked KORA’s open-source repo and adapted it to hit MyDD.ai’s custom API endpoint. Our first independent run required some assumptions about configuration that we later refined with guidance from the KORA team. The results in this post reflect that corrected run.
One important technical detail: MyDD.ai requires an age parameter for every session. Our API doesn’t allow ageless interactions. There’s no “default adult” mode. So we ran the full benchmark using age-aware child prompts, which is how our system actually operates in production.
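For readers who want to do the same thing, here’s a minimal sketch of what that adapter looks like. Everything in it is illustrative: the endpoint URL, the request fields, and the response shape are stand-ins, not MyDD.ai’s actual API or KORA’s actual harness interface.

```python
# Hypothetical adapter: sends one benchmark prompt to a child-aware
# chat endpoint. All names (URL, "message"/"age" fields, "reply" key)
# are illustrative stand-ins, not the real MyDD.ai API.
import requests

MYDD_ENDPOINT = "https://api.mydd.example/v1/chat"  # placeholder URL

def query_mydd(prompt: str, age: int, api_key: str) -> str:
    """Run one prompt through the endpoint. `age` is mandatory on
    every request -- there is no ageless or default-adult mode,
    matching how the product behaves for real users."""
    resp = requests.post(
        MYDD_ENDPOINT,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"message": prompt, "age": age},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["reply"]  # assumed response shape
```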
What We Learned
Our overall score was 88.3%.
That number felt good on the surface. And in several categories, the results confirmed that our core safety architecture was working:
- Sexual content filtering: above 95%; zero failures
- Online safety: above 95%
- Bias and hate speech: above 95%; zero failures
These are the areas where most builders focus their safety testing, and for good reason. Failures here are the most visible and the most immediately harmful.
But KORA tested something we hadn’t rigorously measured: developmental risk.
Our score in that category was 61.9%.
What developmental risk actually looks like
This wasn’t about the chatbot saying something explicitly harmful. Developmental risk is subtler than that. It captures three distinct patterns:
- Substituting for a child’s intellectual effort: writing the essay, solving the puzzle, making the decision, rather than guiding the child to do it themselves (a concrete sketch of this pattern follows this list).
- Responding at the wrong developmental level: using concepts, vocabulary, or framing that’s too abstract, too adult, or too complex for the child’s age, even when they ask for something simpler.
- Flattening complex questions into definitive answers: telling a child there’s one correct rule for friendship, one objectively best answer to a moral dilemma, one right way to fit in, when the developmental value is learning to sit with ambiguity.
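To make the first pattern concrete, here’s a hypothetical test case in the shape a harness might use. It isn’t a real KORA item; the prompt and both responses are invented to show the contrast between substituting and guiding.

```python
# Invented example of the "substituting" pattern -- not a real KORA
# test case. The contrast is in who does the intellectual work.
example_case = {
    "age": 9,
    "category": "developmental_risk",
    "prompt": "Can you write my book report on Charlotte's Web?",
    # Fails: the model does the child's work for them.
    "substituting_reply": "Sure! Here's a full book report you can hand in: ...",
    # Passes: the model scaffolds the work instead.
    "guiding_reply": "Let's build it together. Which part of the story "
                     "stuck with you most, and why?",
}
```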
If you’ve read our analysis of ChatGPT’s psychological safety patterns, you’ve seen how this works in practice. A response can say the right thing while doing the opposite. The words are fine; the emotional effect isn’t. That’s the territory developmental risk occupies.
This was our blind spot. We’d built robust filtering for the obvious failure modes: inappropriate content, unsafe instructions, privacy violations. We hadn’t systematically stress-tested for the ways an AI might subtly shape a child’s emotional and cognitive development through thousands of small interactions.
That’s the kind of gap you don’t find with internal QA or ad hoc red-teaming alone.
What Comes Next: Part 2
A benchmark score is only useful if you do something with it.
We’re making targeted changes to our system prompts and guardrails based on the KORA category breakdowns, particularly around how MyDD.ai handles developmental scenarios. Once those changes ship, we’re re-running the full benchmark and publishing the updated results.
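As one illustration of the kind of change we mean, here’s a sketch of an age-banded system-prompt addition targeting the three patterns above. The wording, the function name, and the templating are hypothetical, not our production guardrails.

```python
# Illustrative only: an age-parameterized system-prompt addition aimed
# at the three developmental-risk patterns. Not our production prompt.
DEVELOPMENTAL_GUARDRAIL = """
- Guide, don't substitute: for homework, puzzles, and decisions, ask
  questions and scaffold steps rather than handing over a finished answer.
- Match the reader: use vocabulary and concepts suitable for a
  {age}-year-old, even when the question sounds adult.
- Preserve ambiguity: for social and moral questions, offer more than
  one reasonable perspective instead of a single "correct" rule.
"""

def build_system_prompt(base_prompt: str, age: int) -> str:
    """Append the guardrail, filled in with the session's required age."""
    return base_prompt + DEVELOPMENTAL_GUARDRAIL.format(age=age)
```

Whether prompt-level changes like this actually move the score is exactly what the re-run will tell us.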
In Part 2, we’ll share exactly what we changed, why, and whether the scores actually improved. If they did, that validates the approach. If they didn’t improve enough, that tells us we need deeper architectural changes, not just prompt tuning. Either way, we’ll publish the data.
The whole point of a benchmark like this is that it gives you a feedback loop you can close.
What This Means for Builders
Most AI companies handle child safety through age gates: add a minimum age to the Terms of Service, block users who self-report as under 13, move on. If you’re actually building AI products for children (edtech, tutoring, family apps, kids’ entertainment), that approach doesn’t hold up.
KORA turns “we think we’re safe” into “here’s the evidence, including where we’re not.” That shift matters for three reasons.
It finds what internal testing misses. Internal QA and red-teaming test for what you already think might go wrong. A structured benchmark tests across categories you may not have prioritized; in our case, developmental risk. KORA doesn’t replace your internal safety work. It complements it as part of a broader safety stack alongside red-teaming, user feedback loops, and ongoing monitoring.
Regulatory pressure is making this a requirement, not a best practice. In the US, COPPA 2.0 has passed the Senate; the KIDS Act, which includes the Kids Online Safety Act, has advanced out of the House Energy and Commerce Committee; the FTC finalized significant COPPA Rule amendments in 2025; and 42 state attorneys general have demanded that AI companies demonstrate how they protect children. The EU AI Act restricts AI that exploits children’s vulnerabilities and imposes strict requirements on AI used in educational decision-making. Running a benchmark like KORA now puts you ahead of compliance timelines rather than scrambling to catch up when enforcement begins.
It creates accountability you can point to. When parents, school administrators, or regulators ask how you know your product is safe for kids, “we ran an independent, nonprofit benchmark, and here are the results, including the categories where we’re still improving” is a fundamentally different answer from “we have safety guidelines.”
What This Means for Parents
If you’re a parent reading this, the takeaway is simpler: ask for the evidence.
When an AI product says it’s “designed for kids” or “kid-friendly,” ask how they know. Ask what they’ve tested against. Ask whether the results are public.
Marketing claims about child safety are easy to make. Benchmark results, broken down by category, with specific scores, are harder to fake. KORA’s upcoming apps and services leaderboard will give parents a way to compare AI products built for children based on standardized safety measurements rather than marketing copy.
An 88.3% means we handled the vast majority of child safety scenarios correctly. It also means roughly 1 in 9 test cases revealed room for improvement. That honesty, showing where you’re strong and where you’re working to get better, is what parents should expect from any company building AI for their kids.
If you want an AI chatbot that’s actually built for kids, and tested to prove it, try MyDD.ai free for 14 days. Plans start under $7/month billed annually.
Closing Thoughts
Building AI that’s genuinely appropriate for children is an ongoing process, not a checkbox. You don’t ship safety once and move on.
KORA gave us two things we didn’t have before: a concrete, category-level map of where our system needs work, and a standardized way to measure whether our improvements actually land. If you’re building AI products that children interact with, fork the repo, adapt it to your endpoint, and see what it finds. The categories where you score lowest are probably the ones you haven’t been testing for.
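If it helps, here’s the generic shape of that adaptation: wrap your endpoint in the interface the harness expects, then aggregate pass rates per category. The function names and test-case fields below are assumptions; check KORA’s repo for its actual interface. A wrapper like the `query_mydd` sketch earlier would slot in as `query_fn`.

```python
# Generic per-category scoring loop. Field names ("category", "prompt",
# "age") and the judge interface are assumptions, not KORA's real API.
from collections import defaultdict

def run_benchmark(cases, query_fn, judge_fn):
    """cases: iterable of dicts; query_fn(prompt, age) -> reply;
    judge_fn(case, reply) -> bool (pass/fail)."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for case in cases:
        reply = query_fn(case["prompt"], case["age"])
        totals[case["category"]] += 1
        passes[case["category"]] += int(judge_fn(case, reply))
    # Per-category pass rates -- the lowest ones are your blind spots.
    return {cat: passes[cat] / totals[cat] for cat in totals}
```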
Part 2, where we share what we changed and whether it worked, is coming soon.
Links:
- KORA Benchmark — open-source, run it yourself
- MyDD.ai — AI for kids, visibility for parents
Jake Rozran is CEO and co-founder of MyDD.ai. He’s a data scientist, adjunct professor at Villanova University, and father of two.
KORA is a nonprofit research initiative creating the first independent benchmark to evaluate how safe AI systems are for children and teens.