Training Datasets | algorithms.technology

The Human Rights Watch investigation

In June and July 2024, Human Rights Watch (HRW) published findings from investigations into LAION-5B, a dataset containing 5.85 billion image-caption pairs scraped from the public internet and used to train popular AI image generators including Stable Diffusion.

HRW researchers manually reviewed a sample of just 5,850 image links, less than 0.0001% of the total dataset. In that tiny sample, they initially found 170 Brazilian children (June 2024) and 190 Australian children (July 2024). By September 2024, continued investigation had raised the confirmed totals:

362 identifiable Australian children confirmed by September 2024

358 Brazilian children also confirmed in the same period

<0.0001% of the 5.85 billion images in the dataset were sampled
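
The sampling fraction can be checked with a short calculation. This is a minimal sketch in Python that uses only the two figures stated in this section (5,850 reviewed links and 5.85 billion image-caption pairs). The variable names are illustrative, and both inputs are rounded figures, so the result is approximate.

```python
from fractions import Fraction

# Figures as stated in this section (both are rounded in the source).
reviewed_links = 5_850
dataset_size = 5_850_000_000  # 5.85 billion image-caption pairs

# Exact share of the dataset that was manually reviewed.
share = Fraction(reviewed_links, dataset_size)
percent = share * 100

print(f"Share reviewed: 1 in {share.denominator // share.numerator:,}")
print(f"As a percentage: {float(percent):.4f}%")
```

With these rounded inputs the share works out to about one link in a million, or roughly 0.0001% of the dataset. Whether the true figure falls strictly below that threshold depends on the exact dataset size, which this section does not give.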

What the researchers found

The photos captured every stage of childhood:

Newborns, including photos of children being born
Preschoolers at childcare centres and preschools, with full identifying details
Primary school children at Book Week events, swimming carnivals, school activities
Girls in swimsuits
First Nations children from Anangu, Arrernte, Pitjantjatjara, Pintupi, Tiwi, and Warlpiri peoples

A specific example

"Two boys, ages 3 and 4, grinning from ear to ear as they hold paintbrushes in front of a colourful mural." Human Rights Watch, July 2024

The accompanying caption in the dataset revealed both children's full names and ages, and the name of the preschool they attend in Perth. Anyone with access to the dataset could identify these children, know where they go to school, and know what they look like.

Where the photos came from

The sources of the children's photos included:

Personal blogs and photo-sharing sites
School uploads
Professional family photographers' websites
Some photos were no longer publicly discoverable via search engines, but had already been scraped before being taken down
One video from YouTube with "unlisted" privacy settings was scraped despite the platform's own prohibitions

School uploads were a source

HRW explicitly identified school uploads as one of the sources of children's photos in the dataset. This is not a hypothetical risk. Children's photos posted by schools on public platforms have been confirmed to end up in AI training datasets.

Once scraped, it cannot be undone

In December 2023, after the Stanford Internet Observatory found over 1,000 verified instances of child sexual abuse material in LAION-5B, the dataset was taken offline. LAION released a cleaned version called Re-LAION-5B in August 2024, which removed CSAM links and the children's photos identified by HRW. But HRW was clear about the limitation:

"AI models that were trained on the earlier dataset cannot forget the now-removed images." Human Rights Watch, September 2024

AI models don't store individual photos. They learn patterns from millions of images during training. Once a child's photo has been processed into a model's weights, there is no way to isolate or delete what the model learned from it. The patterns learned from that child's face, body, and identifying information are permanently embedded in every model trained on the dataset.

This means:

Deleting the photo from Facebook doesn't remove it from models already trained
LAION removing the link from its index doesn't affect models already trained on the older version
There is no mechanism to "un-train" a model on specific images
The only protection is prevention: not making the photo publicly accessible in the first place

LAION blamed parents

When confronted with the findings, LAION's response was to shift blame to families:

"Any information obtained by Human Rights Watch is publicly available, though for some reason unknown to us, they would like to pretend it is not." LAION, in response to Human Rights Watch, 2024

And further:

Parents should "behave responsibly and not post private sensitive data related to their children on [the] public internet, where it can be easily collected." LAION, 2024

The organisation that scraped billions of images, including photos of three-year-olds at preschool, says parents should have known better. Many of these photos were not even posted by parents. They were posted by schools, in good faith, using standard processes.

The tech companies built the systems, scraped the data, and then blamed families. That is why every parent and every school needs to understand what is happening, so we can protect our children together.

The academic evidence

The largest collection of children's public images

Researchers from the University of Utah and Carnegie Mellon University analysed approximately 18 million Facebook posts by US schools and school districts and found:

An estimated 4.9 million posts included identifiable images of students
Approximately 726,000 posts also included students' first and last names and approximate location

"The posts we studied may represent the largest existing collection of publicly accessible, identifiable images of minors. It is likely that the photos are being accessed by a range of actors, including government agencies, predictive policing companies, and those with nefarious intent." Rosenberg et al., University of Utah / Carnegie Mellon University

Violating children's rights

A 2024 peer-reviewed study published in Computers and Education Open found that schools across the UK, US, Australia, and Europe are publishing children's images online without adequately protecting their rights, including:

Violating Article 16 of the UN Convention on the Rights of the Child, the protection against arbitrary interference with privacy.
Violating Article 12, the right to be heard: children are excluded from decisions about their own images.
Violating Article 3, the best interests standard: consent forms provided to parents contain minimal information about potential harms.

What has happened since

The problem has not slowed down. If anything, the evidence of harm has escalated:

Amazon found child sexual abuse material in its AI training data (January 2026). Amazon reported hundreds of thousands of suspected CSAM items found in external data gathered for AI model development, but refused to disclose the source.
Tennessee teenagers sued xAI over AI-generated CSAM (March 2026). Three high school students filed a class action alleging that xAI's Grok tool was used to generate sexually explicit images of them from ordinary school photos.
Brazil passed a landmark child protection law (September 2025). The ECA Digital requires technology companies to design products with children's best interests in mind and provide the highest levels of privacy by default, with fines up to 10% of Brazilian revenue.
Australia's Children's Online Privacy Code is in progress. The OAIC released an exposure draft on 31 March 2026 for public consultation, with the Code due by 10 December 2026. HRW has called for the Code to prohibit scraping children's personal data for AI.

HRW's recommendations

Human Rights Watch recommended that the Australian Government:

Adopt child data protection laws (the Children's Online Privacy Code, now in exposure draft)
Prohibit scraping children's personal data into AI systems
Ban non-consensual digital replication of minors' likenesses
Provide mechanisms for harmed children to seek justice

Until those protections exist, the only defence schools have is to stop making children's photos publicly accessible.

