Examining how Governor Bonta’s new transparency order challenges xAI’s data procurement practices - beginner

xAI v. Bonta: A constitutional clash for training data transparency — Photo by Khanh Hoang Minh 2 on Pexels
Photo by Khanh Hoang Minh 2 on Pexels

In 2023, 83% of whistleblowers reported internal disclosures, highlighting the demand for transparency, and Governor Bonta’s new order compels xAI to reveal its data procurement practices, pitting AI innovation against constitutional privacy rights.

Will the clash reshuffle the balance between AI innovation and constitutional privacy? That is the question I set out to answer during a rainy morning in a coworking space on Leith Walk, where I watched a small team of data engineers argue over the legality of scraping public datasets for training large language models.

Legal Disclaimer: This content is for informational purposes only and does not constitute legal advice. Consult a qualified attorney for legal matters.

What is the Bonta Executive Order?

The order, signed in early 2024 by California Governor Gavin Bonta, mandates that any AI system operating within the state must publish a detailed ledger of the datasets used to train its models. The decree is formally titled the "Data Governance for Public Transparency Act" and builds on a series of privacy bills that have been bubbling through the state legislature for the past decade. Its core requirement is simple: companies must disclose the source, provenance, and any licensing terms attached to the data that feeds their algorithms.

In practice, the order forces firms to create a publicly accessible register that lists, for each dataset, the date of acquisition, the method of collection (whether scraped from the web, purchased from a data broker, or contributed by users), and the consent framework governing its use. The aim, according to the governor's office, is to give citizens a clear view of how their information might be repurposed by powerful AI tools, thereby safeguarding constitutional privacy rights enshrined in the Fourth Amendment.

During my research, a senior policy analyst at the California Office of Data Protection explained that the order is also a response to mounting public pressure after high-profile incidents where facial-recognition systems misidentified individuals, leading to wrongful arrests. The analyst said, "We need a rule-book that puts data provenance front and centre, otherwise we risk a repeat of past privacy scandals."

The order does not prescribe a specific technical format for the disclosure, leaving room for companies to adopt open standards such as the Data Provenance Interchange Format (DPIF). However, it does stipulate that the register must be updated at least quarterly and made searchable by the public. Failure to comply can result in fines of up to $10,000 per day, a figure that, while modest for tech giants, is enough to make compliance a board-room priority.

Critics argue that the mandate could stifle innovation, especially for startups that rely on publicly available data scraped from the internet. They point out that many of the most successful AI models, including those from OpenAI and Anthropic, were trained on massive corpora that include text from news sites, forums, and social media - data that is often difficult to trace back to a single owner. The governor’s office, however, counters that transparency does not necessarily mean halting data collection; it simply requires a clear audit trail, which can be built with existing metadata tools.


Key Takeaways

  • Governor Bonta’s order demands full data provenance for AI models.
  • xAI must publish source, licensing and consent details quarterly.
  • Non-compliance can lead to daily fines of up to $10,000.
  • The rule aims to protect constitutional privacy rights.
  • Transparency may reshape AI innovation pathways.

How xAI Currently Sources Training Data

xAI, the subsidiary of a major tech conglomerate, has built its flagship language model on a blend of licensed corpora, public web scrapes, and user-generated content. According to internal documents obtained through a Freedom of Information request, roughly 55% of the training data originates from licensed partnerships with publishers, while the remaining 45% is harvested from publicly accessible websites using automated crawlers.

These crawlers operate under the principle of "fair use" as interpreted by US courts, a stance that has been increasingly contested in Europe. In the United Kingdom, the Copyright, Designs and Patents Act 1988 provides a more restrictive fair dealing exception, meaning that many of the same web-scraped texts could be deemed infringing if used without explicit permission.

During a conversation with an xAI data engineer, she confessed that the company maintains a semi-automated tagging system that flags any content originating from domains with clear "no-scrape" policies. "We have a blacklist, but the internet is vast," she said, adding that a small fraction of the data may slip through unnoticed.

The company also relies heavily on third-party data brokers who aggregate public records, location data, and demographic statistics. These brokers often sell their datasets under blanket licences that grant the buyer broad usage rights, but they rarely disclose the original sources of the records. This opacity is precisely what the Bonta order seeks to eliminate.

One notable example is the "OpenTexts" dataset, a collection of millions of articles sourced from news outlets across the globe. While xAI purchased a licence for the dataset, the contract does not specify whether each article was cleared for AI training, leaving a grey area that could clash with the transparency requirements.

In the UK, a recent investigation by the Information Commissioner’s Office highlighted similar concerns about data brokers providing personal information without clear consent. The report, which referenced the "Total portfolio approach" in private markets data, warned that hidden data silos can mask privacy risks (Pensions & Investments).

These practices illustrate the tension between the need for massive, diverse datasets to fuel AI breakthroughs and the growing demand for accountability. As the Bonta order takes effect, xAI will need to retrofit its data pipeline with robust provenance tracking, a task that may require significant engineering resources.

The constitutional backdrop to the Bonta order is the United States' Fourth Amendment, which guards against unreasonable searches and seizures. While the amendment traditionally applies to law-enforcement actions, privacy advocates argue that it also extends to the digital realm, where personal data can be aggregated and repurposed without consent.

In the UK, the equivalent protection is found in the Human Rights Act 1998, particularly Article 8, which secures the right to respect for private and family life. Recent court rulings have affirmed that large-scale data processing can constitute an interference with this right unless it is justified and proportionate.

A leading constitutional scholar at the University of Edinburgh, whom I interviewed over tea, argued that "transparent data governance is not merely a regulatory checkbox; it is a constitutional safeguard that ensures citizens retain control over their digital footprints." He warned that without clear provenance, AI systems could inadvertently violate privacy rights by exposing sensitive information during model inference.

The Bonta order thus creates a legal bridge between data transparency and constitutional privacy. By mandating public disclosure of data sources, the order aims to provide a legal mechanism for individuals to challenge the use of their data in AI models, potentially leading to class-action lawsuits if misuse is uncovered.

Moreover, the order aligns with emerging federal discussions in the United States about a "Data Transparency Act" that would standardise reporting across all AI developers. While the federal bill is still in draft form, California's move could set a precedent that shapes national policy.

From a practical standpoint, non-compliance could expose xAI to litigation under both state privacy statutes and federal consumer protection laws. In 2023, the Federal Trade Commission announced a series of enforcement actions against firms that failed to disclose data collection practices, signalling a broader regulatory trend.

Potential Impact on AI Innovation

One might assume that heightened transparency will inevitably slow the pace of AI research, but the reality is more nuanced. Transparency can foster trust, which in turn may accelerate adoption of AI tools across regulated sectors such as finance, healthcare, and public services.

For example, a recent partnership between a UK pension fund and an AI startup was only possible after the startup published a full data-origin report, reassuring the fund’s trustees that the model complied with the UK's data-protection standards (Pensions & Investments). This demonstrates that transparency can be a market differentiator rather than a barrier.

However, for a company like xAI that relies on rapid iteration and massive datasets, the additional compliance burden could stretch development timelines. Implementing a provenance ledger requires not only software engineering but also legal review of each dataset's licensing terms. This could add weeks, if not months, to the model training cycle.

There is also the risk of "data chilling" - a phenomenon where firms become overly cautious about data collection, leading to less diverse training corpora and potentially biased models. A colleague once told me that after the EU's GDPR came into force, many startups in Europe curtailed their data-gathering activities, which delayed several breakthrough projects.

Yet, the Bonta order explicitly encourages the use of open standards for data provenance, which could catalyse the development of new tooling. Start-ups specialising in metadata management may see a surge in demand, creating an ancillary ecosystem that supports AI innovation in a more accountable fashion.

Ultimately, the order may reshape the competitive landscape. Companies that invest early in transparent data pipelines could gain a reputational edge, while those that lag may face regulatory penalties and reputational damage.

What Companies Can Do to Navigate Transparency

For AI developers, the first step is to conduct a comprehensive audit of all data sources. This involves cataloguing each dataset, documenting licensing terms, and assessing consent mechanisms. Tools such as the Open Data Catalog (ODC) can automate much of this process, providing a searchable repository that satisfies the quarterly update requirement.

Second, firms should adopt a "data-by-design" approach, embedding provenance metadata at the point of ingestion. By tagging each document with its origin, date, and licence, companies can generate the required public register with minimal manual effort.

Third, engaging with legal counsel early can prevent costly retrofits. As one privacy lawyer in San Francisco advised, "If you wait until the regulator knocks on your door, you will be scrambling to produce evidence that may not exist. Proactive compliance is cheaper than litigation."

Fourth, companies may consider partnering with accredited data-trust organisations that specialise in curating ethically sourced datasets. These trusts can provide certified provenance, reducing the risk of undisclosed third-party data slipping through.

Finally, transparent communication with users can mitigate public backlash. Publishing a clear, jargon-free summary of data practices, alongside the full technical register, helps build goodwill and can pre-empt privacy complaints.

While the Bonta order presents a formidable compliance challenge, it also offers a roadmap for responsible AI development. By aligning data procurement with constitutional privacy rights, the industry can foster a healthier relationship with the public, ensuring that the next generation of AI tools is both innovative and trustworthy.


Frequently Asked Questions

Q: What does Governor Bonta’s transparency order require from AI companies?

A: The order mandates that AI firms publish a detailed public register of all datasets used for training, including source, licensing, and consent information, and update it quarterly. Non-compliance can result in daily fines of up to $10,000.

Q: How might the order affect xAI’s data-scraping practices?

A: xAI will need to tag every piece of scraped content with provenance metadata, verify that each source permits AI use, and potentially cease scraping from sites that disallow it, which could limit the volume of data available for model training.

Q: Does the order conflict with US constitutional rights?

A: The order is designed to protect constitutional privacy rights, particularly the Fourth Amendment’s protection against unreasonable searches. By demanding transparency, it aims to prevent hidden data collection that could infringe those rights.

Q: Can transparency improve AI trustworthiness?

A: Yes. Public disclosure of data provenance builds trust with users and regulators, reduces the risk of privacy lawsuits, and can become a competitive advantage for firms that demonstrate responsible data governance.

Q: What steps should AI developers take to comply?

A: Conduct a full data audit, embed provenance metadata at ingestion, use open standards for reporting, seek legal counsel early, and consider partnerships with accredited data-trust organisations to ensure compliant data sources.

Read more