Anonymized and Synthetic Data Clauses in AI Contracts

Anonymized and de-identified data carve-outs used to be the part of the contract that nobody fought over. They sat at the back of the privacy section and let vendors do their analytics. That world is gone. Data is the fuel for AI, the scrapable internet has run out, and the data sitting behind your walls is now the asset your counterparty wants most. The clauses you used to mark up in two minutes are now the deal point.

How to Contract hosted Shannon Yavorsky, Co-Head of Orrick's Cyber, Privacy and Data Innovation practice and Co-Head of its AI practice, for the third installment in our AI contract clause series. Laura Frederick worked through AI-drafted definitions and obligations for anonymized data, de-identification, and synthetic data, and Shannon pressure-tested each one from both the vendor and the customer side. Her vantage point made the conversation useful because she sees the regulatory side, the technical side, and the commercial monetization fight at the same time.

The discussion covered why these provisions have become a battleground over data monetization and competitive advantage, the regulatory mismatch between privacy-era safe harbors and AI use cases, the gap between "industry standard" language and any actual standard, why synthetic data is rarely as clean as the contract claims, and the IP and quality questions the typical synthetic data clause leaves unresolved.

Here are our top ten takeaways from the speakers' comments during the webinar:

Treat the data carve-out as the deal point, not a footnote. Vendors used to ask for anonymized data rights in the privacy section and nobody fought over it. That era is over. Data is the asset your counterparty wants, and anonymization or synthetic data carve-outs are now a primary monetization route. If we are still drafting these provisions as if they do not matter commercially, we are leaving real value on the table.
Anonymization is a regulatory term that does not translate cleanly to AI use cases. The GDPR and HIPAA standards were built for privacy compliance, not for training large language models. Re-identification risk in an AI context is materially higher than in the world the safe harbors were designed for. We should not assume that meeting the regulatory definition means the data is actually safe in an AI workflow. The lawyer who accepts a HIPAA-derived definition without scrutinizing what AI capabilities can do to that data is importing a standard built for a different problem.
Pick a specific standard or you have not picked one. "Generally accepted industry standards" points at nothing. There is no settled industry standard for anonymization, and a court will apply whatever it wants after the fact. Reference the HIPAA expert determination method, a NIST de-identification framework, or a specific GDPR regulator's view, but pick something. Vagueness hurts both sides.
Watch reasonableness clauses for the timing problem. A definition that hinges on what is reasonable to expect at the time of dispute exposes the vendor to a moving standard as re-identification techniques improve. Anchor reasonableness to the time of processing, with the technical means then available, and exclude capabilities developed thereafter. That fixes a defect in almost every AI-drafted definition we are seeing. Customers should not accept this fix without negotiating something in return, because the practical effect is to lock in whatever level of protection existed on day one. Both sides need to be honest about the trade.
The party doing the anonymization should not also be the one certifying it. Self-certification and self-assessment clauses look efficient on paper and undermine the entire point of the obligation in practice. The party with a commercial interest in expanding the scope of "anonymized" data is the same party deciding whether the standard was met. That is not a workable arrangement when the dataset matters. The customer needs an independent verification path, whether that is a third-party statistician, a contractual audit right, or both. Push for it on any dataset where the underlying privacy risk is real.
De-identifying data may make your processor a controller. Some European regulators treat de-identification itself as a controller activity. A processor who anonymizes the dataset on the controller's behalf may end up promoted to controller status by virtue of the activity itself. That changes the regulatory posture of the original controller in ways nobody intended, and it exposes them to risks they did not contract for. This is the kind of point that does not show up in the standard contract checklist. It can change how you draft the entire data clause, particularly the rights you grant to a processor.
Synthetic data is not automatically clean. Generative models can memorize and reproduce specific records from training data. They can carry distributional fingerprints that re-identify individuals in small populations. A definition that says synthetic data does not constitute personal data is making a technical claim that may not be true. Push back when the dataset is sensitive.
Synthetic data raises ownership questions that one-line definitions cannot resolve. When the vendor's model generates synthetic data from the customer's source data, both sides have a plausible claim. A clause that simply assigns all synthetic data to the vendor is not negotiating ownership; it is asserting it. We should treat that line as the start of a conversation, not the end. The right move is to work through what each side actually contributed and what each side needs to do with the output, then negotiate the rights grant against that. Anything else leaves real value on the table for the customer.
Quality obligations are missing from most synthetic data clauses. A customer accepting synthetic data in place of real data for downstream use needs the vendor to commit that the synthetic output is statistically representative. The standard definition imposes no such obligation. The customer who builds a model or runs an analysis on synthetic data without that commitment is taking on quality risk they probably did not price. Negotiate the obligation directly when it matters. Even a soft commitment to maintain statistical fidelity to the source is better than the silence that lives in most templates.
Push for purpose limitations and post-termination limits on synthetic data generation rights. "Reasonably related to vendor's provision of services" expands as the vendor's services expand. List the permitted purposes in an exhibit, require notice for methodology changes, and limit post-termination use to the original purposes with a ban on sale or licensing. Every one of those changes is achievable with the right framing. Each one closes a real exposure that the templated language leaves open. This is one of the fights worth picking even on a deal where you are otherwise inclined to keep the markup light.
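
For readers who want a feel for what an independent reviewer might actually measure when testing an "anonymized" dataset, here is a minimal Python sketch of a k-anonymity check. It is an illustration only, not something presented in the webinar, and not the HIPAA expert determination method or any regulator's test. The field names (zip3, birth_year, sex) and the sample records are hypothetical. The idea is simple: if any combination of indirect identifiers appears only once in the dataset, that record is a candidate for re-identification, which is the small-population problem described in the takeaways above.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def smallest_group_size(records, quasi_identifiers):
    """Return k: the size of the smallest group sharing the same quasi-identifiers."""
    if not records:
        raise ValueError("records must not be empty")
    return min(group_sizes(records, quasi_identifiers).values())


def unique_combinations(records, quasi_identifiers):
    """Return the quasi-identifier combinations that appear exactly once."""
    sizes = group_sizes(records, quasi_identifiers)
    return [combo for combo, n in sizes.items() if n == 1]


# Hypothetical sample data: direct identifiers already removed.
records = [
    {"zip3": "941", "birth_year": 1980, "sex": "F"},
    {"zip3": "941", "birth_year": 1980, "sex": "F"},
    {"zip3": "941", "birth_year": 1980, "sex": "M"},
    {"zip3": "100", "birth_year": 1975, "sex": "M"},
    {"zip3": "100", "birth_year": 1975, "sex": "M"},
]

fields = ["zip3", "birth_year", "sex"]
k = smallest_group_size(records, fields)
singles = unique_combinations(records, fields)

# Dropping a field is one (lossy) way to raise k.
k_without_sex = smallest_group_size(records, ["zip3", "birth_year"])

print("k with all fields:", k)
print("unique combinations:", singles)
print("k without sex:", k_without_sex)
```

A metric like this is only one input. It says nothing about what an attacker can do with outside data or with a model trained on the records, which is why the contract should name the standard and the verification path rather than rely on a single number.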

These AI clause webinars are stress tests for the kind of language showing up in vendor templates right now, and the recap series captures the takeaways for the lawyers who could not attend live.