Drafting Data Definitions in AI Contracts That Hold Up

Most of us learned to read AI contracts the way we read SaaS contracts. Find the data clauses, check the train-on-customer-data restriction, push back on the usage rights, move on. The problem is that the protective clauses we rely on all sit on top of a definition. And in AI contracts, the definitions of input data, output data, and system data are doing more work than they ever did in traditional software, because the technology behind them is iterative, evolving, and pulling in data from places we never used to have to think about.

This webinar was the first in How to Contract's new provision makeover series. Host Laura Frederick brought on David Sclar, Principal Product Counsel at Babylist, to take the provider side, and Laura Belmont, General Counsel at The L Suite (TechGC), to take the customer side. David has spent a decade in-house focused on tech, privacy, and product. Laura Belmont previously built the AI governance program at a data science and AI company, and that combination of vendor-side and buy-side experience showed up throughout the conversation.

We worked through three AI-slop definitions on screen. Each one looked harmless. Each one let way too much through once we slowed down on the words. The goal was not to land on a perfect draft, but to surface the questions you have to answer before you can write one of these definitions well.

Here are our top ten takeaways from the speakers' comments during the webinar:

Treat AI contracts as data contracts. Once we accepted that almost every substantive provision in an AI contract turns on data, the definitions stopped feeling like boilerplate. The data definition is what draws the line between what stays yours and what becomes the provider's to learn from, monetize, and redistribute. If you only fight over the no-training language and skip the definitions feeding into it, you are protecting an empty box.
Stop drafting like AI is deterministic software. Traditional software is deterministic. Same input, same output. Generative AI is probabilistic, and it learns from us as we use it. That means our prompts, our iterations, our feedback, and even how we use the tool are all carrying value into the system. The definitions in the contract need to reflect that, not pretend we are still licensing a calculator.
Vagueness rarely helps either side. It is tempting to think loose drafting favors the provider, since broad words give them room. In practice, vagueness creates risk for both parties because nobody knows how a vague AI clause will be read later. Tight, specific definitions give the provider real operating room and give the customer a real boundary. Aim for clarity, not posture.
Expect the technology to outrun the definition. A lot of bad AI definitions are not malicious. They were written for the chatbot world, with one user prompting and one response coming back. Connectors changed that. Agents changed it again. When you write these definitions today, build in mechanisms that let them age well, such as "including but not limited to" language and good-faith renegotiation triggers that apply when the technology materially shifts.
"Submitted by the customer" is not broad enough anymore. A definition that captures only what the customer types into a box misses what the connector pulls in and what the agent retrieves on the customer's behalf. If you are the customer, push for "submitted, accessed, or retrieved." If you are the provider, you may want this narrow scope, but you should know exactly what you are excluding before you choose it.
Inputs carry your strategy, not just your data. When customers ask AI tools to do their hardest work, the prompts, instructions, and fine-tuning files are essentially a map of how the company operates. That is competitive intelligence even when nothing in it is confidential in the traditional sense. Customer-side definitions should explicitly cover prompts, instructions, commands, attached files, and fine-tuning inputs, not just "data submitted."
Outputs are iterative, so the definition has to be too. AI does not produce one clean output. It produces a stream of intermediate responses, follow-up prompts, refinements, and end products. If your output definition only captures the final artifact, the provider can argue that everything else along the way is fair game. Make sure the definition covers the whole generation process, not just what came out at the end.
Aggregated and anonymized is not the safe harbor it sounds like. Stripping a company name off a result does not strip the value out of it. Survey results, usage patterns, and analytical findings can still expose strategy and competitive intelligence even when the source is anonymized. Apply extra scrutiny to any clause that lets the provider use your data once it has been "aggregated and anonymized," and know that re-identification is materially easier with AI in the mix.
System data is where the catchall hides. It is easy to focus the negotiation on input and output and never look hard at the system data clause. That is exactly where providers often pile in telemetry, performance metrics, usage logs, behavioral analytics, and sometimes pieces of customer data dressed up as metadata. Read this definition with the same care you give the input definition, and explicitly carve your defined input and output data out of it so nothing leaks back across.
If you cannot fix the contract, work the operations. Plenty of us are stuck with AI tools we already signed for under vague definitions. There is still real risk mitigation available. Get written confirmation from the vendor about how they actually use the data. Redact before pasting in. Use unique identifiers for sensitive records. Turn off training toggles. Look for zero-data-retention APIs. Talk to your technical admins about what the tool can and cannot scope. The contract is one lever. The deployment is another.
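
For the "redact before pasting in" and "use unique identifiers" steps in the last takeaway, here is a minimal Python sketch of what that could look like in practice. It was not shown in the webinar. The function names (redact, restore), the regex patterns, and the token format are all our own assumptions for illustration, and the patterns are deliberately simple, so treat it as a starting point rather than a complete redaction tool.

```python
import re

# Hypothetical patterns for illustration only; real data needs patterns tuned to it.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")


def redact(text, sensitive_terms=()):
    """Swap sensitive values for placeholder tokens. Returns (redacted_text, mapping)."""
    mapping = {}  # token -> original value; stays on your side, never goes to the tool
    seen = {}     # original value -> token, so a repeated value reuses one token

    def token_for(value, kind):
        if value not in seen:
            token = f"[{kind}_{len(seen) + 1}]"
            seen[value] = token
            mapping[token] = value
        return seen[value]

    # Named terms first (client names, project code names, and so on).
    # Matching is exact and case-sensitive in this sketch.
    for term in sensitive_terms:
        text = re.sub(re.escape(term), lambda m: token_for(m.group(0), "TERM"), text)

    # Then pattern-based identifiers.
    text = EMAIL.sub(lambda m: token_for(m.group(0), "EMAIL"), text)
    text = PHONE.sub(lambda m: token_for(m.group(0), "PHONE"), text)
    return text, mapping


def restore(text, mapping):
    """Put the original values back into text that came back from the tool."""
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text


prompt = (
    "Summarize the Acme Corp renewal. Contact jane.doe@acme.com "
    "or 555-123-4567. Acme Corp wants a discount."
)
safe_prompt, mapping = redact(prompt, sensitive_terms=["Acme Corp"])
print(safe_prompt)  # this is the version you would paste into the AI tool
```

The design choice that matters is that the mapping never leaves your environment. The tool only ever sees the tokens, and you call restore on whatever comes back. This does not replace a well-drafted definition, and it will miss anything the patterns and term list do not cover, but it narrows what a vague input definition can reach.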

How to Contract runs a weekly newsletter that links to upcoming webinars and shares recaps like this one from past ones. Subscribe now so the next provision makeover lands in your inbox even if you cannot join live.