Bartz v. Anthropic: A Mixed Verdict and a Clearer Map
Judge Alsup's 23 June opinion in Bartz v. Anthropic is the first substantive judicial ruling on fair use in the training of a large language model. It answers the easier question and reserves the harder one. We read the opinion as a guide to where training-data risk now actually sits.
On 23 June, Judge William Alsup of the Northern District of California issued a partial summary judgment in Bartz v. Anthropic, the class action brought by three authors over Anthropic's use of their books in the training of Claude. The opinion is the first substantive judicial decision on the fair-use posture of large language model training, and it deserves the careful reading the policy press has not, on the whole, given it. Two competing headlines have done more damage to public understanding than good: one reporting that Anthropic won, the other reporting that Anthropic lost. Both are partly true. The shape of the truth, however, lies in a finding most reports skimmed — that the lawful acquisition of training inputs is the determinative legal question, and that, for the first time in a judicial opinion, the transformativeness of training is no longer seriously contested.
The case is not over. A trial is set on the question reserved by the partial summary judgment — whether Anthropic's storage of pirated copies of the plaintiffs' books, downloaded from Library Genesis and the Pirate Library Mirror, constituted infringement separate from the training use. Several other AI training cases — New York Times v. OpenAI foremost among them — remain in motion practice or discovery. We do not yet have a binding appellate ruling. What we have is a clear, careful district-court opinion from a judge whose technology cases are well-respected, on the central question whose answer has been needed for two years. The effect of the opinion on industry practice is already visible. The effect on the policy conversation around AI training and copyright will be larger still.
What the court actually decided
The plaintiffs in Bartz alleged that Anthropic committed copyright infringement by (a) copying their books into a training corpus and using that corpus to train Claude models, and (b) maintaining persistent digital copies of their books on Anthropic's systems throughout the relevant period. The complaint addressed both lawfully acquired copies (purchases by Anthropic from print and digital booksellers, scanned into machine-readable form) and unlawfully acquired copies (downloads from Library Genesis, the Pirate Library Mirror, and similar shadow-library sources).
Judge Alsup's opinion treated the case as two distinct questions. On the first question — whether the use of lawfully acquired books in training was infringing — the court granted summary judgment to Anthropic on fair-use grounds. The fair-use analysis tracks the four-factor framework of Section 107 of the Copyright Act. On the first factor (purpose and character), the court found that training is highly transformative: the inputs are books, the output is a statistical model whose function is to generate predicted tokens, and the model does not reproduce the inputs as such except in carefully constructed adversarial contexts that the parties had not shown to be commercially significant. The court relied heavily on the Second Circuit's reasoning in Authors Guild v. Google, the Google Books case, and rejected the plaintiffs' argument that the Warhol decision required a different result.
On the second factor (nature of the work), the court found that the works were creative, which weighed against fair use, but declined to give the factor decisive weight. On the third factor (amount used), it held that copying the entire work was justified by the transformative purpose, citing Google Books again. On the fourth factor (effect on the market), the court rejected the plaintiffs' argument that the existence of an emerging market for training-data licensing displaced fair use. The court reasoned, in a passage that will be widely cited, that the existence of a market for licensing a particular kind of use does not by itself foreclose the fair-use defense; if it did, the defense would be circular, vanishing whenever rightholders chose to assert that they would have charged for the use.
On the second question — whether the storage of unlawfully acquired copies was infringing — the court denied summary judgment to both sides and reserved the question for trial. The court's reasoning here is the part of the opinion the press coverage has most neglected. The court accepted, for purposes of the motion, that Anthropic's downloading of the books from Library Genesis and the Pirate Library Mirror constituted reproduction under the Copyright Act. The defense to that reproduction cannot, in the court's view, rest on the same fair-use analysis that justifies the training use; storage in a shadow library is not, in itself, transformative, and the question whether the eventual training use justifies the antecedent storage is one the court was unwilling to decide on summary judgment. The reserved question is framed narrowly: whether Anthropic's storage of pirated copies, separately from the training use, gives rise to damages.
The map this draws
The combined effect of the rulings, in our reading, is to relocate the locus of training-data risk. The risk no longer sits primarily in the question of whether training itself is fair use; on the current authority, in the Northern District of California, it is. The risk sits in the antecedent question of how the training data was acquired. That is a different problem, with different operational implications, and our clients should be planning around it.
First, the provenance of training data is now a commercial-grade compliance question. The Bartz opinion would, on its current reasoning, treat differently a training corpus assembled from lawfully purchased books, properly licensed periodicals, and openly licensed web content, on the one hand, and a corpus assembled from shadow-library downloads, scraped paywalled content, and improperly licensed code repositories, on the other. Even if the eventual training use is fair use in both cases, the upstream acts of reproduction in the second case are exposed to infringement liability that the first case is not. Clients who have inherited training corpora from earlier development cycles should be auditing those corpora for provenance, and should be in a position to demonstrate the legality of each principal source.
Second, the licensing strategies that several large laboratories have been pursuing over the last eighteen months — bulk content licenses with publishers, image-catalog licenses with stock-media providers, code licenses with open-source project foundations — are not, on this opinion, the legal price of training itself. They are insurance against the upstream provenance problem. That is a useful clarification, because it changes the deal terms that licensing should command. Licenses for training inputs do not need to be priced as if they were the difference between legal and illegal training; they need to be priced as if they were the difference between corpus-with-clean-provenance and corpus-with-mixed-provenance. The latter is a more defensible commercial proposition.
Third, the opinion provides no comfort whatsoever on output liability. The court was careful to note that the transformative-purpose finding turned on the model's ordinary operation as a probability distribution, not on what the model produces in adversarial extraction scenarios. Plaintiffs in other matters — most prominently the Times case — have alleged that current models can be induced to produce verbatim or near-verbatim reproductions of training inputs in response to crafted prompts. Bartz does not address that question. A laboratory whose models can be induced to extract training inputs faces a different fair-use analysis than a laboratory whose models cannot, and the operational mitigations that distinguish the two — output filters, training-time deduplication and memorization controls, near-duplicate detection in retrieval — remain load-bearing for legal posture.
What this means for the other cases
Bartz is not binding outside the Northern District of California, and it will not be binding within the Northern District until the Ninth Circuit takes a position. The opinion is, nevertheless, the most carefully reasoned district-court treatment of fair use in AI training to date, and other district courts are likely to follow it on the central transformativeness finding. The cases that are likely to look different — New York Times v. OpenAI, the music-publisher cases against several model developers, the visual-arts cases against image-generation developers — are those in which the output-side memorization or imitation allegations are stronger. Bartz does not foreclose those cases, and the court was explicit about not deciding them.
For the Times case in particular, the Bartz reasoning is, on balance, helpful to defendants on the input-side question, neutral on the output-side question, and silent on the contract-and-tort claims (hot-news misappropriation, contributory infringement through subscribers' use) that the Times pleaded as alternatives to the direct-infringement copyright theory. Those alternative theories were arguably underweighted in the early commentary on the case; they will be more important to watch now that the direct-infringement theory has received its first negative district-court treatment.
The European interaction
The American fair-use defense has no European analogue, and we caution clients against reading too much cross-Atlantic significance into the Bartz opinion. The relevant European framework is the text-and-data-mining exceptions in the 2019 DSM Directive, Articles 3 and 4. Article 3 permits TDM by research organizations for scientific research purposes and cannot be opted out of. Article 4 permits TDM by any person but is subject to a rightholder opt-out, which must be expressed in a machine-readable form for content made available online. The EU AI Act's general-purpose model regime layers on top of this framework an obligation to maintain a policy to respect EU copyright law, including the Article 4 opt-out, and to publish a sufficiently detailed summary of training content.
The European framework is therefore, in its operative mechanism, a notice-and-opt-out regime: rightholders who have not exercised the opt-out are taken to have permitted TDM, including for the training of commercial AI models. That is a regime structurally different from the American fair-use analysis, and a model developer operating in both markets needs to comply with both. The Bartz opinion is, in this respect, an American-specific clarification. It does not relieve a developer of the European obligation to honor the Article 4 opt-out, nor of the AI Act's training-data summary requirement, which will become enforceable in August.
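In practice, one established mechanism for expressing the Article 4 opt-out in machine-readable form is the W3C community-group TDM Reservation Protocol (TDMRep). A minimal sketch, assuming the rightholder controls the page being served (the policy URL is illustrative):

```html
<!-- TDMRep reservation at the HTML level:
     "1" means text-and-data-mining rights are reserved -->
<meta name="tdm-reservation" content="1">
<!-- Optional pointer to the rightholder's licensing terms
     (example.com URL is a placeholder) -->
<meta name="tdm-policy" content="https://example.com/tdm-policy.json">
```

The same properties can be served as HTTP response headers or in a site-wide `/.well-known/tdmrep.json` file; a compliant crawler checks for the reservation before ingesting content. Whether any given expression satisfies Article 4's "machine-readable" requirement remains a legal question, not merely a technical one.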
Operating advice for clients
We are advising clients on five operational changes in the wake of Bartz. First, conduct a provenance audit of training corpora used in any model offered or planned to be offered commercially. Identify and remediate inputs whose acquisition cannot be defended as lawful. Where remediation requires retraining or replacement, build the calendar for that work now; retroactive cleanup of corpora used in deployed models is not, by itself, a complete defense.
Second, develop and maintain a public training-data summary template that meets the EU AI Act's August obligations and that is also useful as a U.S. provenance-defense exhibit. The two purposes are not identical, but the documentation work overlaps substantially, and clients should not be producing two separate sets of training-data records for the same training run.
Third, re-evaluate licensing relationships in light of the reframing. Licenses that were procured as a fair-use hedge can, in many cases, be renegotiated on different terms now that the fair-use posture is clearer. The negotiating leverage cuts both ways: laboratories have a stronger position on the transformative-use question; rightholders have a stronger position on the provenance question. A balanced license that recognizes both is the deal we expect to see most often over the next twelve to eighteen months.
Fourth, harden output-side mitigations. Memorization tests, near-duplicate output filters, and retrieval-time deduplication controls are now part of the defensible operational posture for any model whose training inputs include copyrighted material. The cost of these mitigations is real but manageable, and they materially improve a defendant's position in output-side litigation.
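To make the near-duplicate filter concrete: a minimal sketch, assuming a simple character n-gram Jaccard comparison against indexed training snippets. All function names and the threshold are illustrative; production systems use far more scalable indexing (e.g., minhash sketches), but the comparison logic is of this shape.

```python
def char_ngrams(text: str, n: int = 8) -> set[str]:
    """Set of overlapping character n-grams, lowercased."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two n-gram sets (0.0 when either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def is_near_duplicate(output: str, training_snippets: list[str],
                      threshold: float = 0.6) -> bool:
    """True if the model output closely reproduces any indexed snippet.

    Threshold of 0.6 is a placeholder; real deployments tune it against
    measured memorization rates.
    """
    out_grams = char_ngrams(output)
    return any(jaccard(out_grams, char_ngrams(s)) >= threshold
               for s in training_snippets)

corpus = ["It was the best of times, it was the worst of times."]
# A verbatim reproduction trips the filter; unrelated text does not.
print(is_near_duplicate(
    "It was the best of times, it was the worst of times.", corpus))
print(is_near_duplicate(
    "The weather today is sunny with light wind.", corpus))
```

The design choice worth noting is that the filter operates on outputs, not inputs: it is a last-line control that works regardless of how the training corpus was assembled, which is why it complements rather than replaces the provenance audit described above.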
Fifth, prepare for shareholder and board-level conversations on training-data risk. The Bartz opinion will be read at the board level not as a definitive win but as a clarification of where the residual risk sits. Boards will want to know how that residual risk is being managed. The work of preparing a credible answer is, in our experience, best done by the general counsel and head of safety jointly, with the engineering and licensing organizations brought in as needed.
The training-data question has been the most consequential unresolved legal question in commercial AI for two years. It is not yet fully resolved. Bartz is, however, the first opinion that makes the residual risk specific enough to manage. Clients who treat the opinion as a license to relax their training-data discipline will, in our view, find that judgment retrospectively unwise. Clients who treat the opinion as a map of where the remaining risk sits, and adjust their operations accordingly, will be better placed than they were three weeks ago. That is more progress on this question than the field has seen since the first complaints were filed.