A Vocabulary For Expressing AI Usage Preferences

Internet-Draft	AI Preferences	June 2025
Keller & Thomson	Expires 7 December 2025

Abstract

This document proposes a standardized vocabulary for expressing preferences related to how digital assets are used by automated processing systems. This vocabulary allows for the creation of structured declarations about restrictions or permissions for use of digital assets by such systems. The vocabulary is agnostic to the means by which it is conveyed. The definitions in the vocabulary facilitate a shared understanding between entities that express such preferences and those that use the associated digital assets.¶

1. Introduction

This document defines a common vocabulary of terms for automated systems that process digital assets. The primary purpose of this vocabulary is to enable machine-readable expressions of preferences about how digital assets are used by automated processing systems, in the context of training AI models and other forms of text and data mining (TDM).¶

The terms defined by the vocabulary can be used to describe, in a standardized way, the types of uses that a declaring party may wish to explicitly restrict or allow. Preferences are then expressed as a grant or denial of permission concerning each of the types of use defined in the vocabulary. This ensures that preferences can be communicated, processed, and stored in a consistent and interoperable manner.¶

The vocabulary is neutral to the technical details of how systems act on preferences. It is designed to ensure that preference information can be exchanged between different systems and consistently understood.¶

The vocabulary is intended to govern the use of digital assets for the training of AI models and other forms of automated processing. It does not concern itself with the mechanisms involved in obtaining digital assets (i.e., crawling).¶

The vocabulary is intended to work in contexts where such preferences result in legal obligations (such as rights reservations made by rightholders in jurisdictions with conditional TDM exceptions), and in contexts where this is not the case. It is without prejudice to applicable laws and the applicability of exceptions and limitations to copyright.¶

3. Statements of Preference

The vocabulary is a set of categories, each of which is defined to cover a class of usage for assets. Section 4 defines these categories in more detail.¶

A statement of preference is made about an asset. Statements of preferences can assign preferences to each of the categories of use in the vocabulary. Preferences regarding each category can be expressed either to allow or disallow the usage associated with the category.¶

A statement of preferences can express preferences about some, all, or none of the categories from the vocabulary. This can mean that no preference is expressed for a given usage category.¶

Some categories describe a proper subset of the usages of other categories. A preference that is expressed for the more general category applies if no preference is expressed for the more specific category.¶

For example, the TDM category might be assigned a preference that allows the associated usage. In the absence of any statement of preference regarding the AI Training category, that usage would also be allowed, as AI Training is a subset of the TDM category. In comparison, an explicit preference regarding AI Training might disallow that usage, while permitting other usage within the TDM category.

After processing a statement of preferences, the recipient can assume that each category of use has a preference in one of three states: "allowed", "disallowed", or "unknown".
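The fallback from a specific category to a more general one can be sketched as follows. This is a minimal illustration, not part of the vocabulary: the function name `effective_preference` and the dictionary representation are assumptions, and the labels and hierarchy are borrowed from Table 2 later in this document.

```python
from typing import Optional

# Parent ("more general") category for each label, following the
# "Subset Of" column of Table 2; None marks the top of the hierarchy.
PARENT = {"tdm": None, "ai": "tdm", "genai": "ai", "inference": "tdm"}


def effective_preference(category: str, explicit: dict) -> str:
    """Return "allowed", "disallowed", or "unknown" for one category.

    `explicit` (a hypothetical representation) maps labels to True (allow)
    or False (disallow); categories with no expressed preference are absent.
    """
    current: Optional[str] = category
    while current is not None:
        if current in explicit:
            return "allowed" if explicit[current] else "disallowed"
        current = PARENT[current]  # fall back to the more general category
    return "unknown"


# TDM is allowed, AI Training is explicitly disallowed.
prefs = {"tdm": True, "ai": False}
states = {label: effective_preference(label, prefs) for label in PARENT}
```

Here Generative AI Training inherits the "disallowed" state from AI Training, while AI Inference inherits "allowed" from TDM. With an empty `explicit` mapping, every category resolves to "unknown".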

4. Vocabulary Definition

This section defines the categories of use in the vocabulary.¶

Figure 1 shows the relationship between these categories:¶

Figure 1: Relationship Between Categories of Use

This list of specific use cases may be expanded in the future, should a consensus emerge between stakeholders, to include categories that address additional use cases as they emerge. In addition to these categories defined in the vocabulary, it is also expected that some systems implementing this vocabulary may extend this list with additional categories for their particular needs.¶

4.1. Text and Data Mining (TDM) Category

The act of using one or more assets in the context of any automated analytical technique aimed at analyzing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations.¶

The use of assets for TDM encompasses all the subsequent categories.¶

4.2. AI Training Category

The act of training machine learning models or artificial intelligence (AI).¶

The use of assets for AI Training is a proper subset of TDM usage.¶

4.3. Generative AI Training Category

The act of training General Purpose AI models that have the capacity to generate text, images or other forms of synthetic content, or the act of training other types of AI models that have the purpose of generating text, images or other forms of synthetic content.¶

The use of assets for Generative AI Training is a proper subset of AI Training usage.¶

4.4. AI Inference Category

The act of using one or more assets as input to a trained AI/ML model as part of the operation of that model (as opposed to the training of the model).¶

5. Usage

The vocabulary is used by referencing the terms defined in Section 4, directly or via mappings, in accordance with how they are defined in this document.

5.1. More Specific Instructions

A recipient of a statement of preferences that follows this model might receive more specific instructions in two ways:¶

Extensions to the vocabulary might define more specific categories of usage. Preferences about more specific categories override those of any more general category.¶
Statements of preferences are general purpose, machine-readable statements that cannot override contractual agreements or more specific statements.¶

For instance, a statement of preferences might indicate that the use of an asset is disallowed for AI Training. If arrangements, such as contracts, exist that explicitly permit the use of that asset, those arrangements likely apply, unless the terms of the arrangement explicitly say otherwise.

The vocabulary does not preclude the use of other specific categories. Any statement of preference based on this vocabulary shall not be interpreted as restricting the use of the work(s) strictly for the purpose of search and discovery as long as no restriction is declared through search-specific means such as [RFC9309].¶

5.2. Vocabulary Extensions

Systems referencing the vocabulary must not introduce additional categories that include existing categories defined in the vocabulary or otherwise include additional hierarchical relationships.¶

6. Exemplary Serialization Format

This section defines an exemplary serialization format for preferences. The format describes how the abstract model could be turned into Unicode text or a sequence of bytes.

The format relies on the Dictionary type defined in Section 3.2 of [FIELDS]. The dictionary keys correspond to usage categories and the dictionary values correspond to explicit preferences, which can be either y or n; see Section 6.2.¶

For example, the following is a preference to allow AI training (Section 4.2), disallow generative AI training (Section 4.3), and state no preference for any other category, except for subsets of these categories:

ai=y, genai=n

6.1. Usage Category Labels

Each usage category in the vocabulary (Section 4) is mapped to a short textual label. Table 1 tabulates this mapping.¶

Table 1: Mappings for Categories
Category	Label	Reference
Text and Data Mining	tdm	Section 4.1
AI Training	ai	Section 4.2
Generative AI Training	genai	Section 4.3
AI Inference	inference	Section 4.4

Any mapping for a new usage category can only use lowercase Latin characters (a-z), digits (0-9), "_", "-", ".", or "*". These are encoded using the mappings in [ASCII].
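A label check along these lines could be sketched as below. The function name `is_valid_label` is hypothetical. The first-character rule (a lowercase letter or "*") is not stated in this section; it comes from the key syntax for Dictionaries in [FIELDS], which labels must also satisfy because they are used as dictionary keys.

```python
import re

# Allowed characters per this section: a-z, 0-9, "_", "-", ".", "*".
# The first character is limited to a-z or "*" by the key syntax in [FIELDS].
LABEL_RE = re.compile(r"[a-z*][a-z0-9_.*-]*")


def is_valid_label(label: str) -> bool:
    """Return True if `label` could be used as a usage category label."""
    return LABEL_RE.fullmatch(label) is not None


checks = {
    "genai": is_valid_label("genai"),    # existing label from Table 1
    "GenAI": is_valid_label("GenAI"),    # uppercase is not permitted
    "9lives": is_valid_label("9lives"),  # cannot start with a digit
}
```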

6.2. Preference Labels

The abstract model used has two options for preferences associated with each category: allow and disallow. These are mapped to single-byte Tokens (Section 3.3.4 of [FIELDS]) of y and n, respectively.
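As an illustration, a set of explicit preferences could be turned into this format roughly as follows. The function `serialize_preferences` and its input representation are assumptions made for this sketch. It does not validate labels and does not emit parameters.

```python
def serialize_preferences(prefs: dict) -> bytes:
    """Serialize explicit preferences as a Dictionary of y/n Tokens.

    `prefs` (a hypothetical representation) maps category labels to
    True (allow) or False (disallow). Categories with no preference
    are left out of the mapping and therefore out of the output.
    """
    members = [f"{label}={'y' if allowed else 'n'}" for label, allowed in prefs.items()]
    # Dictionary members are separated by a comma and a space.
    return ", ".join(members).encode("ascii")


encoded = serialize_preferences({"ai": True, "genai": False})
```

For the mapping shown, `encoded` holds the bytes of `ai=y, genai=n`.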

6.3. Text Encoding

Structured Fields [FIELDS] describes a byte-level encoding of information, not a text encoding. This makes this format suitable for inclusion in any protocol or format that carries bytes.¶

Some formats are defined in terms of strings rather than bytes. These formats might need to decode the bytes of this format to obtain a string. As the syntax is limited to ASCII [ASCII], an ASCII decoder or UTF-8 decoder [UTF8] can be used. This results in the strings that this document uses.¶

Processing (see Section 6.5) requires a sequence of bytes, so any format that uses strings needs to encode strings first. Again, this process can use ASCII or UTF-8.¶

6.4. Syntax Extensions

There are two ways by which this syntax might be extended: the addition of new labels and the addition of parameters.¶

New labels might be defined to correspond to new usage categories. Section 5.2 addresses the considerations for defining new categories. New labels might also be defined for other types of extension that do not assign a preference to a usage category. In either case, when processing a parsed Dictionary to obtain preferences, any unknown labels MUST be ignored.¶

The Dictionary syntax (Section 3.2 of [FIELDS]) can associate parameters with each key-value pair. This document does not define any semantics for any parameters that might be included. When processing a parsed Dictionary to obtain preferences, any unknown parameters MUST be ignored.¶

6.5. Processing Algorithm

To process a series of bytes to recover the expressed preferences, those bytes are parsed into a Dictionary (Section 4.2.2 of [FIELDS]), then preferences are assigned to each usage category in the vocabulary.¶

The parsing algorithm for a Dictionary produces a keyed collection of values, each with a possibly-empty set of parameters. The parsing process guarantees that each key has at most one value and parameters.¶

To obtain preferences for each of the categories in the vocabulary, iterate through the categories. For the label that corresponds to that category (see Table 1), obtain the corresponding value from the collection, disregarding any parameters. A preference is assigned as follows:¶

If the value is a Token with a value of y, the associated preference is to allow that category of use.¶
If the value is a Token with a value of n, the associated preference is to disallow that category of use.¶
Otherwise, a preference is not expressed for that category of use.¶

Note that this last alternative includes the key being absent from the collection, values that are not Tokens, and Token values that are other than y or n. None of these are errors; they only result in no preference being inferred.
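The assignment step can be sketched as follows, assuming the bytes have already been parsed into a Dictionary by a Structured Fields parser. The `Token` class is a stand-in for whatever Token type such a parser provides, and `assign_preferences` and the `(value, parameters)` tuple shape are likewise assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Stand-in for a Structured Fields Token; a real parser has its own type."""
    value: str


# Labels from Table 1.
LABELS = ("tdm", "ai", "genai", "inference")


def assign_preferences(parsed: dict) -> dict:
    """Map each known label to True (allow), False (disallow), or None.

    `parsed` maps keys to (value, parameters) pairs. Parameters and
    unknown keys are ignored.
    """
    result = {}
    for label in LABELS:
        value, _params = parsed.get(label, (None, {}))
        if value == Token("y"):
            result[label] = True
        elif value == Token("n"):
            result[label] = False
        else:
            # Absent key, non-Token value, or a Token other than y or n.
            result[label] = None
    return result


parsed = {
    "ai": (Token("y"), {}),
    "genai": ("n", {}),          # a String, not a Token
    "tdm": (Token("maybe"), {}),  # a Token, but neither y nor n
    "other": (Token("y"), {}),    # unknown label
}
preferences = assign_preferences(parsed)
```

Only `ai` yields a preference here. The other three known labels end up with no preference, and the unknown `other` key does not appear in the result.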

An important note about this process and format is that, if the same key appears multiple times, only the last value is taken. This means that duplicating the same key could result in unexpected outcomes. For example, the following expresses no preferences:¶

ai=y, ai="n", genai=n, genai, tdm=n, tdm=()

If the parsing of the Dictionary fails, no preferences are expressed. This includes cases where keys contain uppercase characters, as this format is case sensitive (more correctly, it operates on bytes, not strings).

6.6. Alternative Formats

This format is only an exemplary way to represent preferences. The model described in Section 3 can be used without this serialization.

8. IANA Considerations

This document establishes a new standalone IANA registry entitled "Automated Processing Categories of Use".¶

This registry operates under the "Specification Required" policy; see Section 4.6 of [RFC8126].¶

New entries in this registry require the following information:¶

Label: A short label that meets the constraints in Section 6.1.
Title: A short title.
Subset Of: The category of use that this category is a proper subset of. This needs to refer to another entry in the registry by its label, or be the string "(none)".
Reference: A link to a document that contains a precise definition for the category of use.

The Title and Label fields need to be separately unique across all items in the registry.¶

8.1. Initial Registry Contents

The registry is seeded with the values in Table 2.¶

Table 2: Initial Registry Contents
Label	Title	Subset Of	Reference
tdm	Text and Data Mining	(none)	Section 4.1
ai	AI Training	tdm	Section 4.2
genai	Generative AI Training	ai	Section 4.3
inference	AI Inference	tdm	Section 4.4
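The structural constraints on registry entries (separately unique labels and titles, and a Subset Of field that names an existing entry without forming a loop) lend themselves to a mechanical check. The following is a rough sketch under assumed names (`REGISTRY`, `check_registry`) and an assumed in-memory representation. It does not describe any tooling that IANA or assigned experts actually use.

```python
NONE = "(none)"

# Initial registry contents from Table 2 (Reference column omitted).
REGISTRY = [
    {"label": "tdm", "title": "Text and Data Mining", "subset_of": NONE},
    {"label": "ai", "title": "AI Training", "subset_of": "tdm"},
    {"label": "genai", "title": "Generative AI Training", "subset_of": "ai"},
    {"label": "inference", "title": "AI Inference", "subset_of": "tdm"},
]


def check_registry(entries: list) -> list:
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    labels = [e["label"] for e in entries]
    titles = [e["title"] for e in entries]
    if len(set(labels)) != len(labels):
        problems.append("labels are not unique")
    if len(set(titles)) != len(titles):
        problems.append("titles are not unique")

    parent = {e["label"]: e["subset_of"] for e in entries}
    for label in parent:
        seen = set()
        current = label
        # Walk up the Subset Of chain until "(none)" is reached.
        while current != NONE:
            if current in seen:
                problems.append(f"{label}: Subset Of chain forms a loop")
                break
            if current not in parent:
                problems.append(f"{label}: Subset Of refers to unknown label {current}")
                break
            seen.add(current)
            current = parent[current]
    return problems


initial_problems = check_registry(REGISTRY)
```

For the initial contents, `initial_problems` is empty. An entry whose Subset Of chain leads back to itself, or names a label that is not registered, would be reported.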

8.2. Registration Guidance

Assigned experts are responsible for ensuring that new items meet the technical requirements for the protocol. This minimally includes the syntax restrictions on labels (Section 6.1) and the potential for impossible reference loops in the Subset Of field. However, there are further special considerations for this registry that involve some exercise of discretion on the part of assigned experts.¶

Expressions of preference are most effective when they have a shared understanding. Keeping the number of items in the vocabulary limited is one of the best ways to ensure that there is wide understanding. That means that experts are instructed to reject registrations if there is any doubt regarding:

the scope of the proposed category of use,¶
the relationship between the proposed category of use and existing categories of use, or¶
the importance of being able to express preferences regarding the uses covered by the proposed category.¶

This involves a degree of judgment not ordinarily asked of assigned experts. Rather than expect an assigned expert to be able to resolve difficult cases, any case where there is doubt MUST be referred to the IETF to resolve. In effect, this ensures that contested registration requests are elevated to require "IETF Review"; see Section 4.8 of [RFC8126].¶

To aid in the decision-making process, new registration requests MUST be copied to the "aicontrol@ietf.org" list for at least three weeks of discussion before a decision is made. Experts are expected to use input from that list to inform their decision.¶

9. References

9.1. Normative References

[ASCII]: Cerf, V., "ASCII format for network interchange", STD 80, RFC 20, DOI 10.17487/RFC0020, October 1969, <https://www.rfc-editor.org/rfc/rfc20>.
[FIELDS]: Nottingham, M. and P. Kamp, "Structured Field Values for HTTP", RFC 9651, DOI 10.17487/RFC9651, September 2024, <https://www.rfc-editor.org/rfc/rfc9651>.
[RFC2119]: Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/rfc/rfc2119>.
[RFC8126]: Cotton, M., Leiba, B., and T. Narten, "Guidelines for Writing an IANA Considerations Section in RFCs", BCP 26, RFC 8126, DOI 10.17487/RFC8126, June 2017, <https://www.rfc-editor.org/rfc/rfc8126>.
[RFC8174]: Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/rfc/rfc8174>.
[RFC9309]: Koster, M., Illyes, G., Zeller, H., and L. Sassman, "Robots Exclusion Protocol", RFC 9309, DOI 10.17487/RFC9309, September 2022, <https://www.rfc-editor.org/rfc/rfc9309>.

9.2. Informative References

[EUCD2019]: European Union, "Directive (EU) 2019/790 of the European Parliament and of the Council of 17 April 2019 on copyright and related rights in the Digital Single Market", 17 May 2019, <https://eur-lex.europa.eu/eli/dir/2019/790/oj>.
[UTF8]: Yergeau, F., "UTF-8, a transformation format of ISO 10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November 2003, <https://www.rfc-editor.org/rfc/rfc3629>.