In the agent structure problem, and in selection theorems more generally, there is a key distinction between behavior and structure. This distinction matters because it is agent structure that is dangerous: merely observing agent-like behavior on some subset of inputs does not necessarily imply that the behavior will generalize beyond that subset, whereas having agent-like structure does imply generalized agent-like behavior.
In psychology, behaviorism refers to the (largely historical) school of thought that attempts to understand people and animals by focusing on their externally observable behavior and responses to stimuli, rather than on introspection or the internal mechanisms of cognition.
Analogously, in the context of our agent foundations research, an AI’s behavior refers to its actions as a function of its observations. Mathematically, a function is defined as a set of ordered pairs, each pairing an input with its output, and two functions are equal exactly when their sets of ordered pairs are equal. On this definition, a function is purely extensional: there is no sense of “how” the pairs are produced.
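Written out (the symbols $f$, $g$, $X$, and $Y$ are illustrative, not from the original text), this extensional notion of equality for functions $f, g : X \to Y$ is:

$$
f = g
\quad\Longleftrightarrow\quad
\{(x, f(x)) : x \in X\} = \{(x, g(x)) : x \in X\}
\quad\Longleftrightarrow\quad
\forall x \in X,\ f(x) = g(x).
$$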
In the real world, any particular input-output behavior is produced by some physical mechanism, and two mechanisms can produce the same input-output behavior while being mechanically different on the inside. Any AI system built in reality thus has an internal structure, and it is that structure that determines how the system would act across all possible inputs, and therefore under what input conditions the system will output dangerous behavior.
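To make this concrete, here is a minimal sketch in Python (all names are hypothetical, invented for illustration): two systems whose behavior is identical on every observed input, but whose internal mechanisms differ, so that only one of them keeps behaving agent-like on inputs outside the observed set.

```python
# Two mechanically different systems with identical behavior on observed inputs.
# All names here are hypothetical, invented for illustration.

OBSERVED_INPUTS = [0, 1, 2, 3]

def agent_by_search(observation: int) -> int:
    """Structural agent: scores every candidate action and picks the best.

    The optimization mechanism runs on *every* input, so the agent-like
    behavior generalizes past the observed inputs."""
    actions = range(10)
    return max(actions, key=lambda a: -(a - observation) ** 2)

# A lookup table that memorizes the search agent's outputs on the observed inputs.
_TABLE = {obs: agent_by_search(obs) for obs in OBSERVED_INPUTS}

def agent_by_table(observation: int) -> int:
    """Behavioral mimic: replays memorized outputs and defaults to 0 elsewhere.

    No optimization happens inside, so nothing makes the agent-like
    behavior generalize."""
    return _TABLE.get(observation, 0)

# Extensionally indistinguishable on the observed inputs...
assert all(agent_by_search(o) == agent_by_table(o) for o in OBSERVED_INPUTS)
# ...but structurally different, and the behaviors diverge elsewhere:
assert agent_by_search(7) == 7
assert agent_by_table(7) == 0
```

No record of the observed input-output pairs can distinguish the two; only inspecting the internal mechanism reveals which system will continue to act agent-like on unobserved inputs.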