Part E - Interaction

Cognitive Models and Architectures

Introduce cognitive models of interaction
Describe the GOMS, CCT, BNF and TAG models
Describe the ACT-R and Soar cognitive architectures

"A cognitive model is an approximation to animal cognitive processes (predominantly human) for the purposes of comprehension and prediction" Wikipedia (2009)


Cognitive models help us understand human cognitive processes and how humans interact with interfaces.  Based on these models, we can make behavioral predictions.  Cognitive architectures go beyond behavioral modeling to specify structural properties of the modelled system.  These architectures help constrain our development of cognitive models. 

Cognitive models include models of

  • user's tasks and goals
  • user-system grammar
  • human motor skills

We make assumptions about the user in each of these models.  These assumptions augment those we have already made such as the distinction between short-term memory and long-term memory. 

The question of depth of decomposition arises with models of tasks and goals, which decompose into simpler parts.  We can proceed to the lowest level operations if we choose to do so, but may stop at some more abstract level.  We define the unit task as the most abstract task that the user can perform without requiring any problem solving activity on their part.  The second question that arises is where we start our analysis in the hierarchy of goals.  The higher we start, the more complex our analysis will be.  The start and depth questions are both issues of granularity. 


GOMS

The GOMS model, introduced by Card, Moran, and Newell in 1983, has served as the basis for other models and can be combined with them to make more advanced predictions.  GOMS stands for Goals, Operators, Methods and Selection. 

  • Goals
    • what the user wants to achieve
    • represent a memory point from which the user can evaluate what has been achieved and to which the user can return in the event of any error
  • Operators
    • the simplest actions the user performs in using the system
    • operators may affect the system or just the mental state of the user
    • pressing the 'X' key or reading the dialog box are examples of operators
  • Methods
    • there may be more than one way to reach a specific goal
    • for example, in a certain application the user can activate the spelling feature by selecting the spelling option from the drop-down tools menu, by clicking the spelling icon in the toolbar, or by pressing the F7 key
    • we refer to these as the MENU-METHOD, the ICON-METHOD and the F7-METHOD: three methods for the same goal
  • Selection
    • if there is more than one method available to achieve a goal, a selection must be made
    • the choice of methods usually depends upon the state of the system and on the particular user

GOMS uses hierarchies to decompose a larger goal into sub-goals.  A typical GOMS analysis breaks a high-level goal into unit tasks which are further decomposed into basic operators.  GOMS uses the term 'select' to identify a choice of methods. 

An example of a GOMS hierarchy is:

GOAL: CHECK-SPELLING
.     [select GOAL: USE-MENU-METHOD
.             .     move mouse to window header
.             .     click tools menu
.             .     click spelling option
.             GOAL: USE-ICON-METHOD
.             .     move mouse to toolbar
.             .     click spelling icon
.             GOAL: USE-F7-METHOD
.             .     press F7-key

The dots indicate the level of each goal within the hierarchy. 


By analyzing a GOMS goal structure we can create metrics of performance.  Assigning a time to each operator and summing the result has yielded estimates within 33% of the actual values.

We can use the depth of the hierarchy to measure how much information the user must store in short-term memory.

We can use the selection rules to predict the actual commands that will be used.  This has allowed predictions that were 90% accurate.
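
These analyses can be sketched in code.  The goal tree below mirrors the ICON-METHOD branch of the hierarchy above; the operator names and per-operator times are illustrative assumptions, not measured values.  Summing leaf times gives a crude execution-time estimate, and the tree depth approximates short-term memory load.

```python
# A GOMS goal tree as a nested (name, children) structure.
# Operator names and times are hypothetical, for illustration only.

OPERATOR_TIMES = {          # hypothetical seconds per operator
    "move-mouse": 1.0,
    "click": 0.2,
    "press-key": 0.3,
}

goal_tree = ("CHECK-SPELLING", [
    ("USE-ICON-METHOD", [
        ("move-mouse", []),  # move mouse to toolbar
        ("click", []),       # click spelling icon
    ]),
])

def total_time(node):
    """Sum operator times over all leaves: a crude execution-time estimate."""
    name, children = node
    if not children:
        return OPERATOR_TIMES[name]
    return sum(total_time(child) for child in children)

def depth(node):
    """Depth of the goal hierarchy: a proxy for short-term memory load."""
    _, children = node
    if not children:
        return 1
    return 1 + max(depth(child) for child in children)

print(total_time(goal_tree))
print(depth(goal_tree))      # 3
```

Swapping in the MENU-METHOD branch and comparing totals is exactly the kind of alternative-method comparison a GOMS analysis supports.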

Cognitive Complexity Theory

Kieras and Polson introduced Cognitive Complexity Theory (CCT) in 1985, shortly after the publication of GOMS.  CCT extends the GOMS model with greater predictive power.  The main use of CCT is in measuring the complexity of an interface.

CCT provides two parallel descriptions

  • the user's goals
  • the system

The description of the user's goals consists of a series of production rules of the form

 if condition
    then action
where condition is a statement about the contents of working memory and action consists of more elementary actions, which may be changes to working memory or external actions such as keystrokes.  We write these production rules in a LISP-like language. 

For example, consider the description of how a user could insert a missing space into text using the vi text editor:

 (SELECT-INSERT-SPACE
  IF (AND (TEST-GOAL perform unit task)
          (TEST-TEXT task is insert space)
          (NOT (TEST-GOAL insert space))
          (NOT (TEST-NOTE executing insert space)))
  THEN ((ADD-GOAL insert space)
        (ADD-NOTE executing insert space)
        (LOOK-TEXT task is at %LINE %COL)))

 (INSERT-SPACE-DONE
  IF (AND (TEST-GOAL perform unit task)
          (TEST-NOTE executing insert space)
          (NOT (TEST-GOAL insert space)))
  THEN ((DELETE-NOTE executing insert space)
        (DELETE-GOAL perform unit task)
        (UNBIND %LINE %COL)))

 (INSERT-SPACE-1
  IF (AND (TEST-GOAL insert space)
          (NOT (TEST-GOAL move cursor))
          (NOT (TEST-CURSOR %LINE %COL)))
  THEN ((ADD-GOAL move cursor to %LINE %COL)))

 (INSERT-SPACE-2
  IF (AND (TEST-GOAL insert space)
          (TEST-CURSOR %LINE %COL))
  THEN ((DO-KEYSTROKE 'I')
        (DO-KEYSTROKE SPACE)
        (DO-KEYSTROKE ESC)
        (DELETE-GOAL insert space)))

To see how this set of production rules works, assume that the user has just realized that they have made a mistake and that the contents of their working memory are now

 (GOAL perform unit task)
 (TEXT task is insert space)
 (TEXT task is at 5 23)
 (CURSOR 8 7)

SELECT-INSERT-SPACE will fire since it is the only option that meets all of the conditions.

The contents of working memory after the firing of SELECT-INSERT-SPACE are

 (GOAL perform unit task)
 (TEXT task is insert space)
 (TEXT task is at 5 23)
 (NOTE executing insert space)
 (GOAL insert space)
 (LINE 5)
 (COL 23)
 (CURSOR 8 7)

INSERT-SPACE-1 will fire since it is the only option that now meets all of the conditions.

The contents of working memory after the firing of INSERT-SPACE-1 are

 (GOAL perform unit task)
 (TEXT task is insert space)
 (TEXT task is at 5 23)
 (NOTE executing insert space)
 (GOAL insert space)
 (GOAL move cursor to 5 23)
 (LINE 5)
 (COL 23)
 (CURSOR 8 7)

Rules for moving the cursor (not shown here) then fire until the cursor reaches line 5, column 23.  At that point INSERT-SPACE-2 will fire since it is the only option that meets all of the conditions.

The contents of working memory after the firing of INSERT-SPACE-2 are

 (GOAL perform unit task)
 (TEXT task is insert space)
 (TEXT task is at 5 23)
 (NOTE executing insert space)
 (LINE 5)
 (COL 23)
 (CURSOR 5 23)

and the I, SPACE and ESC keys have been pressed.  Now that the insert space goal has been removed, INSERT-SPACE-DONE will fire since it is the only option that now meets all of the conditions.  This option cleans up working memory.
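
The recognize-act cycle that drives these rules can be sketched in a few lines.  In this sketch, working memory is a set of tuples and each rule pairs a hand-coded condition function with an action function, standing in for the LISP-like pattern language; only the first rule is shown.

```python
# A minimal recognize-act cycle for CCT-style production rules (a sketch).

working_memory = {
    ("GOAL", "perform unit task"),
    ("TEXT", "task is insert space"),
}

def select_insert_space_test(wm):
    # Condition: unit task in progress, nothing started yet.
    return (("GOAL", "perform unit task") in wm
            and ("TEXT", "task is insert space") in wm
            and ("GOAL", "insert space") not in wm
            and ("NOTE", "executing insert space") not in wm)

def select_insert_space_act(wm):
    # Action: add the sub-goal and a note that it is executing.
    wm.add(("GOAL", "insert space"))
    wm.add(("NOTE", "executing insert space"))

rules = [("SELECT-INSERT-SPACE",
          select_insert_space_test, select_insert_space_act)]

def cycle(wm, rules):
    """Repeatedly fire any rule whose condition matches until none does."""
    trace = []
    fired = True
    while fired:
        fired = False
        for name, test, act in rules:
            if test(wm):
                act(wm)
                trace.append(name)
                fired = True
    return trace

print(cycle(working_memory, rules))  # ['SELECT-INSERT-SPACE']
```

Because every matching rule fires on each pass, nothing in this loop forces a single active goal, which is how CCT accommodates concurrent goals.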

CCT allows us to model GOMS-like hierarchies with concurrent goals since more than one rule can be matched at the same time. 

We can use CCT to model the system as well.  In this case, we can compare our models for the user and the system to find mismatches and produce a measure of dissonance.  That is, we can use CCT to predict the difficulty in translating from the user's model to the system model.  The sheer size of a CCT description is a predictor of the complexity of the operations necessary to achieve a goal. 

Linguistic Models

Recognizing that the interaction between a user and a system is similar to a language, researchers have built models that treat interaction as one. 

Backus-Naur Form

Backus-Naur Form (BNF) was originally developed to describe the syntax of programming languages.  Reisner (1981) applied BNF to the description of dialog grammar.  We can use BNF to describe the interaction between a user and a system. 

Consider the drawing of a polyline in a graphics system.  A polyline is a line that may consist of several line segments.  The user clicks the mouse at the end of each segment and double-clicks it at the end of the line.  The BNF description of this process is:

 draw-line      ::= select-line + choose-points + last-point
 select-line    ::= position-mouse + CLICK-MOUSE
 choose-points  ::= choose-one | choose-one + choose-points
 choose-one     ::= position-mouse + CLICK-MOUSE
 last-point     ::= position-mouse + DOUBLE-CLICK-MOUSE
 position-mouse ::= empty | MOVE-MOUSE + position-mouse

The identifiers are of two types: lower case and upper case.  The lower case identifiers are non-terminals and the upper case identifiers are terminals, which represent the lowest level of user intervention.  We define the non-terminals in terms of other non-terminals and terminals using statements of the following form

 name ::= expression

The symbol '::=' denotes 'is defined as'.  Only non-terminals appear on the left of the definition operator.  The right side consists of identifiers separated by sequence ('+') and selection ('|') operators. 

BNF represents the user's actions but not the system's responses to those actions.  The complexity of a BNF description provides a crude measure of the complexity of the task.  BNF is also a good way to unambiguously specify how a user interacts with a system.
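
A BNF description like the one above can be turned directly into a recognizer.  The sketch below is a hand-written recursive-descent recognizer for the polyline grammar; the token strings mirror the grammar's terminals, and the recursive definition of position-mouse becomes a simple loop.

```python
# A recursive-descent recognizer for the polyline grammar (a sketch).

def position_mouse(tokens, i):
    # position-mouse ::= empty | MOVE-MOUSE + position-mouse
    while i < len(tokens) and tokens[i] == "MOVE-MOUSE":
        i += 1
    return i

def expect(tokens, i, terminal):
    # Consume one terminal, or fail with None.
    if i < len(tokens) and tokens[i] == terminal:
        return i + 1
    return None

def draw_line(tokens):
    # draw-line ::= select-line + choose-points + last-point
    i = position_mouse(tokens, 0)              # select-line
    i = expect(tokens, i, "CLICK-MOUSE")
    if i is None:
        return False
    count = 0                                  # choose-points: one or more
    while True:
        j = position_mouse(tokens, i)
        k = expect(tokens, j, "CLICK-MOUSE")
        if k is None:
            break
        i, count = k, count + 1
    if count == 0:
        return False
    i = position_mouse(tokens, i)              # last-point
    i = expect(tokens, i, "DOUBLE-CLICK-MOUSE")
    return i is not None and i == len(tokens)

print(draw_line(["MOVE-MOUSE", "CLICK-MOUSE",
                 "MOVE-MOUSE", "CLICK-MOUSE",
                 "DOUBLE-CLICK-MOUSE"]))       # True
```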

Task-Action Grammar

While BNF can represent the structure of a language, it cannot represent

  • consistency in commands
  • knowledge that the user has of the world

Task Action Grammar (TAG) addresses both of these problems.

Consider using BNF for the UNIX copy, move and link commands:

 copy ::= 'cp' + filename + filename | 'cp' + filenames + directory
 move ::= 'mv' + filename + filename | 'mv' + filenames + directory
 link ::= 'ln' + filename + filename | 'ln' + filenames + directory

The TAG description of these same commands makes the consistency far more apparent:

 file-op[Op] := command[Op] + filename + filename
              | command[Op] + filenames + directory
 command[Op=copy] := 'cp'
 command[Op=move] := 'mv'
 command[Op=link] := 'ln'
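
The schema notation is what makes the consistency mechanical: every value of the Op parameter instantiates the same rule pair.  A sketch of that expansion, assuming just the three commands above:

```python
# Expand the TAG schema file-op[Op] into plain BNF-style rules (a sketch).

commands = {"copy": "cp", "move": "mv", "link": "ln"}

def expand(op):
    """Instantiate file-op[Op] for one operation."""
    cmd = commands[op]
    return [f"{op} ::= '{cmd}' + filename + filename",
            f"{op} ::= '{cmd}' + filenames + directory"]

for op in commands:
    for rule in expand(op):
        print(rule)
```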

TAG can represent world knowledge.  Consider the following set of statements:

 movement[Direction] := command[Direction] + distance + RETURN
 command[Direction=forward]  := 'go 395'
 command[Direction=backward] := 'go 013'
 command[Direction=left]     := 'go 712'
 command[Direction=right]    := 'go 956'

We can rewrite the set to perform known actions:

 movement[Direction] := command[Direction] + distance + RETURN
 command[Direction=forward]  := 'FORWARD'
 command[Direction=backward] := 'BACKWARD'
 command[Direction=left]     := 'LEFT'
 command[Direction=right]    := 'RIGHT'

This second form takes advantage of words (forward, backward, etc.) that the user already knows.  We rewrite this set to show the information that the user already knows and does not have to learn:

 movement[Direction] := command[Direction] + distance + RETURN
 command[Direction]  := known-item[Type=word,Direction]
 * command[Direction=forward]  := 'FORWARD'
 * command[Direction=backward] := 'BACKWARD'
 * command[Direction=left]     := 'LEFT'
 * command[Direction=right]    := 'RIGHT'

The rules marked with asterisks can be generated from the second rule combined with the user's current knowledge.

Device Models

BNF and TAG were designed for command line interfaces.  The model for pressing a button is relatively straightforward.  The model for moving a mouse one pixel is less obvious. 

In GUI systems, buttons are virtual and depend upon what is displayed at a particular screen position.  The keystroke model allows us to model low-level interaction with a device. 

Keystroke Level Model

The keystroke level model is used to model simple interaction sequences on the order of a few seconds.  It does not extend to more complex operations such as producing an entire diagram. 

The model decomposes actions into 5 motor operators, a mental operator and a response operator. 

  1. K keystroke operator
  2. B pressing a mouse button
  3. P pointing or moving the mouse over a target
  4. H homing or switching the hand between mouse and keyboard
  5. D drawing lines with the mouse
  6. M mentally preparing for a physical action
  7. R system response (the user does not always wait for this as happens in continuous typing)

Consider using a mouse based editor to correct a single character error

  • point at the error
  • delete the character
  • retype it
  • return to the previous typing point

The following sequence will capture this:

 1. Move hand to mouse               H[mouse]
 2. Position after bad character     PB[LEFT]
 3. Return to keyboard               H[keyboard]
 4. Delete character                 MK[DELETE]
 5. Type correction                  K[char]
 6. Reposition insert point          H[mouse]MPB[LEFT]

We can measure timings for individual operators and then sum them to estimate the total time for the overall operation.  We can also compute times for alternative ways of reaching a goal and compare them to find which is more efficient.
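
Summing the sequence above is straightforward.  The per-operator times below are the commonly cited Card, Moran, and Newell estimates; treat them as rough averages rather than exact constants.

```python
# Keystroke-level time estimate for the error-correction sequence above.
# Operator times are the commonly cited rough averages (seconds).

TIMES = {
    "K": 0.2,    # keystroke (average typist)
    "B": 0.1,    # press or release a mouse button
    "P": 1.1,    # point at a target with the mouse
    "H": 0.4,    # home hands between keyboard and mouse
    "M": 1.35,   # mental preparation
}

# Steps 1-6 above: H PB H MK K HMPB
sequence = ["H", "P", "B", "H", "M", "K", "K", "H", "M", "P", "B"]

total = sum(TIMES[op] for op in sequence)
print(f"{total:.2f} s")   # 6.70 s
```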

Three State Model

Pointing devices like mice, trackballs, and light pens all behave differently as far as the user is concerned.  The three-state model captures the behavior of these devices:

  • State 0:
    • the device not touching the screen
    • the location of the device is not tracked at all
  • State 1:
    • moving the device with no buttons pressed
    • usually moves the pointer on the screen
  • State 2:
    • depressing a button over an icon and then moving the device
    • usually thought of as dragging an object

A touch screen is a state 0-1 device.  A mouse is a state 1-2 device.  A tablet with a stylus is a state 0-1-2 device.

[Figure: the three-state model of pointing devices]

A touch screen behaves like a light pen with a button to press.  This means that a touch screen is in state 0 when the finger is off the screen.  When the finger touches the screen, it is in state 1 and can be tracked.  If the user applies extra pressure, this can simulate a button press.  In this case, a touch-screen is a state 0-1-2 device. 
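
The model is naturally expressed as a transition table; which states a device can occupy determines its class (0-1, 1-2, or 0-1-2).  The event names in this sketch are illustrative assumptions.

```python
# The three-state model as a transition table (a sketch).

TRANSITIONS = {
    (0, "touch"):   1,   # e.g. finger or stylus touches the surface
    (1, "lift"):    0,
    (1, "press"):   2,   # button down (or extra pressure) starts dragging
    (2, "release"): 1,
}

def run(start, events):
    """Apply a sequence of events, ignoring impossible transitions."""
    state = start
    for event in events:
        state = TRANSITIONS.get((state, event), state)
    return state

# A mouse is a state 1-2 device: it starts in state 1 and never reaches 0.
print(run(1, ["press", "release"]))   # 1
# A stylus tablet is a 0-1-2 device: it can pass through all three states.
print(run(0, ["touch", "press"]))     # 2
```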

Fitts's Law

Fitts's law is important with such devices.  The law states that the time to move a pointer to a target of size S at a distance D from the starting point is

 a + b log2(D/S + 1)

where a and b are empirical constants dependent on the type of pointing device and the skill of the user. 

The insight provided by the three-state model is that a and b also depend upon the state of the device: dragging is more accurate than the original pointing which does not have as good feedback. 
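
The law is easy to apply once a and b are known.  In this sketch the constants are hypothetical; real values are fitted empirically for each device and user.

```python
import math

# Fitts's law as a function (a sketch; a and b are hypothetical constants).

def movement_time(distance, size, a=0.1, b=0.15):
    """Predicted time in seconds to hit a target of width `size` at `distance`."""
    return a + b * math.log2(distance / size + 1)

# A smaller target at the same distance takes longer to acquire:
print(movement_time(200, 10))   # index of difficulty log2(21)
print(movement_time(200, 20))   # index of difficulty log2(11)
```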

Problem Space Model

Many models involve specifying a sequence of tasks leading to a specific goal; this is the behavior that we expect of a knowledge based system.  The problem space model instead treats problem solving as a search: the solver traverses a space of possible states until a solution is found. 

A problem space model consists of

  • a set of states
  • a set of operations to go from one state to another
  • a goal, defined as a subset of states; the goal is achieved when any of these states is reached

To solve a problem in this model, we

  • identify the current state
  • identify the goal
  • devise a set of operations which will move from the current state to the goal state

This model is inherently recursive: if we cannot find the operations to achieve this goal then this becomes a new recursive problem to be solved.
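
The three steps above map directly onto a search procedure.  The sketch below uses breadth-first search over a tiny invented state space; the state names and operations are illustrative assumptions.

```python
from collections import deque

# The problem space model as breadth-first search (a sketch).

operations = {                  # state -> states reachable in one operation
    "start":  ["middle", "detour"],
    "middle": ["goal"],
    "detour": [],
    "goal":   [],
}
goal_states = {"goal"}

def solve(current):
    """Return a sequence of states from `current` into the goal set."""
    queue = deque([[current]])
    seen = {current}
    while queue:
        path = queue.popleft()
        if path[-1] in goal_states:     # identify the goal
            return path
        for nxt in operations[path[-1]]:
            if nxt not in seen:         # devise operations to move onward
                seen.add(nxt)
                queue.append(path + [nxt])
    return None                         # no operations reach the goal

print(solve("start"))   # ['start', 'middle', 'goal']
```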

Cognitive Architectures

Cognitive architectures describe structural properties of a cognitive system.  These architectures presume that processes are implemented in the system and describe the properties of that system as a whole.  Some researchers assume a computational system, while others assume a connectionist system. 

Architectures differ from cognitive models in that models focus on one particular competence, while architectures tend to be based on a set of generic rules.  Cognitive models without explicit architectures imply some underlying model of the mental processes of the user.  For example, GOMS implies a divide and conquer methodology. 

Adaptive Control of Thought - Rational

ACT-R assumes that each task that a human performs consists of a sequence of discrete operations.  ACT-R is implemented in the LISP programming language as an interpreter for its own modeling language.  We create our own model of human cognition in the form of a script.  Running the model simulates human behavior by specifying each cognitive operation (memory operations, perceptual operations, motor operations, etc.). 

ACT-R assumes that human knowledge consists of declarative and procedural memory, which are irreducible to one another.  ACT-R contains two types of specialized modules:

  • perceptual-motor modules - interface with the real world
  • memory modules - facts and productions

ACT-R belongs to the symbolic or computational approach to cognition.  Its entities - chunks and productions - are discrete and its operations are syntactical. 

ACT-R has been used to model natural language understanding and production, to capture how humans solve complex problems, to predict patterns of brain activation, and as a foundation for cognitive tutors. 


Soar

Laird, Newell, and Rosenbloom developed Soar at Carnegie Mellon and first published it in 1983.  SOAR originally stood for State, Operator, And Result.  The Soar development community subsequently abandoned this acronym but retained the word.  Soar is now maintained by John Laird's research group at the University of Michigan. 

Soar is a general cognitive architecture for developing systems that exhibit intelligent behavior.  It is based on the physical symbol system hypothesis, which states that

"A physical symbol system has the necessary and sufficient means for general intelligent action." (Newell and Simon, 1976)
This idea has philosophical roots in Hobbes, Leibniz, Hume, and Kant.  Its latest version is the computational theory of mind associated with Hilary Putnam and Jerry Fodor. 

Recent versions of Soar have incorporated non-symbolic representations and processes including reinforcement learning, imagery processing, and emotion modeling.