Assistivox SLIStack

Assistivox SLIStack is the architecture Assistivox uses to implement the Spoken Language Interface (SLI). It sits between the abstract SLI model and the concrete user interface code. Its purpose is to define how actions, state, and modalities are structured so that spoken language is a native way to operate the system.

1. Architectural goal

The goal of SLIStack is to give Assistivox a single, coherent action model that can be reached through multiple input and output paths. Every meaningful user action should exist as a command action in the architecture. Each command action should be reachable through at least one non-spoken path and one spoken path.

This approach keeps the action space stable while allowing the surrounding command languages and interface surfaces to evolve. It also lets Assistivox treat speech as one of several parallel ways to invoke the same underlying capabilities.

2. Command actions and hooks

The basic unit in SLIStack is the command action. A command action represents something a user can cause the system to do: open a document, move the insertion point, apply a style, run a transformation, save, cancel, or inspect context.

Command actions are exposed through command hooks. A command hook is the system-level entry point for an action. It is what user interface surfaces bind to. A single hook can be bound to:

a button or toggle in a graphical interface,
an entry in a text or menu-based interface,
a keyboard shortcut,
a typed command,
a spoken command handled by SLI.

The visible control is therefore one presentation of the action. The hook is the action itself.
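The relationship between hooks and surfaces can be illustrated with a minimal sketch. It is not the Assistivox implementation: the names CommandHook, HookRegistry, and document.save, and the surface labels, are hypothetical and chosen only to show that every surface holds a reference to the same entry point.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CommandHook:
    """System-level entry point for one command action (hypothetical shape)."""
    name: str
    run: Callable[[], str]
    bindings: List[str] = field(default_factory=list)  # labels of bound surfaces


class HookRegistry:
    """Holds hooks by name so every surface invokes the same entry point."""

    def __init__(self) -> None:
        self._hooks: Dict[str, CommandHook] = {}

    def register(self, name: str, run: Callable[[], str]) -> CommandHook:
        hook = CommandHook(name, run)
        self._hooks[name] = hook
        return hook

    def bind(self, name: str, surface: str) -> None:
        # A surface only records that it points at the hook;
        # it carries no behavior of its own.
        self._hooks[name].bindings.append(surface)

    def bindings_for(self, name: str) -> List[str]:
        return list(self._hooks[name].bindings)

    def invoke(self, name: str) -> str:
        return self._hooks[name].run()


registry = HookRegistry()
registry.register("document.save", lambda: "saved")
registry.bind("document.save", "gui:button:Save")
registry.bind("document.save", "keyboard:Ctrl+S")
registry.bind("document.save", "spoken:save document")

# Whichever surface triggers it, the same hook runs.
result = registry.invoke("document.save")
```

In this sketch, removing or redesigning a button changes only the bindings list; the hook and its behavior are untouched.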

3. Why actions come first

In SLIStack, command actions are defined before the spoken command language. Graphical controls are discrete and easy to place on screen. It is straightforward to add a button or toggle once an action exists. A spoken command language is more brittle. Spoken commands must be recognized reliably from speech and must remain usable over time, which means they depend on a clear and reasonably stable action space.

For that reason, Assistivox defines the command actions and hooks first. Only after the action set is coherent does it define the initial spoken phrasing that will trigger those hooks.

4. Lexicon layer for spoken commands

On top of the action model, SLIStack provides a lexicon layer for spoken commands. The lexicon maps spoken phrases to command actions. A single command action may have multiple entries in the lexicon. That allows the spoken command language to support more than one natural phrasing for the same action.

The lexicon is separate from dictation. Dictation concerns turning speech into text. The lexicon concerns turning speech into actions. Keeping these concerns separate allows command phrasing and synonyms to evolve without changing the underlying action model.

In early versions, the lexicon can be implemented as direct mappings from normalized phrases to command actions. Later versions can enrich this layer with intent recognition.
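A direct-mapping lexicon of the kind described for early versions might look like the following sketch. The normalization rules (lowercasing, stripping punctuation, collapsing whitespace), the phrases, and the action names are assumptions made for illustration, not the actual Assistivox lexicon.

```python
import re
from typing import Dict, Optional


def normalize(phrase: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace (assumed rules)."""
    cleaned = re.sub(r"[^\w\s]", "", phrase.lower())
    return " ".join(cleaned.split())


# Hypothetical entries: several phrasings may point at the same command action.
LEXICON: Dict[str, str] = {
    "save": "document.save",
    "save document": "document.save",
    "save the document": "document.save",
    "cancel": "session.cancel",
    "never mind": "session.cancel",
}


def lookup(utterance: str) -> Optional[str]:
    """Return the command action name for an utterance, or None if unmapped."""
    return LEXICON.get(normalize(utterance))


first = lookup("Save the document.")
second = lookup("Never  mind!")
unmapped = lookup("paint it blue")
```

Adding a synonym here is a one-line change to the table; no command action or hook has to change.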

5. AI-assisted command resolution

In this design, AI assists command resolution. It does not define the action space. The model operates over an existing set of command actions and hooks. When confidence in a mapping is above a configured threshold, the system may execute the corresponding action. When confidence is below that threshold, the command handler should not guess. It should respond through SLI that the command was not understood and preserve the current state.

Spoken command handling in SLIStack should therefore obey two rules:

High-confidence mappings may execute the associated command action.
Low-confidence or unmapped utterances must produce explicit feedback through SLI rather than silent failure.

In this respect, spoken interaction should behave like a command environment. An unrecognized command should produce a clear message, not ambiguous or hidden behavior.
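The two rules can be sketched as a small handler. Everything named here is hypothetical: the Resolution type, the stub resolver standing in for an AI model, the 0.8 threshold, and the feedback wording. The sketch also assumes that a confidence exactly equal to the threshold counts as high, which the text above leaves open.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Resolution:
    action: Optional[str]  # candidate command action name, or None if unmapped
    confidence: float      # resolver's confidence, assumed to lie in [0.0, 1.0]


def handle_utterance(
    utterance: str,
    resolve: Callable[[str], Resolution],
    hooks: Dict[str, Callable[[], str]],
    threshold: float = 0.8,  # assumed value; "configured" in a real system
) -> str:
    """Run a confidently mapped action, or return explicit feedback."""
    res = resolve(utterance)
    # Assumption: confidence equal to the threshold is treated as high.
    if res.action in hooks and res.confidence >= threshold:
        return hooks[res.action]()
    # Low confidence or unmapped: do not guess, leave state untouched.
    return f'Command not understood: "{utterance}". Nothing was changed.'


saved: List[str] = []  # stands in for application state


def save() -> str:
    saved.append("doc")
    return "Document saved."


def stub_resolve(utterance: str) -> Resolution:
    # Stand-in for a model that only chooses among existing actions.
    table = {
        "save document": Resolution("document.save", 0.95),
        "safe dock": Resolution("document.save", 0.40),
    }
    return table.get(utterance, Resolution(None, 0.0))


hooks = {"document.save": save}
ok = handle_utterance("save document", stub_resolve, hooks)
unsure = handle_utterance("safe dock", stub_resolve, hooks)
unknown = handle_utterance("paint it blue", stub_resolve, hooks)
```

Only the first call changes state. The other two return a message that SLI can speak, which is the spoken equivalent of a shell reporting an unknown command.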

6. Invariant state introspection

SLIStack maintains an invariant of state introspection. In each state that allows input, the user must be able to obtain, through SLI, an explanation of the current state and the actions available from that state.

The specific phrasing a user uses to ask for this explanation depends on the language design. The invariant does not. For any state that accepts input, SLI must support a context introspection action that:

identifies the current state in terms that matter to the user; and
describes the actions available from that state.

For a user working through SLI, icons and buttons on the screen may not be part of the user’s reality. What matters is which state the system is in and what outgoing transitions are available from that state. Describing “a button you could click” borrows a metaphor from a visual world and adds an extra cognitive layer. The answer to “what can I do here?” should therefore be expressed in terms of SLI (actions and outcomes in the current state) rather than in terms of how those actions happen to be drawn on the screen.

This requirement follows from the differences between visual and spoken interaction. A sighted user can usually see the state of the application and the available options: visible windows, menus, buttons, and focus indicators. A user interacting through SLI cannot inspect the screen in this way, so the system must supply the same information on request.
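A context introspection action could be sketched as follows. The State type, the example state, and the output wording are hypothetical. The point of the sketch is that the answer is built from the state and its outgoing actions, with no reference to on-screen controls.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class State:
    name: str                # internal identifier
    description: str         # the state in terms that matter to the user
    actions: Dict[str, str]  # action name -> outcome of taking it


def describe_context(state: State) -> str:
    """Answer "where am I and what can I do here?" in terms of actions."""
    if not state.actions:
        options = "No actions are available from here."
    else:
        listed = "; ".join(
            f"{name} ({outcome})" for name, outcome in state.actions.items()
        )
        options = f"Available actions: {listed}."
    return f"{state.description} {options}"


save_prompt = State(
    name="save_prompt",
    description="You are being asked whether to save changes to the document.",
    actions={
        "confirm": "save the changes and close",
        "discard": "close without saving",
        "cancel": "return to editing",
    },
)

text = describe_context(save_prompt)
```

The resulting text names each action and its outcome, and never mentions buttons, dialogs, or screen positions.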

7. Spoken interruption parity

SLIStack also maintains spoken interruption parity. If a state can be interrupted, cancelled, escaped, or backed out of through another input path, then an equivalent spoken action must be available through SLI.

This rule is analogous to the invariant state introspection rule. In many applications, keyboard interaction offers a reliable way to interrupt or cancel actions (for example, the Escape key).

In SLIStack, spoken interaction should offer the same level of control. A user should be able to interrupt a long-running operation, exit a nested state, or back out of a prompt using SLI, not only through a keyboard or mouse.
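Interruption parity lends itself to an automated check. The sketch below assumes a hypothetical table that records, for each state, which input paths are bound to each interrupting action. The state names, action names, and path labels are illustrative only.

```python
from typing import Dict, List, Set

# Hypothetical data: state -> interrupting action -> input paths bound to it.
INTERRUPT_BINDINGS: Dict[str, Dict[str, Set[str]]] = {
    "long_export": {"cancel": {"keyboard", "spoken"}},
    "rename_prompt": {"back_out": {"keyboard"}},  # no spoken equivalent
    "idle": {},                                   # nothing to interrupt
}


def parity_violations(bindings: Dict[str, Dict[str, Set[str]]]) -> List[str]:
    """List interrupting actions reachable by another path but not by speech."""
    violations: List[str] = []
    for state, actions in sorted(bindings.items()):
        for action, paths in sorted(actions.items()):
            has_other_path = bool(paths - {"spoken"})
            if has_other_path and "spoken" not in paths:
                violations.append(f"{state}.{action}")
    return violations


violations = parity_violations(INTERRUPT_BINDINGS)
```

Here the check reports rename_prompt.back_out, because that state can be escaped from the keyboard but not through SLI.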

8. Modalities and shared structure

SLIStack is not a speech-only architecture. It is designed for multiple modalities that share a common structure. The same command actions and hooks can be bound to:

graphical controls,
text or menu-based interfaces,
keyboard shortcuts,
typed commands,
spoken commands through SLI.

This arrangement allows different users to work in different ways and allows the same user to switch between modalities as needed. It also preserves a clear distinction between actions and their presentations. The architecture is centered on the action model and its invariants, not on any single surface. This means spoken language is a first-class route through the same structure, instead of an after-the-fact description of another interface.

9. Consequences for the design

Designing Assistivox on top of SLIStack has several practical consequences.

First, it shapes how output is structured. If spoken interaction is native, then the system must structure its descriptions around semantics, content, context, and structural relationships, not only around visual layout. That affects how documents are summarized, how controls are described, and how navigation is presented.

Second, it constrains how new features are introduced. New capabilities must be added as command actions and hooks, and they must respect the invariants of state introspection and interruption parity. It should not be possible to add a feature that is visible and clickable but cannot be reached or explained through SLI.
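One way to make this constraint hard to violate is to enforce it at the point where features are registered. The sketch below is an assumption about how such a guard might look. Feature, FeatureCatalog, InvariantError, and the example actions are hypothetical names, and the guard covers only reachability and explainability.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


class InvariantError(ValueError):
    """Raised when a proposed feature would break an SLIStack invariant."""


@dataclass(frozen=True)
class Feature:
    action: str              # command action name
    spoken_description: str  # how SLI explains the action to the user
    paths: FrozenSet[str]    # input paths bound to the action's hook


class FeatureCatalog:
    """Accepts only features that can be reached and explained through SLI."""

    def __init__(self) -> None:
        self._features: Dict[str, Feature] = {}

    def add(self, feature: Feature) -> None:
        if "spoken" not in feature.paths:
            raise InvariantError(f"{feature.action}: no spoken path")
        if not (feature.paths - {"spoken"}):
            raise InvariantError(f"{feature.action}: no non-spoken path")
        if not feature.spoken_description.strip():
            raise InvariantError(f"{feature.action}: cannot be explained through SLI")
        self._features[feature.action] = feature

    def __contains__(self, action: str) -> bool:
        return action in self._features


catalog = FeatureCatalog()
catalog.add(Feature(
    "text.apply_style",
    "Apply a named style to the selection.",
    frozenset({"gui", "spoken"}),
))

# A feature that is clickable but unreachable by speech is refused.
rejected: Optional[str] = None
try:
    catalog.add(Feature("text.highlight", "", frozenset({"gui"})))
except InvariantError as exc:
    rejected = str(exc)
```

With a guard like this, a click-only feature fails at registration time instead of shipping as a gap in the spoken interface.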

Third, it provides a path for gradual evolution. Because surfaces and lexicon entries only bind to hooks, the set of command actions can grow, the lexicon can expand, and AI-assisted mapping can be added incrementally.

In this way, SLIStack serves as the design bridge between the theory of SLI and the concrete user interfaces of Assistivox. It gives a clear answer to how spoken language is central to the design, while leaving room for iteration on surfaces, language design, and models.