Terminology Reference: Vision-Language-Action (VLA) Systems
This document defines key terms used throughout Module 4 to ensure consistency across all chapters.
Core Terms
VLA (Vision-Language-Action)
An integrated system that connects vision (perception), language (cognition), and action (physical execution) in an embodied robot.
Intent
The interpreted purpose or goal extracted from a natural language command. The cognitive planner consumes the intent to generate action sequences.
Planner
An LLM-based component that translates high-level goals into executable action sequences while accounting for physical constraints. The planner is distinct from the controllers that execute actions.
Action Sequence
A series of executable steps generated by the cognitive planner to achieve a specific goal, respecting physical constraints and safety requirements.
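The intent → planner → action-sequence chain can be sketched in code. This is a minimal illustration under assumed names (`Action`, `ActionSequence`, `plan` are hypothetical, and the hard-coded rule stands in for an LLM-based planner), not a real API:

```python
from dataclasses import dataclass, field

# Hypothetical data model for illustration; not part of any real framework.
@dataclass
class Action:
    name: str                                  # e.g. "navigate", "grasp"
    params: dict = field(default_factory=dict)

@dataclass
class ActionSequence:
    goal: str            # the interpreted intent this sequence achieves
    steps: list[Action]  # ordered, executable steps

def plan(intent: str) -> ActionSequence:
    """Toy stand-in for an LLM-based planner: maps an intent to ordered steps."""
    if intent == "fetch the cup":
        steps = [
            Action("navigate", {"target": "kitchen"}),
            Action("grasp", {"object": "cup"}),
            Action("navigate", {"target": "user"}),
            Action("handover", {"object": "cup"}),
        ]
    else:
        steps = []  # unknown intents yield an empty plan
    return ActionSequence(goal=intent, steps=steps)

seq = plan("fetch the cup")
print([a.name for a in seq.steps])  # → ['navigate', 'grasp', 'navigate', 'handover']
```

Note that `plan` only produces the sequence; executing each `Action` is the controller's job, preserving the planner/controller separation defined above.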
Embodiment
The concept that AI systems behave differently when physically situated in the real world, as opposed to disembodied systems. Embodied systems must account for physical laws, sensor noise, actuator limitations, and safety considerations.
Cognitive Planning
The process by which LLMs function as high-level planners that generate task plans, rather than directly controlling robot actuators. This maintains separation between cognition and execution.
Voice-to-Action Pipeline
A processing chain that converts spoken commands into robot actions through speech recognition, intent mapping, and safety validation before any action is executed.
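The pipeline stages can be sketched as a sequence of functions. All names here are illustrative stubs (no real ASR or ROS 2 API is invoked); the point is the ordering, with the safety gate sitting between intent mapping and execution:

```python
# Minimal sketch of a voice-to-action pipeline; every function is a stub.

def transcribe(audio: bytes) -> str:
    """Stand-in for a speech-recognition (ASR) model."""
    return "move to the table"  # pretend transcription

def map_intent(text: str) -> dict:
    """Stand-in for intent mapping (rule-based here; could be LLM-based)."""
    if text.startswith("move to "):
        return {"intent": "navigate", "target": text.removeprefix("move to ")}
    return {"intent": "unknown"}

def validate(intent: dict) -> bool:
    """Safety gate: only allow-listed intents reach execution."""
    return intent["intent"] in {"navigate", "grasp", "stop"}

def execute(intent: dict) -> str:
    """Stand-in for the controller layer (e.g. a ROS 2 action client)."""
    return f"executing {intent['intent']}: {intent.get('target', '')}"

intent = map_intent(transcribe(b"\x00"))
if validate(intent):
    print(execute(intent))  # → executing navigate: the table
```

An unrecognized or disallowed command fails `validate` and never reaches `execute`, which is the "safety validation before action execution" step in the definition.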
Autonomous Humanoid Architecture
The complete system design integrating perception → cognition → action flow for independent robot operation, incorporating all VLA system components.
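The perception → cognition → action flow can be illustrated as one pass through a control loop. Class and method names below are assumptions for this sketch, with each layer stubbed out:

```python
# Illustrative one-step control loop for the perception -> cognition -> action flow.

class Humanoid:
    def perceive(self) -> dict:
        """Vision layer: return a (stubbed) snapshot of world state."""
        return {"obstacle_ahead": False, "battery": 0.8}

    def decide(self, state: dict) -> str:
        """Cognition layer: choose the next command from the perceived state."""
        if state["battery"] < 0.2:
            return "dock"
        return "stop" if state["obstacle_ahead"] else "advance"

    def act(self, command: str) -> str:
        """Action layer: hand the command to a (stubbed) controller."""
        return f"controller received: {command}"

robot = Humanoid()
state = robot.perceive()       # perception
command = robot.decide(state)  # cognition
print(robot.act(command))      # action → controller received: advance
```

In a real system each layer would be a separate component (sensor drivers, an LLM planner, ROS 2 controllers); the sketch only fixes the direction of data flow between them.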
Important Distinctions
- Planner vs. Controller: LLMs perform cognition and planning; ROS 2 performs execution
- Embodied vs. Disembodied: Physical constraints fundamentally alter how AI systems behave
- Conceptual vs. Implementation: This module emphasizes system-level understanding over technical implementation details