Large language models (LLMs) have several shortcomings when it comes to modeling reality. The primary one is that they are not grounded in physical or simulated reality. Intelligence, the discussion argues, often relies on being grounded in some form of reality, and language alone cannot fully capture that reality: language is described as an "approximate representation of our percepts and our mental models" 00:14:12.
Furthermore, LLMs lack persistent memory, the ability to reason, and the ability to plan effectively 00:10:00. They are not equipped to understand the physical world or to carry out the complex reasoning and planning that truly modeling reality requires.
Another limitation is their inability to handle visual data or intuitive physics, both of which are necessary for common-sense reasoning about physical space 00:18:30. There are efforts to extend LLMs with vision capabilities, but these are described as "hacks": the systems are not trained end-to-end on video, nor in a way that lets them understand intuitive physics 00:18:25.
To address these shortcomings, the suggestion is to use other sensory data, such as representations of images, video, or audio, alongside textual data. This integration could allow LLMs to operate in a richer conceptual space and make better-informed decisions 00:17:15.
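As a rough illustration of what using several modalities side by side can mean at the representation level, the sketch below normalizes one embedding vector per modality and concatenates them into a single joint vector. This is not the method described in the discussion, which does not specify one. The function names `l2_normalize` and `fuse_modalities` and the embedding values are hypothetical, and a real system would learn its encoders jointly rather than concatenate fixed vectors.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; return a zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def fuse_modalities(embeddings):
    """Concatenate per-modality embeddings after normalizing each one.

    `embeddings` maps a modality name to its vector. Modalities are
    visited in sorted name order so the layout of the joint vector is
    deterministic. Normalizing first keeps a modality with large raw
    values from dominating the joint vector.
    """
    fused = []
    for name in sorted(embeddings):
        fused.extend(l2_normalize(embeddings[name]))
    return fused


# Made-up embeddings standing in for the outputs of a text encoder and
# an image encoder (assumptions for illustration only).
text_vec = [0.2, 0.9, 0.1]
image_vec = [3.0, 4.0]

joint = fuse_modalities({"text": text_vec, "image": image_vec})
```

Here `joint` has five components: the two normalized image components followed by the three normalized text components. Simple concatenation of this kind is the sort of bolt-on extension that the discussion contrasts with end-to-end training, so it shows only the shape of the idea, not a solution to the grounding problem.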
Overall, the gaps in LLMs' capabilities highlight the need for a more holistic approach that incorporates diverse sensory data to better model and understand reality.