Two Boundaries, Two Worlds

When people say that large language models don't "really" understand the world, the critique usually points in a clear direction: they lack embodiment. Without direct experience of the physical world, these systems learn the statistical structure of text but never acquire the intuitive physics that comes from navigating a revolving door, misjudging the last stair in the dark, or crossing a border.

This critique hints at something interesting: different kinds of physical understanding exist, and LLMs master some forms while lacking others.

Consider a simple question: Where is the border between France and the UK?

A person who has travelled by train under the English Channel has one kind of understanding: they recall the immigration barrier at the station, the change in air pressure entering the tunnel, perhaps the loss of mobile signal. This is grounded in embodied interaction with what philosophers call bona fide boundaries [1], physical discontinuities that constrain action.

But someone who has never left their home country can still answer the question accurately. They've chatted about holiday trips with neighbours, seen the border on maps, read about its history. Their understanding is mediated entirely by representation; the border they know is what philosophers call a fiat boundary, a human-imposed division that exists because societies collectively treat it as real.

LLMs have access to this second kind of knowledge. But crucially, we do not say that the person who learnt geography from books "doesn't understand" where the border lies. So why do we say it about models? 

The answer lies in distinguishing physical-causal grounding, formal-structural reasoning, and linguistic-normative structure.

LLMs excel at linguistic structure. They know that borders separate states, that containers have insides, that walls block motion. But recent research shows this competence is shallower than it appears. On tasks requiring formal structural constraints such as mereology (part–whole relations), containment, adjacency, and boundary tracking, LLMs can be inconsistent even when no physical reasoning is required. 

Results on benchmarks like ConceptARC [2] (for instance, its move-to-boundary and extend-to-boundary tasks) show systematic failures in maintaining stable ontological relations across multi-step reasoning. (I should note, however, that I could find no benchmark that specifically tests the fiat/bona fide boundary distinction central to this discussion.) This reveals a nuance: LLMs have strong lexical structural knowledge but inconsistent formal structural knowledge. They can describe structures but cannot reliably maintain them.
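To make the gap between describing and maintaining structure concrete, here is a minimal sketch of the kind of consistency check one could run over part–whole assertions extracted from a model's multi-step answer. It is not taken from ConceptARC or any cited benchmark; the function names and the cup/cupboard example are invented for illustration.

```python
# Hypothetical consistency check for part-whole (mereological) assertions.
# Parthood should be transitive and antisymmetric; a set of assertions that
# forms a cycle means the model has lost track of the structure it described.

from itertools import product

def transitive_closure(pairs):
    """Return the transitive closure of a set of (part, whole) pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        new_pairs = set()
        for (a, b), (c, d) in product(closure, closure):
            if b == c and (a, d) not in closure:
                new_pairs.add((a, d))
        if new_pairs:
            closure |= new_pairs
            changed = True
    return closure

def parthood_violations(pairs):
    """Find antisymmetry violations: x is part of y and y is part of x, x != y."""
    closure = transitive_closure(pairs)
    return {(a, b) for (a, b) in closure if a != b and (b, a) in closure}

# Relations read off an imagined chain of reasoning: the final assertion
# closes a cycle, so every pair in the closure is flagged.
asserted = {("handle", "cup"), ("cup", "cupboard"), ("cupboard", "handle")}
print(parthood_violations(asserted))
```

A model with reliable formal structural knowledge would never emit a set of assertions that fails this kind of check, however the scenario is worded; the inconsistencies reported on such tasks suggest current models sometimes do.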

With bona fide boundaries, the failures become more dramatic. When a scenario is only thinly covered in text, models can make physically impossible predictions about object collisions or containment. As summarised by Melnik et al. [3], benchmarks for physical reasoning show that many current AI systems, including LLM-based approaches, do not yet exhibit robust predictive causal models that generalise reliably to novel scenarios. Even advanced video generation models like Sora exhibit ‘persistent physical-law failures’ [5]: inaccurate gravity, for example, or fluid-like substances behaving implausibly.

That said, large-scale video models are improving rapidly. Early versions showed clear violations of physics; newer ones display increasingly coherent object dynamics. They remain imperfect, and it is still uncertain whether scaling video alone will be enough to produce fully stable boundary-aware predictions, but the trend is unmistakably upward.

Humans acquire this understanding through action: pouring water to the top of a glass, reaching the end of the lane in a swimming pool. These interactions teach not just vocabulary but the causal consequences of boundaries.  LLMs don’t have access to directly grounded experiences. 

On many text-based tasks involving fiat boundaries, however, frontier models can match or even exceed non-expert humans. Legal categories, organisational charts, jurisdictional lines, diagnostic criteria - these boundaries are linguistic and normative, built from collective agreement and sustained through text. Because their logic is encoded in discourse, LLMs can internalise them [4].

Humans, however, do something LLMs do not: we participate in fiat boundaries. We feel social pressure, anticipate sanctions, and maintain expectations about what others believe. Fiat boundaries require normative commitment, an understanding that "this line matters because we all act as if it does." LLMs simulate this discourse but do not inhabit it.

If LLMs struggle with bona fide boundaries, where might progress come from? Recent industrial trends highlight the need for world models, with a pivot towards hybrid architectures combining language models with explicit physics simulators, formal reasoning engines, and embodied sensory systems [5]. A good example is V-JEPA 2 [6], which first learns a video-only world model from internet-scale footage and then adds a small action-conditioned module that lets a real robot arm plan grasping and pick-and-place trajectories in new environments.
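To give a flavour of how such an action-conditioned world model can drive planning, here is a schematic sketch in the spirit of model-predictive control over a learned latent space. The `encode` and `predict` functions are stand-ins for a pretrained video encoder and an action-conditioned predictor; nothing here reflects the actual V-JEPA 2 code or API.

```python
# Schematic latent-space planning loop (illustrative only): sample candidate
# action sequences, roll a learned world model forward, and execute the first
# action of the sequence whose predicted end state lands closest to the goal.

import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, ACTION_DIM, HORIZON, N_CANDIDATES = 8, 4, 5, 64

def encode(observation):
    """Stand-in for a pretrained video encoder: observation -> latent state."""
    return np.tanh(observation[:LATENT_DIM])

def predict(latent, action):
    """Stand-in for an action-conditioned predictor: next latent state."""
    return np.tanh(latent + 0.1 * action.sum())

def plan(current_obs, goal_obs):
    """Return the first action of the best-scoring candidate sequence."""
    z, z_goal = encode(current_obs), encode(goal_obs)
    best_cost, best_actions = np.inf, None
    for _ in range(N_CANDIDATES):
        actions = rng.normal(size=(HORIZON, ACTION_DIM))
        z_pred = z
        for a in actions:                       # roll the world model forward
            z_pred = predict(z_pred, a)
        cost = np.linalg.norm(z_pred - z_goal)  # distance to goal in latent space
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions[0]                      # execute, observe, then replan

first_action = plan(rng.normal(size=16), rng.normal(size=16))
print(first_action.shape)
```

The point of the sketch is where the knowledge lives: the boundary between gripper and object sits in the latent dynamics, not in any linguistic description, and the language model plays no part in the loop.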

This reflects a broader recognition that linguistic competence alone is not enough: when tasks demand physical or formal constraint satisfaction - robotics, self-driving simulation, engineering - pure LLMs risk breaking down. Video-trained world models, by contrast, begin to encode bona fide boundaries that can actually guide action.

We still don’t know whether current architectures will ever inhabit the same layered world of boundaries that humans do. Today’s models can navigate many fiat boundaries in text, and systems like V-JEPA 2 show how bona fide boundaries can be learned and used to guide real-world action. But these strands are only beginning to touch: a robot that understands where the cup ends and the table begins is still driven by a world model that is only loosely coupled to the language model that talks about it. In other words, they are trained as separate systems rather than as a single agent that jointly reasons about action and language.

The next generation of ‘omni-models’ is a bet that linguistic structure and visual regularities together will capture both bona fide and fiat boundaries. How to integrate these modalities remains, for now, an open question.

References 

[1] Smith, B., & Varzi, A. C. (2000). Fiat and bona fide boundaries. Philosophy and Phenomenological Research, 60(2), 401-420.

[2] Mitchell, M., Palmarini, A. B., & Moskvichev, A. (2023). Comparing humans, GPT-4, and GPT-4V on abstraction and reasoning tasks. arXiv preprint arXiv:2311.09247.

[3] Melnik, A., Schiewer, R., Lange, M., et al. (2023). Benchmarks for physical reasoning AI. arXiv preprint arXiv:2312.10728.

[4] Katz, D. M., Bommarito, M. J., Gao, S., & Arredondo, P. (2024). GPT-4 passes the bar exam. Philosophical Transactions of the Royal Society A, 382(2270), 20230254.

[5] Ding, J., Zhang, Y., Shang, Y., et al. (2025). Understanding world or predicting future? A comprehensive survey of world models. ACM Computing Surveys, 58(3), 1-38.

[6] Assran, M., Bardes, A., Fan, D., et al. (2025). V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985.

Acknowledgements

Many thanks to Ehsan Shareghi and Fangyu Liu for their helpful comments on an earlier version of this article.
