Thoughts on Common Sense, or the Lack Thereof

If common-sense reasoning is crucial for AGI, then the training focus must shift from the text-heavy methods of current LLMs to a more dynamic, embodied, and multifaceted approach. Experts suggest strategies inspired by cognitive science, robotics, and hybrid AI models to develop this capability.

Here’s a rundown of what we think might work:

  1. Multimodal, Grounded Learning:
    • Idea: Common sense isn’t just linguistic—it’s tied to sensory experience and physical interaction. Training should integrate text with vision, sound, touch, and even simulated physics.
    • How: Expose models to multimodal datasets (e.g., images paired with descriptions, videos with narration) and environments where they can “act” and observe consequences. Think of a robot learning that a glass of water spills if tipped, not just reading about it.
    • Who’s Pushing It: Yann LeCun advocates for self-supervised learning across modalities, as in his JEPA (Joint-Embedding Predictive Architecture) proposal, where models predict outcomes in an abstract representation space rather than raw pixels. DeepMind’s MuZero, which combines reinforcement learning with a learned model of its environment for planning, also fits here.
  2. Embodied Interaction via Simulation or Robotics:
    • Idea: Humans develop common sense by interacting with the world. AI needs a “body” (virtual or physical) to explore and learn cause-and-effect intuitively.
    • How: Train models in high-fidelity simulations (e.g., AI2-THOR or NVIDIA’s Isaac Sim) or real-world robotic setups where they manipulate objects and face real physics. For example, learning “fire is hot” by simulating heat, or “stacked blocks can topple” by trial and error.
    • Who’s Pushing It: Researchers like Josh Tenenbaum at MIT focus on intuitive physics models, while roboticists like Pieter Abbeel at UC Berkeley explore embodied learning for practical common sense.
  3. Causal Reasoning and Structured Knowledge:
    • Idea: Common sense requires understanding why things happen, not just that they do. Training should emphasize causal relationships over statistical correlations.
    • How: Incorporate causal inference frameworks (e.g., Judea Pearl’s do-calculus) or knowledge graphs like ConceptNet into training, forcing models to reason about “if X, then Y” scenarios. Pair this with tasks that demand explanation, not just prediction.
    • Who’s Pushing It: Pearl’s camp pushes hard here, as does Gary Marcus, who argues for hybrid systems blending neural nets with symbolic reasoning to encode causality explicitly.
  4. Interactive and Social Learning:
    • Idea: Humans refine common sense through social feedback and dialogue. AI could mimic this by learning from human critiques or collaborative tasks.
    • How: Train models in interactive loops where humans correct mistakes (e.g., “No, ice melts in heat”) or pose “why” questions. Alternatively, use multi-agent systems where AIs negotiate and learn norms from each other.
    • Who’s Pushing It: Anthropic’s work on learning from human and AI feedback (founded by ex-OpenAI researchers like Dario Amodei) leans this way, as does some of Google’s research on dialogue systems like LaMDA, though more for fluency than reasoning.
  5. Curriculum Learning with Progressive Complexity:
    • Idea: Start simple, then build up. Common sense emerges from mastering basic concepts (e.g., object permanence) before tackling abstract ones (e.g., social etiquette).
    • How: Design a staged training program—first physics-based tasks (e.g., “things fall”), then relational reasoning (e.g., “birds fly”), then social scenarios (e.g., “people get mad if you’re late”). Reinforce with rewards for generalizing across contexts.
    • Who’s Pushing It: This echoes developmental psychology influences, seen in work from Susan Carey’s collaborators at Harvard or DeepMind’s efforts to mimic human learning trajectories.
  6. Scale Plus Smarts:
    • Idea: Some argue common sense might still emerge from scale if paired with better architectures or tasks, not just more data.
    • How: Keep scaling LLMs but tweak them with attention mechanisms for causality, memory modules for context retention, or fine-tuning on common-sense benchmarks like the Winograd Schema Challenge or CommonsenseQA.
    • Who’s Pushing It: The scaling optimists (e.g., OpenAI’s Ilya Sutskever historically) suggest this, though it’s less about a new program and more about refining the old one.
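The joint-embedding idea behind LeCun’s JEPA (item 1) can be caricatured in a few lines: encode two paired views or modalities, then train a predictor to map one embedding to the other, rather than reconstructing raw inputs. Everything below is a toy stand-in, not the real architecture — the “modalities” are just random vectors related by an unknown linear map, and the predictor is plain gradient descent on squared error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy paired "modalities": y is a fixed (unknown) linear function of x,
# standing in for e.g. vision features and matching caption features.
W_true = rng.normal(size=(4, 8))
x = rng.normal(size=(256, 8))   # modality A (e.g. vision embedding)
y = x @ W_true.T                # modality B (e.g. text embedding)

# JEPA-style idea: predict the *embedding* of B from the embedding of A,
# never reconstructing raw inputs. Here the predictor is a linear map
# trained by gradient descent on squared error in embedding space.
P = np.zeros((4, 8))
for step in range(500):
    pred = x @ P.T
    grad = 2 * (pred - y).T @ x / len(x)
    P -= 0.05 * grad

mse = float(np.mean((x @ P.T - y) ** 2))
print(f"embedding-prediction error after training: {mse:.6f}")
```

The point of predicting in embedding space rather than pixel space is that the model is free to ignore unpredictable detail; the real JEPA proposal adds learned encoders and mechanisms to prevent representational collapse, which this sketch omits.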
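The observation-versus-intervention distinction at the heart of Pearl’s do-calculus (item 3) also fits in a few lines. In this hypothetical sprinkler-style model, conditioning on wet grass and intervening to make the grass wet give different answers about rain — exactly the gap that purely correlational training misses:

```python
import random

random.seed(0)

def sample(do_wet=None):
    """One draw from a toy structural causal model: rain or a sprinkler
    makes the grass wet. Passing do_wet overrides wet's own mechanism,
    i.e. an intervention do(wet=...)."""
    rain = random.random() < 0.3
    sprinkler = random.random() < 0.2
    wet = (rain or sprinkler) if do_wet is None else do_wet
    return rain, wet

N = 100_000

# Observational: P(rain | wet) -- seeing wet grass is evidence of rain.
obs = [sample() for _ in range(N)]
p_rain_given_wet = sum(r for r, w in obs if w) / sum(1 for _, w in obs if w)

# Interventional: P(rain | do(wet=True)) -- hosing the lawn tells you
# nothing about the weather, so rain stays at its base rate.
ivs = [sample(do_wet=True) for _ in range(N)]
p_rain_do_wet = sum(r for r, _ in ivs) / N

print(f"P(rain | wet)      ~ {p_rain_given_wet:.2f}")  # roughly 0.68
print(f"P(rain | do(wet))  ~ {p_rain_do_wet:.2f}")     # roughly 0.30
```

A model trained only on observational co-occurrence learns the first number; answering “what if I make the grass wet?” requires the second, which is why causal structure has to be represented, not just correlations.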

Consensus and Challenges

Most researchers agree that pure text-based training won’t cut it—common sense needs a world model, not just a word model. The debate is over how to get there: embodied learning is resource-intensive (robots and sims aren’t cheap), causal models are hard to scale, and multimodal datasets are messy. A hybrid approach—combining scale, simulation, and structured reasoning—seems promising to researchers like Demis Hassabis at DeepMind, who blend neuroscience insights with AI. The catch? No one’s cracked it yet, and metrics for “common sense” are fuzzy (e.g., passing tests vs. truly understanding).

The training program, then, might look like a mix: start with a simulated sandbox for physics and causality, layer in human feedback and multimodal data, and refine with tasks that demand generalization. It’s less about one magic dataset and more about a learning process mimicking how kids figure out the world—messy, active, and iterative.
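The staged program sketched above can be caricatured as a curriculum loop: train on one stage’s tasks until a competence threshold is reached, then unlock the next. Everything here — the stage names, the threshold, and the toy “learner” whose skill rises with practice — is a made-up illustration of the control flow, not a real training setup:

```python
import random

random.seed(1)

# Hypothetical curriculum: concrete physics first, abstract social norms last.
CURRICULUM = ["object permanence", "intuitive physics",
              "relational reasoning", "social norms"]
THRESHOLD = 0.9  # required eval accuracy before the next stage unlocks

class ToyLearner:
    """Stand-in for a model: per-stage skill rises with each practice batch."""
    def __init__(self):
        self.skill = {stage: 0.0 for stage in CURRICULUM}

    def practice(self, stage):
        # Diminishing returns: each batch closes half the gap to mastery.
        self.skill[stage] += 0.5 * (1.0 - self.skill[stage])

    def evaluate(self, stage, n=1000):
        # Simulated eval: each item is answered correctly with prob = skill.
        p = self.skill[stage]
        return sum(random.random() < p for _ in range(n)) / n

learner = ToyLearner()
log = []
for stage in CURRICULUM:                      # stages unlock strictly in order
    batches = 0
    while learner.evaluate(stage) < THRESHOLD:
        learner.practice(stage)
        batches += 1
    log.append((stage, batches))

for stage, batches in log:
    print(f"{stage:20s} mastered after {batches} practice batches")
```

A real version would replace the toy learner with an actual model and the simulated eval with held-out tasks, and would reward generalization across stages rather than per-stage accuracy alone — but the gating logic (no social reasoning until the physics prerequisites pass) is the core of the curriculum idea.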

