Geoffrey Hinton's keynote talk details the new arXiv paper "How to represent part-whole hierarchies in a neural network", published on Feb 25, 2021.
Abstract: This paper does not describe a working system. Instead, it presents a single idea about representation which allows advances made by several different groups to be combined into an imaginary system called GLOM. The advances include transformers, neural fields, contrastive representation learning, distillation and capsules. GLOM answers the question: How can a neural network with a fixed architecture parse an image into a part-whole hierarchy which has a different structure for each image? The idea is simply to use islands of identical vectors to represent the nodes in the parse tree. If GLOM can be made to work, it should significantly improve the interpretability of the representations produced by transformer-like systems when applied to vision or language.
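To make the "islands of identical vectors" idea concrete, here is a minimal sketch in plain Python. It is not code from the paper or the talk. It assumes a single row of locations at one level, each holding a non-zero embedding vector, and groups adjacent locations whose vectors are nearly identical. The function names `cosine` and `find_islands` and the 0.99 similarity threshold are hypothetical choices made only for this illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def find_islands(vectors, threshold=0.99):
    """Split a row of per-location vectors into runs of near-identical vectors.

    Returns a list of (start, end) index pairs, both inclusive. The threshold
    is an illustrative assumption, not a value taken from the paper.
    """
    if not vectors:
        return []
    islands = []
    start = 0
    for i in range(1, len(vectors)):
        # A drop in similarity between neighbours marks an island boundary.
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            islands.append((start, i - 1))
            start = i
    islands.append((start, len(vectors) - 1))
    return islands


# Toy row: three locations agree on one vector, two agree on another.
row = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(find_islands(row))  # [(0, 2), (3, 4)]
```

In this toy row, each island would stand for one node of the parse tree at that level. The sketch only reads islands off a given set of vectors and says nothing about how GLOM would form them.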
0:02:46 Three recent advances in neural networks
0:03:13 The psychological reality of the part-whole hierarchy and coordinate frames
0:03:35 The cube demonstration (Hinton 1979)
0:05:15 An arrangement of 6 rods
0:05:42 A different percept of the 6 rods
0:06:18 Alternative representations
0:06:54 A structural description of the "crown" formed by the six rods
0:07:38 A structural description of the "zig-zag"
0:08:06 A mental image of the crown
0:08:50 Why it is hard to make real neural networks learn part-whole hierarchies
0:10:28 A brief introduction to transformers
0:11:22 Standard convolutional neural network for refining word representations based on their context
0:13:16 How transformers work (roughly)
0:15:44 Neural net language modeling
0:17:36 A huge neural net that predicts the next word fragment using a big temporal context
0:17:57 Continuation by OpenAI's GPT-2 neural network with 1.5 billion weights trained on a huge amount of text from the web
0:18:53 Is 1.5 billion weights a big network?
0:19:45 A brief introduction to contrastive learning of visual representations
0:21:10 How SimCLR works
0:23:45 How good are the representations found by SimCLR?
0:24:35 A problem with contrastive learning of visual representations
0:25:39 Spatial coherence
0:27:21 Ways to represent part-whole hierarchies
0:29:21 A Biological Inspiration
0:30:36 The analogy with vision
0:31:17 Levels versus layers
0:35:26 The embedding vectors for a row of locations in a single mid-level layer of GLOM
0:37:31 Interactions between and within levels
0:41:06 How adjacent levels interact within each location
0:43:12 A problem with making an object vector the same at all locations in the object
0:44:08 A very simple example of an implicit function decoder
0:45:22 Top-down prediction of the parts of a face
0:46:13 The attention-weighted average
0:47:30 Deep end-to-end training
0:48:10 An extra term to make the bottom-up and top-down neural nets produce islands of similar predictions
0:49:09 Isn't it wasteful to replicate the object-level embedding vector for every location in an object?
0:50:09 Replicating object embeddings for every location is less expensive than you might think
0:52:00 The End