The recent popularity of ChatGPT has people rightfully wondering what really constitutes Artificial General Intelligence (AGI). ChatGPT is remarkable, but clearly not generally intelligent. What would it take to build an AI that is?
The only generally-intelligent systems we know of (at least, with capabilities that we would find worthwhile to emulate) are humans. Unfortunately, we don’t have a perfectly clear consensus picture of how our minds actually work!
So let’s talk a bit about consciousness, and then about how we might build an AGI with that understanding.
Humans have this thing we call “consciousness”, which we can’t seem to do a very good job of describing, or of imagining how we might possibly implement in an AGI. Nevertheless, I think we can still sensibly talk about what consciousness accomplishes, and how we might replicate it.
I guess I’ll preface the rest here by saying I’m no expert on consciousness, I just have one. Anyway, here we go.
Various thoughts, proto-thoughts, and other sense impressions are constantly flying around at all levels of our conscious and subconscious (pre-conscious) experience. The subconscious is basically a turbulent pool of available qualia that can become more fully realized if they rise to the level of consciousness. When you become aware of a thought - let’s say, imagine the smell of fresh-baked bread - that impression of the aroma first arises somewhere in the recesses of your mind, before getting progressively more “fleshed out” as it appears in your conscious experience.
What does that “fleshing out” actually mean?
I think consciousness is the primary place where “binding” happens. Binding is when multiple different qualia are “bound” together into a single impression. For example, when you imagine fresh-baked bread, you may concurrently experience the aroma itself, a mental image of a loaf, and perhaps a remembered warmth or a twinge of hunger.
When you imagine this aroma, or have any conscious experience, a collection of distinct phenomena are bound together into a single conscious experience, and this becomes part of a linearized narrative of experience, each conscious event occurring in sequence.
Consciousness, in this model, is a single locus where multiple salient events are brought together from the massively-parallel subconscious into a single-threaded bottleneck of linear experience.
There are a couple of interesting keys here that vaguely rhyme with current AI techniques, which we’ll get back to in a bit.
One thing that very clearly and objectively distinguishes consciousness from Transformer-based AIs is that we have multiple fields of experience, derived from our various senses.
Our pre-conscious experience contains a bunch of different flavours of experience, including visual impressions, sounds, physical sensations (touch, temperature, pain, and so on), smells, and tastes.
What we think of as “thoughts” are generally reducible to imagined sounds, visuals, and physical sensations.
The boundaries of these categories are sometimes a little fuzzy, and it’s possibly-sensible to include other minor senses such as proprioception and nociception as their own categories. The point, though, is that each conscious event for us is a binding of one or more phenomena from one or more of these sense fields.
Everything we’ve ever learned, as humans, we’ve learned through incoming impressions from our external senses. These get bound into impressions that appear in consciousness, and update our internal models. But… what does that mean?
If you want to spend a lot of brainpower on this part, go Google Karl Friston, who has written some immensely indecipherable but rewarding papers on this.
Basically, you can think of our brains as being driven mainly by a goal to minimize “surprisal”, which basically just means “surprise”. You want to look out at the world and have your mental model accurately reflect what you’re seeing.
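To make “surprisal” concrete: in information theory it’s just the negative log-probability of an event under your model. A minimal sketch (this is the standard formula, not anything specific to brains):

```python
import math

def surprisal(p: float) -> float:
    """Surprisal (in bits) of an event with probability p: -log2(p)."""
    return -math.log2(p)

# A fully expected observation carries no surprisal...
print(surprisal(1.0))    # 0.0
# ...while improbable observations carry a lot.
print(surprisal(0.5))    # 1.0 bit
print(surprisal(0.01))   # ~6.64 bits
```

The better your internal model predicts what you actually observe, the lower the average surprisal you experience.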
You know that feeling when you wake up from a nap on the couch, or in a hotel room? Where you’re just coming to, and you haven’t opened your eyes yet, and you aren’t sure where you are? You’re experiencing a few subtle impressions that are failing to bind in the normal, no-surprise way. Maybe your feet are elevated because they’re on the arm of the couch. The texture of the fabric isn’t right. There’s a breeze where there normally isn’t one. The ambient sound is wrong. But you woke up so you must be in your bed? This doesn’t work.
In this case, you are experiencing surprise. You attempt to bind, or integrate, these distinct phenomena into a single impression, but fail to. Your brain now begins the work of generating some other phenomena and trying to bind them with the external sense-impressions. A visual impression of the living room. Grasping for a memory of what happened before you went to sleep. Eventually enough of your brain comes back online to generate the right impression. Or maybe you open your eyes, and over the course of the next few hundred milliseconds, resolve all of that incoming light into a coherent impression of the room you’re in. But how did I get here? That dredges up other memories to bind into an even more coherent picture.
So that’s roughly how we make sense of the world around us, but how do we learn? This part is actually pretty similar to modern AI techniques, probably. When we experience surprise, which we then resolve into a clear understanding, we essentially feed back to say “hey, here’s how these impressions bind together”. This is definitely not exactly the same thing as Stochastic Gradient Descent, but it’s pretty close in broad strokes: that breeze? that fabric texture? it binds to that visual appearance, and the visual/auditory impression “living room”.
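To illustrate the broad-strokes analogy (and only the analogy - nothing here is a model of a brain), here’s a toy logistic predictor where the prediction error, the “surprise”, directly drives the weight update. This is exactly the cross-entropy gradient step used in ordinary machine learning:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def step(w, b, x, y, lr=0.5):
    """One surprise-driven update of a toy binding predictor.

    p is the predicted probability that cue x binds with outcome y=1.
    The prediction error (the "surprise") nudges the weights; this is
    the cross-entropy gradient step for a logistic model.
    """
    p = sigmoid(w * x + b)
    err = p - y
    return w - lr * err * x, b - lr * err, p

w, b = 0.0, 0.0
w, b, before = step(w, b, 1.0, 1.0)   # first prediction: 0.5, totally unsure
for _ in range(50):
    w, b, p = step(w, b, 1.0, 1.0)    # repeated confirmation of the binding
print(before, "->", p)                # surprise shrinks as p approaches 1.0
```

Repeatedly resolving the same surprise the same way makes the binding automatic - which is roughly what “learning” means in both settings.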
If you’re really familiar with Transformers, you might detect a loose analogy between our activity of binding phenomena and transformers’ “attention” feature.
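For readers who haven’t looked inside a Transformer: attention compares a query against a set of keys and uses the resulting weights to blend the corresponding values into one output - many candidate signals collapsed into a single combined impression. A bare-bones sketch of scaled dot-product attention (the real thing operates on learned projections over large batches; this is just the core arithmetic):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    The query is scored against every key; the softmaxed scores then
    "bind" the values into one blended output vector.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return weights, out

# Three candidate "phenomena" with 2-d keys; the query most resembles
# the first key, so the output is dominated by the first value.
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, out = attention([1.0, 0.0], keys, values)
print(weights, out)
```

The analogy is loose: attention weights everything a little, whereas binding as described above produces one discrete, unified impression.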
So that’s how we make sense of the world, but we also act in the world, right? We move our bodies and we say things. How does that arise out of a brain that just seeks to minimize surprisal?
First off, you have to take a little bit of a leap into a leaky abstraction here: We have goals, like staying alive, reproducing, not being in physical pain, gaining status, and so on. You have to model these as “predictions”. When we bind a bunch of phenomena together and further bind them with a goal, and they mismatch with the goal, that’s a failed binding, and we want to reduce that divergence. We then act in the world to reduce that divergence. We can also act more or less subconsciously, or out of habit, because we’ve previously learned a particular experience->action mapping.
If someone asks you “hey, how are you?”, you’ll say something like “good, you?”. This is not likely motivated by major failure to bind experience consciously, so much as it’s actuated mostly pre-consciously. But, if someone says “hey, what’s your favourite variety of apple?”, you’ll experience a whole cascade of phenomena to try to satisfy the query with the right set of actions.
Similarly, if you’re sitting awkwardly and your leg falls asleep, you notice that the sensation in your leg is different (binding failure, resolution), then compare that against predictions that your leg shouldn’t be asleep, and this cascades into the volition to move.
So we have a handful of sense fields. Each of them has an external component mediated by a physical sense organ (eyes, tongue, nose, skin, ears, …). Each also has an internal component where we can imagine or hallucinate impressions or memories of that sense.
Consciousness is the arena in which multiple phenomena from one or more sense fields are bound together into a unified conscious impression. While the pre-conscious experience is quite parallel, conscious experience is completely linear (in retrospect, at least).
Our brain’s primary activity is to learn to bind phenomena in a way that minimizes incidence of failed binding - to build models that seem to correctly integrate all the information coming in.
Action/volition is generated either subconsciously through learned habit, or by failure to bind/integrate goals with other experienced phenomena.
AI models are currently applied to particular domains, and don’t tend to span them. Most of Natural Language Processing (NLP), for example, operates on text and correlates to some features from our auditory and visual systems (text, for most people, mixes these two). Computer Vision (CV), on the other hand, operates in a broader range of analogy to our visual field, but is mostly constrained there. Speech-to-text models such as Whisper operate in analogy to a broader swath of our auditory field as well as some of our visual (in generating encoded language, which is partially visual).
It’s interesting to dig into the Whisper example, in fact, because speech-to-text is actually a good example of phenomenal binding. The model “hears” sounds, encodes them to an internal representation (analogy: eardrums, cochlea, etc.), processes them and in some sense binds them to a textual representation.
One thing worth noting is that AI systems tend to have “learning” mode and “inference” mode. In learning mode, the model is provided feedback after predictions and updates its internal state to better predict correctly next time. In inference mode, no correction is done. These stages are very clearly separated. Naturally-intelligent systems are necessarily always in a mixed learning/inference mode, and will integrate feedback to update their internal model when it’s available. AI models are generally shipped in a “frozen” state where no more feedback is integrated. This mode is not available for naturally-intelligent systems.
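The learning/frozen distinction can be shown with a deliberately trivial toy (nothing here resembles a real model; it just makes the two modes concrete):

```python
class RunningMeanModel:
    """Toy model that predicts a value as the mean of what it has seen.

    `frozen` mimics an AI system shipped in inference-only mode:
    predictions still work, but feedback no longer updates internal state.
    """
    def __init__(self):
        self.total = 0.0
        self.count = 0
        self.frozen = False

    def predict(self):
        return self.total / self.count if self.count else 0.0

    def observe(self, value):
        if self.frozen:          # inference mode: feedback is ignored
            return
        self.total += value      # learning mode: integrate the feedback
        self.count += 1

model = RunningMeanModel()
for v in [2.0, 4.0, 6.0]:
    model.observe(v)
print(model.predict())           # 4.0

model.frozen = True              # "ship" the model
model.observe(100.0)             # no effect once frozen
print(model.predict())           # still 4.0
```

A naturally-intelligent system has no equivalent of that `frozen` flag: every observation is potentially also a weight update.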
…but, it’s interesting to think again about what shape that “feedback” takes in naturally-intelligent systems. It’s not always as clear-cut as being provided explicit feedback “that sound is a blue jay” / “no it’s a crow”. In fact, for us it’s never this simple: in AI systems, the feedback is wired directly into the model-update circuitry, whereas we have to generate surprisal and resolve it by correctly rebinding a set of phenomena. We hear a sound, we predict “blue jay” (via hearing, a mental image of the blue jay, and an auditory/visual representation of “blue jay”), then someone tells us it’s a crow. We hear the word “crow”, picture the bird, bind it to the sound, and update our internal model, essentially updating our internal weights.
Note that, for us, feedback and model updating pretty much always involves binding multiple phenomena from distinct sense fields, both external and imagined. Current AIs don’t have this flexibility, and I think this is one of the big hurdles to clear to facilitate AGI.
Current AIs depend on carefully-distilled training examples that match the carefully-constructed sense fields we’ve built for them. NLP models have one sense field: text. We don’t actually have a sense field corresponding to text. It overlaps our auditory and visual fields, but only a small part of them. Reducing our rich internal representations of various phenomena to text is extremely lossy.
NLP AIs learn relationships between textual objects somewhat analogously to the way we learn relationships between our own phenomena, but the representation lacks richness.
Thinking about Whisper and speech-to-text models again: Imagine you have two audio recordings: One, a short clip from a battlefield, with explosions and gunfire in the background, and a tense voice saying “duck!”. In the second one, gently-running water is heard, some indiscernible conversation in the background, and a child excitedly yelling “duck!”. Fully understanding these two clips requires much more than five characters, but we have essentially no way to meticulously generate labels for this training data in a way that a supervised-learning model can integrate. Our internal experience isn’t reducible to anything we can communicate externally without extreme lossiness.
A human-like AGI would need multiple sense fields. It would certainly need external visual and auditory fields, as well as the capability to imagine or hallucinate impressions in them. It would need something resembling or in place of our physical sensation field, as this is where emotions are experienced, and they’re a primary driver of goal-directed behaviour. It’s likely this field could be much simpler than ours. It may have other fields. Simply integrating text without hacking it into the visual and auditory fields seems sensible. Goals would have to be represented natively in terms of the available sense fields.
Whatever sense fields a human-like AGI would have, it would need the ability to encode external phenomena as memories for later retrieval and binding. It would need to bind impressions from various fields into a coherent impression, and evaluate it for surprise. When surprise is found, it would need to update the weights used to choose the binding. When goals fail to integrate with a binding, it would need to evaluate available actions for a prediction that would cause a more satisfactory binding, and take that action.
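The loop described in that paragraph can be sketched as a toy control cycle. To be clear, every class and method here is invented for illustration - this is the shape of the idea, not a proposal for an architecture:

```python
class ToyModel:
    """Minimal stand-in for the machinery described above; every method
    here is an invented placeholder, not a real architecture."""
    threshold = 0.5

    def __init__(self):
        self.expected = {}                  # learned bindings

    def bind(self, impressions):
        return tuple(sorted(impressions))   # "binding" = grouping phenomena

    def surprise(self, binding):
        # Surprise is high for bindings we have never resolved before.
        return 0.0 if binding in self.expected else 1.0

    def update(self, binding):
        self.expected[binding] = True       # "rebind": remember the resolution

def agent_step(senses, model, goals):
    binding = model.bind(senses)
    if model.surprise(binding) > model.threshold:
        model.update(binding)               # learn from the surprise
    # Act only when the binding fails to integrate with a goal.
    return "act" if goals - set(binding) else None

model = ToyModel()
print(agent_step(["breeze", "fabric"], model, {"comfort"}))   # 'act'
print(model.surprise(model.bind(["breeze", "fabric"])))       # 0.0 now
```

The hard, unsolved parts are of course hiding inside those placeholder methods: how to encode rich multi-field impressions, how to score a binding for surprise, and how to predict which action would bind better with the goals.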
There are a lot of unsolved problems there. It’s going to be an interesting century.