The citadel of the brain is ripe for an assault. The functional principle that would allow us to emulate brain-like function in the computer is up for grabs.

The amount of experimental knowledge we have about the brain fills libraries, and even ten times that volume won’t force that functional principle on us. Science is an inverse process: you first invent a concept and then test it experimentally. As Einstein said, theory decides what can be measured. There is no systematic process for creating theory from data; the role of experimental facts is to single out the correct theory.
The path to the creation of an appropriate conceptual framework is currently blocked by a faulty paradigm, and it is time to overcome the feeling of paralysis in the neurosciences that has resulted from decades of stagnation. That will take a new generation with fresh courage and an unprejudiced perspective.

One promising line of attack to storm the citadel is to pick a typical function of the brain and emulate it in the computer. Vision is a prime candidate for this role. We know a lot about it, both from the outside and the inside; it is deeply entangled with the rest of brain and mind; and a model of visual function that conforms with all known neural and cognitive observations is likely to be a gateway to the brain. Let me discuss a number of insights we may gain from that angle.

Computer Graphics as Model

A lot of knowledge about the visual world is encapsulated in computer graphics. Within a mere couple of decades that field has developed the ability to create photo-realistic image sequences. This was no easy achievement, as our eye is extremely critical in discerning realism from artefact. Computer graphics had to play a kind of adversarial game, and in the process it generated an ontology of the visual world. Look at how images are created. The process starts with the generation of shapes in the form of wireframe models – curved surfaces composed of little polygons. These shapes are then draped with colored and textured surface markings and projected under motion into the scene (taking care of mutual occlusion); the scene is projected into a virtual camera; and finally it is illuminated and rendered by computing the intensity and color of light arriving at each pixel of the camera.
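To make the pipeline concrete, here is a minimal Python sketch of the forward steps just described – geometry, surface color, illumination, and projection into a virtual camera. The scene data (a single triangle, a directional light, a pinhole camera) are hypothetical toy stand-ins, not part of any real graphics system:

```python
import numpy as np

# Hypothetical toy scene: one triangle, one light, a pinhole camera.
triangle = np.array([[0.0, 0.0, 5.0],   # wireframe geometry: vertices in camera space
                     [1.0, 0.0, 5.0],
                     [0.0, 1.0, 4.0]])
albedo = np.array([0.8, 0.2, 0.2])      # draped surface color
light_dir = np.array([0.0, 0.0, -1.0])  # light shining toward the camera plane

# Shape aspect: surface normal from the triangle's edges.
n = np.cross(triangle[1] - triangle[0], triangle[2] - triangle[0])
n /= np.linalg.norm(n)

# Illumination aspect: Lambertian shading from the light angle.
brightness = max(0.0, float(n @ -light_dir))
pixel_color = albedo * brightness

# Camera aspect: pinhole projection of each vertex into the image plane.
focal = 1.0
uv = focal * triangle[:, :2] / triangle[:, 2:3]
print("projected 2D vertices:\n", uv)
print("shaded color:", pixel_color)
```

Note how each aspect – shape, surface color, illumination, projection – is computed by its own independent piece of machinery, a point taken up below.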

We learn from this that a rather compact basis (a typical graphics program in a computer game is, in order of magnitude, of gigabyte size) suffices to create a virtual infinitude of visual scenes. This is made possible by compositionality: different components (surfaces, shapes, texture elements), aspects (shape, texture, position, pose, motion, illumination, …) and camera perspectives can be freely combined, leading to combinatorial richness. Thus, arbitrary images or textures can be draped over any shape, and the resulting objects can be projected by generic geometric and kinematic machinery into the virtual scene. The essential point is the mutual independence of components and aspects, which allows them to be freely combined.
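A toy illustration of this combinatorial richness, with hypothetical aspect vocabularies standing in for the real ones:

```python
from itertools import product

# Hypothetical aspect vocabularies; each varies independently of the others.
shapes   = ["cube", "sphere", "teapot"]
textures = ["wood", "marble", "checker"]
poses    = ["upright", "tilted"]
lights   = ["noon", "dusk"]

# Free combination yields 3 * 3 * 2 * 2 = 36 distinct scenes from a
# compact basis; at realistic vocabulary sizes the product is astronomical.
scenes = list(product(shapes, textures, poses, lights))
print(len(scenes), "scenes, e.g.", scenes[0])
```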
The great power of discrimination with which our eye distinguishes artefact from realism in novel images implies that our visual system compares the actual imagery with its own reconstructions. Our visual system thus must have two branches, one analytic (acting in inverse-graphics mode, going from images to scene descriptions), the other synthetic (reconstructing the images, so to speak in graphics mode).

Vision is Compositional

The lesson to be learned from computer graphics is that visual scenes are to be understood as composed of parts, and the parts as composed of aspects; that these aspects and components are structured by their own independent laws; and that there are many ways they can be combined. Thus, the computer-graphics machinery for projecting objects into the virtual camera exists only once and can be applied to any object. Correspondingly, in order to realize vision, the machinery that achieves the inverse – projecting retinal images into the intrinsic space of object and scene representations – must be independent of the structure of objects. Similar remarks apply to the independence of other pairs of aspects.

Vision is an Inverse Problem

The proposal to solve vision by inverting computer graphics is not popular with my computer vision colleagues. While computer graphics is straightforward, vision is difficult because there is no systematic path from images to content: vision is an inverse problem. Inverse problems have an easy forward path, but going back amounts to blind search (like guessing a PIN). As much as we know about physics and optics, scene content cannot be computed from the visual input by a formula. The main reason is that sensory signals are totally insufficient to describe the actual situation out there, retinal images being not much more than shadows on a cave wall. We are only tricked by the perfect performance of our own visual system into believing that what we see is contained in the images. The material, the color, the shape, the use of objects and so on are not contained in any direct way in the input; 99 percent of the content we experience has to be brought up from memory. The sensory signals are sufficient to select, but not to construct, what we see. Vision, like science, is an inverse process.

As all visual images we get to see are novel – in the sense of not being pixel-to-pixel replicas of anything we have seen before – interpretation of the visual input amounts to reconstructing the environment as an assembly of memory fragments and projection patterns. This compositional nature of scene structure is the source of both the strength and the difficulty of vision, or more generally of perception: its strength, because the combinations of components that explain the sensory input are unique, leaving no doubt as to their veracity; its difficulty, because finding those combinations threatens to be an intractable search problem.
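One way to picture selection-not-construction is analysis-by-synthesis: rather than computing the scene from the image by formula, the system re-renders stored hypotheses and keeps the one that best explains the input. The following sketch makes no claim about the actual neural mechanism; the one-dimensional forward model and the hypothesis memory are hypothetical stand-ins:

```python
import numpy as np

def analysis_by_synthesis(image, hypotheses, render):
    """Selection, not construction: re-render each stored hypothesis
    and keep the one whose synthetic image best explains the input."""
    errors = [np.mean((render(h) - image) ** 2) for h in hypotheses]
    return hypotheses[int(np.argmin(errors))]

# Hypothetical forward model: a 1-D "image" of a bump at position pos.
xs = np.linspace(0.0, 1.0, 64)
render = lambda h: np.exp(-((xs - h["pos"]) / h["width"]) ** 2)

# A small memory of scene hypotheses and a noisy observation.
memory = [{"pos": p, "width": w} for p in (0.2, 0.5, 0.8) for w in (0.05, 0.1)]
observed = render({"pos": 0.5, "width": 0.1})
observed += 0.01 * np.random.default_rng(1).normal(size=xs.size)

print(analysis_by_synthesis(observed, memory, render))  # -> {'pos': 0.5, 'width': 0.1}
```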

The Source of Perceptual Certainty

The miracle that we can trust our perceptual reconstructions – that we can bet our life on the reliability of vision when navigating through the world – is due to the non-accidental nature of those reconstructions: provided we have rich enough evidence, in terms of image clarity and active immersion in the scene, there is only one factual explanation of the sensory input.

This level of doubt-free perception can only be reached with the help of a sufficient complement of independent functional components, each incorporating its own selection criteria. As the saying goes: if it looks like a duck, swims like a duck, and quacks like a duck, it is a duck.
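The arithmetic behind the duck test is simple: if each of k independent cues could mislead with probability p, the chance that all of them agree by accident falls off as p to the power k. The numbers here are purely illustrative:

```python
# Hypothetical numbers: four independent cues (looks, swims, quacks,
# walks like a duck), each with a 5 percent chance of misleading.
p, k = 0.05, 4
print(f"chance that all cues agree by accident: {p**k:.2e}")  # 6.25e-06
```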

Vision as Search

Thus, a combinatorial space needs to be searched: quite a number of aspects must be gotten right – including the positions and shapes of objects, their texture, coloring, illumination and motion – to create a convincing match. The intractability of search in combinatorial spaces is widely seen as the main roadblock of the classical artificial intelligence approach. How, then, can the vision process be as efficient as it is?

Although vision is an inverse problem, it is not one of the hard variety (like guessing an n-digit PIN by n totally independent choices). After some visual experience, the different aspects are no longer totally independent – information on one aspect typically entails restrictions on others. Thus, recognizing the type of a scene (as an office, for example) restricts the choice of potential objects; recognizing an object (and comparing its apparent size with its previously learned intrinsic size) restricts the range of distances; and so on. Solving the vision problem thus necessitates intimate integration of the different aspects.
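The size-distance case can be written down directly from the pinhole-camera relation distance = focal_length × intrinsic_size / apparent_size; the numbers below are illustrative assumptions:

```python
# Pinhole relation: distance = focal_length * intrinsic_size / apparent_size.
focal_mm = 17.0          # rough focal length of the human eye (assumption)
intrinsic_size_m = 1.7   # remembered height of a person
apparent_size_mm = 5.0   # measured height of the person's retinal image
distance_m = focal_mm * intrinsic_size_m / apparent_size_mm
print(f"inferred distance: {distance_m:.1f} m")  # ~5.8 m
```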

Computer vision has, in the course of its six decades of development, worked on about two dozen aspects – edge and contour finding, estimation of motion, stereo depth, shape from shading, structure from motion, segmentation, recognition of scene or object type, and so on – addressing each of these with scores of dedicated studies. The reason computer vision is still unable to emulate the vision of animals, let alone that of humans, is the lack of integration of these aspects into one coherent system. Only such integration would open the possibility of absorbing aspect-spanning patterns into memory and exploiting them during perceptual deciphering. What prevents integration of functional computer vision components into one coherent system is the disparate nature of the data structures and algorithms developed independently for those components over decades, without any regard for integration.

Aspect Integration needs a Unifying Data Structure

Integration of functional components is possible only if all of them are based on a common data structure and on a common set of basic operations, including learning. The brain’s neural code evidently conforms to that requirement, but present versions or interpretations of it fall short of what is required – the general binary code used in classical computing and computer vision is too wide a framework, giving rise to endless search, while the neurons of artificial neural networks form a framework that is too narrow.

To incorporate the structures and processes of computer graphics (and of its inversion), a data structure is required that has the power to represent the compositional nature of structural components and their various relations within and between the different aspects and coordinate frames – a data structure, moreover, that is amenable to generic processes of organization, both on the time scale of perception and on the time scale of learning. That neural code must express itself in the form of graphs or “net fragments” composed of nodes (corresponding to the meaning-bearing neurons of the brain) and links, the links being required to support compositionality.

The obvious mechanism for structuring net fragments is the well-studied process of network self-organization. That process, acting on sets of simultaneously active neurons, converges to connectivity structures that are sparse (few connections per node) and self-consistent (in the sense of alternate pathways stabilizing each other). In effect, self-organization shrink-wraps the neural data structure around the cognitive space in our mind. For the purposes of vision, prominent net fragments have the form of two-dimensional fields of neurons with short-range connections, as well as neighborhood-preserving fiber projection patterns between such fields. These net fragments are like Lego blocks out of which visual representations can be composed.
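Here is a minimal sketch of how Hebbian strengthening plus per-neuron normalization can carve sparse connectivity out of diffuse initial wiring. It is only an illustration of the principle, not a model of any particular circuit; the recurring activity patterns are a hypothetical stand-in for sensory-driven co-activation:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 40  # neurons

# Hypothetical stand-in for sensory-driven co-activation: a handful of
# recurring groups of neurons that tend to fire together.
patterns = [rng.choice(N, size=5, replace=False) for _ in range(8)]

W = rng.random((N, N)) * 0.1  # diffuse initial connectivity
np.fill_diagonal(W, 0.0)

for _ in range(2000):
    x = np.zeros(N)
    x[patterns[rng.integers(len(patterns))]] = 1.0
    W += 0.01 * np.outer(x, x)          # Hebbian strengthening of co-active pairs
    np.fill_diagonal(W, 0.0)
    W /= W.sum(axis=1, keepdims=True)   # normalization makes synapses compete

# Pruning weak links leaves a sparse net: each neuron keeps connections
# mainly to its habitual co-activation partners.
sparse = W > 2.0 / N
print("mean connections per node:", sparse.sum(axis=1).mean())
```

The normalization step is what enforces sparsity: synapses on each neuron compete for a fixed total weight, so habitual partners win and incidental connections wither.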

The Vision Process

Let’s come back to the search problem of visual perception. A valid interpretation of the sensory input is tantamount to constructing a representation of the perceived scene that explains the input, in the sense that top-down signals are consistent with the sensory signals and predict their changes. Thus, only if the visual depth of the objects in the scene has been estimated correctly will the expected and actual relative movements of the object images under ego-motion agree with each other. When the eyes open, the sensory signals at first elicit a tremendous exuberance of neural responses, owing to the great ambiguity of those signals. The task of the system is then to collapse that exuberance to a tiny subset of active neurons that happen to form a valid net. In other words, the condition for a neuron to stay active after the first volley is that it is part of a net of which all (or almost all) neurons were activated in the initial volley. If those net fragments are interpreted in analogy to Lego blocks, the problem of visual perception becomes the problem of putting together an edifice of interlocking blocks that coherently fit together with each other and with the input.
This, then, is the way in which sub-system integration solves the search problem of visual perception. Of all the possible neural firing patterns that could occur if neurons were independent, only a vanishing subset is actually supported by existing net fragments; that is, only a few sets have been activated sufficiently often in the past to have undergone network self-organization. Within aspects – within, for instance, the texture subsystem or the surface-depth-profile subsystem – only very few patterns are stored as net fragments or can be composed of such (few in comparison to all possible patterns, but still many in absolute terms). Likewise between aspects: of all possible cross-aspect patterns, a tiny minority is stored in the form of nets in memory. For instance, due to the rules of kinematics, only specific combinations of surface depth profile, image motion field and global object motion parameters can occur for rigid bodies; these kinematic rules are learned by infants in their early months and encoded in terms of net fragments.

The collapse of the original ambiguity of the sensory signals – of the initial volley of exuberant activity – is not the effect of a sequential search for appropriate fragments but is due to the drop-out of all neurons that lack support in terms of intra-fragment signals and don’t have the benefit of being part of a net excited by the input. Only on the basis of the fully parallel action of all its neurons and synapses can the brain afford this act of first activating an exuberance of neurons and then taking away everything that is inappropriate, thus sculpting the final interpretation like a marble sculpture hewn out of a solid block. The process is also reminiscent of the measurement process in quantum mechanics, in which a wave function that simultaneously expresses a large number of possibilities collapses under the influence of a measuring device to just one of them.
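The collapse can be pictured as a relaxation in which every neuron repeatedly checks whether enough of its partners within stored net fragments are still active, and drops out otherwise – in graph terms, a k-core-style computation. A sketch under assumed toy data; the sequential loop emulates what the brain would do in parallel:

```python
def collapse(initial_active, fragment_links, min_support=2):
    """Silence every neuron that lacks enough active partners within
    stored net fragments; repeat until the active set is stable."""
    active = set(initial_active)
    changed = True
    while changed:
        changed = False
        for n in list(active):
            support = sum(1 for m in fragment_links.get(n, ()) if m in active)
            if support < min_support:
                active.discard(n)
                changed = True
    return active

# Toy data: neurons 0-2 form a stored, mutually supporting net fragment;
# 3-4 support each other only weakly; 5 is a spurious activation.
links = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4}, 4: {3}}
print(collapse({0, 1, 2, 3, 4, 5}, links))  # -> {0, 1, 2}
```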

The Way Forward

In spite of Minsky’s optimism in the early days of AI, vision proved to be a hard problem to solve. Part of the difficulty was of course lack of processing power, an excuse that is quickly evaporating now. More fundamental barriers are deep-seated prejudices that are strangling the field. One of them is the algorithmic mode of thinking, which traditionally relies on the programmer’s intelligence instead of that in the machine. Another is the neglect of self-organization and the powerful set of a priori structural constraints that come with it. A third is total reliance on passive absorption of the statistical properties of sample input. And the final and most deadly one is the prejudice implicit in the flawed neural code of artificial neural networks, a code that is incapable of representing mental content. These prejudices are deeply rooted in subconscious layers where they cannot easily be reached by argument. A paradigm change is in order, and tremendous opportunities await a new generation of unprejudiced researchers and entrepreneurs. As vision is an integral part of the brain as a whole, there is good reason to believe that understanding vision means breaching the wall of the citadel, thus opening a totally new chapter in the history of mind.
