Tracking How the Event Camera is Evolving
Event camera processing is advancing and enabling a new wave of neuromorphic technology.
Sony, Prophesee, iniVation, and CelePixel are already working to commercialize event (spike-based) cameras. Even more important, however, is the task of efficiently processing the data these cameras produce so that it can be used in real-world applications. While some are using relatively conventional digital technology for this, others are working on more neuromorphic, or brain-like, approaches.
Though more conventional techniques are easier to program and implement in the short term, the neuromorphic approach has more potential for extremely low-power operation.
By processing the incoming signal before having to convert from spikes to data, the load on digital processors can be minimized. In addition, spikes can be used as a common language with sensors in other modalities, such as sound, touch or inertia. This is because when things happen in the real world, the most obvious thing that unifies them is time: When a ball hits a wall, it makes a sound, causes an impact that can be felt, deforms and changes direction. All of these cluster temporally. Real-time, spike-based processing can therefore be extremely efficient for finding these correlations and extracting meaning from them.
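To make the idea of temporal clustering concrete, here is a minimal sketch of coincidence detection between two spike streams. It is an illustration only, not a method described by any of the groups mentioned: the function name `coincident_events`, the 5-ms window, and the timestamps are all assumptions chosen for the example.

```python
from bisect import bisect_left

def coincident_events(times_a, times_b, window):
    """Pair each spike in stream A with every spike in stream B that
    falls within `window` seconds of it. Both lists must be sorted."""
    pairs = []
    for ta in times_a:
        # Index of the first B spike that is not earlier than ta - window
        i = bisect_left(times_b, ta - window)
        while i < len(times_b) and times_b[i] <= ta + window:
            pairs.append((ta, times_b[i]))
            i += 1
    return pairs

# Hypothetical timestamps in seconds: a ball hits a wall at about t = 0.50
vision_spikes = [0.10, 0.501, 0.90]
audio_spikes = [0.30, 0.503, 1.20]

matches = coincident_events(vision_spikes, audio_spikes, window=0.005)
print(matches)
```

Only the visual and audio spikes near 0.50 s fall inside the same window, so they are the only pair reported. Nothing in the comparison depends on which sensor produced which timestamp, which is the sense in which spikes can act as a common language.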
Last time, on Nov. 21, we looked at the advantage of the two-cameras-in-one approach (DAVIS cameras), which uses the same circuitry to capture both event images, which include only the pixels that change, and conventional intensity images. The problem is that these two types of images encode information in fundamentally different ways.
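The difference between the two encodings can be sketched in a few lines. The example below derives events from two intensity frames by thresholding the change in log intensity. This is a simplification for illustration: a real event camera makes this decision asynchronously in each pixel rather than by comparing frames, and the function name `events_between`, the threshold of 0.2, and the pixel values are assumptions.

```python
import math

def events_between(prev, curr, threshold=0.2):
    """Return (x, y, polarity) for every pixel whose log intensity
    changed by more than `threshold` between two frames.
    Intensities must be positive."""
    events = []
    for y, (row_prev, row_curr) in enumerate(zip(prev, curr)):
        for x, (p, c) in enumerate(zip(row_prev, row_curr)):
            delta = math.log(c) - math.log(p)
            if abs(delta) > threshold:
                # +1 for a brightening pixel, -1 for a darkening one
                events.append((x, y, 1 if delta > 0 else -1))
    return events

# Hypothetical 3x2 intensity frames: two pixels change, four do not
prev_frame = [[100, 100, 100], [100, 100, 100]]
curr_frame = [[100, 150, 100], [60, 100, 100]]

events = events_between(prev_frame, curr_frame)
print(events)
```

The intensity frame stores a value for all six pixels whether or not anything happened, while the event list holds only the two pixels that changed, along with the direction of the change.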
Common language
Researchers at Peking University in Shenzhen, China, recognized that, to optimize that multi-modal interoperability, all the signals should ideally be represented in the same way. Essentially, they wanted to create a DAVIS camera with two modes, but with both of them communicating using events. Their reasoning was both pragmatic (it makes sense from an engineering standpoint) and biologically motivated. Human vision, they point out, includes both peripheral vision, which is sensitive to movement, and foveal vision for fine details. Both of these feed into the same visual system.
The Chinese researchers recently described what they call retinomorphic sensing or super vision that provides event-based output. The output can provide both dynamic sensing like conventional event cameras and intensity sensing in the form of events. They can switch back and forth between the two modes in a way that allows them to capture the dynamics and the texture of an image in a single, compressed representation that humans and machines can easily process.
These representations include the high temporal resolution you would expect from an event camera, combined with the visual texture you would get from an ordinary image or photograph.
They have achieved this performance using a prototype that consists of two sensors: a conventional event camera (DVS) and a Vidar camera, a new event camera from the same group that can efficiently create conventional frames from spikes by aggregating over a time window. They then use a spiking neural network for more advanced processing, achieving object recognition and tracking.
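As a rough illustration of how spikes can be aggregated into a frame, the sketch below counts the spikes each pixel emits within a time window and scales the counts to 8-bit gray levels. It assumes that brighter pixels fire more often and is not the Peking University group's actual reconstruction method; the function name `frame_from_spikes`, the `(t, x, y)` spike format, and the sample data are hypothetical.

```python
def frame_from_spikes(spikes, width, height, t_start, t_end):
    """Count spikes per pixel in [t_start, t_end) and scale the counts
    to gray levels 0..255. Each spike is a (t, x, y) tuple."""
    counts = [[0] * width for _ in range(height)]
    for t, x, y in spikes:
        if t_start <= t < t_end:
            counts[y][x] += 1
    peak = max(max(row) for row in counts)
    if peak == 0:
        return counts  # no spikes in the window: an all-black frame
    # Integer scaling so the busiest pixel maps to 255
    return [[255 * c // peak for c in row] for row in counts]

# Hypothetical spike stream for a 2x2 sensor (times in seconds)
spikes = [
    (0.001, 0, 0), (0.002, 0, 0), (0.003, 0, 0), (0.004, 0, 0),
    (0.002, 1, 0), (0.004, 1, 0),
    (0.003, 0, 1),
    (0.020, 1, 1),  # falls outside the 10-ms window used below
]

frame = frame_from_spikes(spikes, width=2, height=2, t_start=0.0, t_end=0.010)
print(frame)
```

A shorter window gives better temporal resolution but fewer spikes per pixel, and therefore a noisier estimate of texture. That trade-off is one reason to keep the data as spikes until a frame is actually needed.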
The other kind of CNN
At Johns Hopkins University, Andreas Andreou and his colleagues have taken event cameras in an entirely different direction. Instead of focusing on making their cameras compatible with external post-processing, they have built the processing directly into the vision chip. They use an analog, spike-based cellular neural network (CNN) structure where nearest-neighbor pixels talk to each other. Cellular neural networks share an acronym with convolutional neural networks, but are not closely related.
In cellular CNNs, the input/output links between each pixel and its eight nearest neighbors are built directly in hardware and can be specified to perform symmetrical processing tasks (see figure). These can then be sequentially combined to produce sophisticated image-processing algorithms.
Two things make them particularly powerful. One is that the processing is fast because it is performed in the analog domain. The other is that the computations across all pixels are local. So while there is a sequence of operations to perform an elaborate task, this is a sequence of fast, low-power, parallel operations.
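The following sketch simulates, in software, the kind of local computation a cellular neural network performs. It uses the textbook cell model, in which each cell's state follows dx/dt = -x + A*y + B*u + z, with 3x3 templates A and B shared by every cell. The Johns Hopkins chip does this with analog, spike-based circuits, so this is not their implementation; the function names, the edge-extraction template values, the Euler step size, and the test image are all assumptions chosen for illustration.

```python
def saturate(x):
    """Piecewise-linear cell output, clamped to [-1, 1]."""
    return max(-1.0, min(1.0, x))

def run_cellular_nn(u, A, B, z, steps=50, dt=0.1, boundary=-1.0):
    """Euler-integrate dx/dt = -x + sum(A*y) + sum(B*u) + z on a grid.

    u is the input image (-1 = white, +1 = black). A and B are the 3x3
    feedback and feedforward templates; z is a constant bias."""
    h, w = len(u), len(u[0])

    def at(grid, r, c):
        # Cells outside the array are held at a fixed boundary value
        return grid[r][c] if 0 <= r < h and 0 <= c < w else boundary

    def weighted(grid, T, r, c):
        # Each cell looks only at itself and its eight nearest neighbors
        return sum(T[i][j] * at(grid, r + i - 1, c + j - 1)
                   for i in range(3) for j in range(3))

    # The feedforward term depends only on the input, so compute it once
    drive = [[weighted(u, B, r, c) + z for c in range(w)] for r in range(h)]
    x = [[0.0] * w for _ in range(h)]
    for _ in range(steps):
        y = [[saturate(v) for v in row] for row in x]
        x = [[x[r][c] + dt * (-x[r][c] + weighted(y, A, r, c) + drive[r][c])
              for c in range(w)] for r in range(h)]
    # Report settled outputs as 1 (black) or 0 (white)
    return [[1 if saturate(v) > 0 else 0 for v in row] for row in x]

# Illustrative edge-extraction template: a cell stays black only if it
# is black and at least one of its neighbors is white
A = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
B = [[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]
Z = -1

W, K = -1, 1  # white and black input values
image = [
    [W, W, W, W, W],
    [W, K, K, K, W],
    [W, K, K, K, W],
    [W, K, K, K, W],
    [W, W, W, W, W],
]

edges = run_cellular_nn(image, A, B, Z)
for row in edges:
    print(row)
```

With these values the filled square is reduced to its outline: the center cell, surrounded entirely by black neighbors, settles to white. Every cell applies the same small template to its own neighborhood, so in hardware all of the updates happen at once. Feeding the result into a second run with a different template is the software analog of the sequential combination described above.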
A nice feature of this work is that the chip has been implemented in three dimensions using Chartered 130nm CMOS and Tezzaron interconnection technology. Unlike many 3D systems, in this case the two tiers are not designed to work separately (e.g. processing on one layer, memory on the other, and relatively sparse interconnects between them). Instead, each pixel and its processing infrastructure are built on both tiers operating as a single unit.
Andreou and his team were part of a consortium, led by Northrop Grumman, that secured a $2 million contract last year from the Defense Advanced Research Projects Agency (DARPA). While exactly what they are doing is not public, one can speculate that the technology they are developing will have some similarities to the work they’ve published.
Shown is the 3D structure of a Cellular Neural Network cell (right) and layout (bottom left) of the Johns Hopkins University event camera with local processing.
In the dark
We know DARPA has a strong interest in this kind of neuromorphic technology. Last summer the agency announced that its Fast Event-based Neuromorphic Camera and Electronics (FENCE) program granted three contracts to develop very-low-power, low-latency search and tracking in the infrared. One of the three teams is led by Northrop Grumman.
Whether or not the FENCE project and the contract announced by Johns Hopkins University are one and the same, it is clear that event imagers are becoming increasingly sophisticated.