www.qualcomm.com
(Remember the old one-shot demo, where AKD1000 could recognise tigers after seeing a single 2D image of a tiger?)
Accurate depth estimation across different modalities
Depth estimation and 3D reconstruction are the perception tasks of creating 3D models of scenes and objects from 2D images. Our research leverages input configurations including a single image, stereo images, and 3D point clouds. We’ve developed SOTA supervised and self-supervised learning methods for monocular and stereo images, using transformer models that are both highly efficient and very accurate. Beyond the model architecture, our full-stack optimization includes neural architecture search with DONNA (Distilling Optimal Neural Networks Architectures) and quantization with the AI Model Efficiency Toolkit (AIMET). As a result, we demonstrated the world’s first real-time monocular depth estimation on a phone, which can create 3D images from a single image. Watch my 3D perception webinar for more details.
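To illustrate how a per-pixel depth estimate becomes 3D geometry, here is a minimal sketch that back-projects a depth map into a point cloud using the standard pinhole camera model. The function name, the tiny depth map, and the intrinsics (fx, fy, cx, cy) are illustrative assumptions and are not taken from the research described above.

```python
def backproject_depth(depth, fx, fy, cx, cy):
    """Convert a depth map (rows of depths in metres) into (X, Y, Z) points.

    Pinhole model: X = (u - cx) * Z / fx and Y = (v - cy) * Z / fy,
    where (u, v) is the pixel column and row.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Hypothetical 2x2 depth map and camera intrinsics, for illustration only.
depth_map = [[2.0, 2.0],
             [0.0, 4.0]]
cloud = backproject_depth(depth_map, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

In practice the depth map would come from the depth estimation network and the intrinsics from camera calibration; the geometry step itself stays this simple.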
Efficient and accurate 3D object detection
3D object detection is the perception task of finding the positions and regions of individual objects. For example, the goal could be detecting the 3D bounding boxes of all vehicles and pedestrians in LiDAR data for autonomous driving. We are making efficient object detection in 3D point clouds possible.
We’ve developed an efficient transformer-based 3D object detection architecture that utilizes 2D pseudo-image features extracted in the polar space. With a smaller, faster, and lower-power model, we’ve achieved top accuracy scores in the detection of vehicles, pedestrians, and traffic signs on LiDAR 3D point clouds.
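The idea of a 2D pseudo-image in polar space can be sketched roughly as follows: each LiDAR point is binned by its range and azimuth in the ground plane, and each grid cell stores a feature. This is only a sketch under stated assumptions: the grid size, the maximum range, and the per-cell feature (a plain point count) are hypothetical, and the architecture described above learns far richer features.

```python
import math


def polar_pseudo_image(points, num_range_bins, num_angle_bins, max_range):
    """Bin (x, y, z) points into a range-by-azimuth grid of point counts."""
    grid = [[0] * num_angle_bins for _ in range(num_range_bins)]
    for x, y, _z in points:
        r = math.hypot(x, y)  # distance from the sensor in the ground plane
        if r >= max_range:
            continue  # point lies outside the area covered by the grid
        theta = math.atan2(y, x) % (2 * math.pi)  # azimuth, wrapped to be non-negative
        # Clamp indices to guard against floating-point rounding at bin edges.
        r_idx = min(int(r / max_range * num_range_bins), num_range_bins - 1)
        a_idx = min(int(theta / (2 * math.pi) * num_angle_bins), num_angle_bins - 1)
        grid[r_idx][a_idx] += 1
    return grid


# Hypothetical points: one per quadrant, one farther out, one out of range.
lidar_points = [(1, 1, 0), (-1, 1, 0), (-1, -1, 0), (1, -1, 0),
                (2, 2, 0.5), (10, 0, 0)]
pseudo_image = polar_pseudo_image(lidar_points, num_range_bins=2,
                                  num_angle_bins=4, max_range=4.0)
```

A polar grid matches the way a spinning LiDAR samples the scene, which is one plausible reason to prefer it over a Cartesian grid.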
Low latency and accurate 3D pose estimation
3D pose estimation is the perception task of finding the orientation and key-points of objects. For XR applications, accurate and low-latency hand and body pose estimation is essential for intuitive interactions with virtual objects within a virtual environment.
We’ve developed an efficient neural network architecture with dynamic refinements to reduce the model size and latency for hand pose estimation. Our models can interpret 3D human body pose and hand pose from 2D images, and our computationally scalable architecture iteratively improves key-point detection to less than 5 mm error, achieving the best average 3D error.
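A common way to report an average 3D key-point error is the mean per-joint position error: the Euclidean distance between each predicted and ground-truth key-point, averaged over all key-points. Whether this exact metric underlies the figure quoted above is an assumption; the sketch below only shows how such a number is typically computed, using made-up coordinates in millimetres.

```python
import math


def mean_joint_error(predicted, ground_truth):
    """Average Euclidean distance between matching 3D key-points."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("need two non-empty key-point lists of equal length")
    total = sum(math.dist(p, g) for p, g in zip(predicted, ground_truth))
    return total / len(predicted)


# Hypothetical key-points in millimetres, for illustration only.
predicted_mm = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
ground_truth_mm = [(3.0, 4.0, 0.0), (10.0, 0.0, 0.0)]
error_mm = mean_joint_error(predicted_mm, ground_truth_mm)
```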
3D scene understanding
3D scene understanding is the perception task of decomposing a scene into its 3D and physical components. We’ve developed the world’s first transformer-based inverse rendering for scene understanding. Our end-to-end trained pipeline estimates physically-based scene attributes from an indoor image, such as room layout, surface normals, albedo (surface diffuse reflectivity), material type, object class, and lighting. Our AI model handles global interactions between scene components better, achieving better disambiguation of shape, material, and lighting. We achieved SOTA results on all 3D perception tasks and enable high-quality AR applications such as photorealistic virtual object insertion into real scenes.
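To make the attributes above concrete, the sketch below shows the simplest forward model that inverse rendering tries to invert: Lambertian (purely diffuse) shading of one pixel from its albedo, surface normal, and a single directional light. This is a textbook simplification chosen for illustration, not the pipeline described above, which is learned end to end and handles far more complex indoor lighting. All names and values are hypothetical.

```python
import math


def lambertian_pixel(albedo, normal, light_dir, light_intensity=1.0):
    """Shade one pixel as albedo * intensity * max(0, n . l)."""
    def normalize(v):
        length = math.sqrt(sum(c * c for c in v))
        return tuple(c / length for c in v)

    n = normalize(normal)
    l = normalize(light_dir)
    # Surfaces facing away from the light receive no direct illumination.
    cos_term = max(0.0, sum(a * b for a, b in zip(n, l)))
    return tuple(channel * light_intensity * cos_term for channel in albedo)


# Hypothetical RGB albedo and a surface facing the camera along +z.
albedo_rgb = (0.8, 0.4, 0.2)
lit = lambertian_pixel(albedo_rgb, normal=(0, 0, 1), light_dir=(0, 0, 2))
unlit = lambertian_pixel(albedo_rgb, normal=(0, 0, 1), light_dir=(0, 0, -1))
```

Many different combinations of albedo, normal, and light produce the same pixel value, which is why disambiguating shape, material, and lighting from a single image is hard.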
Our method correctly estimates lighting to realistically insert objects, such as a bunny.
More 3D perception breakthroughs to come
Looking forward, our ongoing research in 3D perception is expected to produce additional breakthroughs in neural radiance fields (NeRF), 3D imitation learning, neuro-SLAM (Simultaneous Localization and Mapping), and 3D scene understanding in RF (Wi-Fi/5G). In addition, our perception research is much broader than 3D perception as we continue to drive high-impact machine learning research efforts and invent technology enablers in several areas. We are focused on enabling advanced use cases for important applications, including XR, camera, mobile, autonomous driving, IoT, and much more. The future is looking bright as more perceptive devices become available that enhance our everyday lives.