Mitsubishi Electric develops Scene-Aware Interaction technology to give drivers natural, intuitive guidance

2024-01

According to foreign media reports, Mitsubishi Electric Corporation of Japan recently announced that it has developed its first technology capable of highly natural and intuitive interaction with humans. Called Scene-Aware Interaction, the technology uses a scene-aware capability to translate multimodal sensing information into natural language. It integrates Mitsubishi Electric's proprietary Maisart compact AI technology, analyzes multimodal sensing information, and interacts with people through natural language generated according to context.

The technology identifies objects in a scene from multimodal sensing information, such as images and video captured by cameras, audio recorded by microphones, and localization information measured by LiDAR. To prioritize these different categories of information, Mitsubishi Electric developed Attentional Multimodal Fusion technology, which automatically weights the most salient unimodal information and selects appropriate words to describe the scene accurately. In benchmark testing on a common test set, Attentional Multimodal Fusion using both audio and visual information achieved a Consensus-Based Image Description Evaluation (CIDEr) score 29% higher than the score obtained with visual information alone. Mitsubishi Electric combines Attentional Multimodal Fusion with scene-understanding technology and context-based natural language generation to build a powerful end-to-end Scene-Aware Interaction system that can interact highly intuitively with users in different situations.
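
To make the weighting idea concrete, the following Python sketch shows one simple way to fuse per-modality features with attention weights. It is an illustration only, not Mitsubishi Electric's implementation: the function names, the dot-product relevance score, and the toy feature vectors are all assumptions, and a real system would learn its scoring function and features from data.

```python
import math


def softmax(scores):
    """Turn raw relevance scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(features, query):
    """Fuse modality feature vectors into one vector (hypothetical helper).

    features: dict mapping modality name -> feature vector (same length each)
    query:    vector standing in for the language generator's current state
    Returns (weights per modality, fused feature vector).
    """
    names = list(features)
    # Assumption: relevance is a plain dot product between feature and query.
    scores = [sum(f * q for f, q in zip(features[n], query)) for n in names]
    weights = softmax(scores)
    dim = len(query)
    # The fused vector is the attention-weighted sum of the modality vectors.
    fused = [
        sum(w * features[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]
    return dict(zip(names, weights)), fused


# Toy example: the query is aligned with the audio feature, so audio
# receives the larger weight when the next word is chosen.
toy_features = {"video": [1.0, 0.0], "audio": [0.0, 1.0]}
toy_weights, toy_fused = attention_fuse(toy_features, query=[0.0, 2.0])
print(toy_weights, toy_fused)
```

Because the weights are recomputed for every query, the same sketch can favor video for one word and audio for the next, which is the behavior the fusion step is described as providing.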

Scene-Aware Interaction technology can be used in car navigation applications to give drivers intuitive route guidance. For example, instead of instructing the driver to "turn right in 50 meters", the system provides scene-aware guidance such as "turn right before the mailbox" or "follow the gray car turning right". In addition, the system generates a voice warning, such as "a pedestrian is crossing the road", when it predicts that the path of a nearby object will intersect the path of the vehicle. To achieve this, the system analyzes the scene to identify distinguishable visual landmarks and dynamic elements, and then uses these recognized objects and events to generate intuitive sentences for guidance.
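
The path-intersection warning described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the system's actual method: it extrapolates each tracked object at constant velocity and flags a conflict when the object comes within a safety radius of the vehicle. The `Track` class, the function names, the thresholds, and the warning wording are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Track:
    """A tracked object with position (m) and velocity (m/s). Hypothetical."""
    label: str
    x: float
    y: float
    vx: float
    vy: float


def predict_conflict(vehicle, other, horizon_s=3.0, step_s=0.1, radius_m=2.0):
    """Return the first time (s) the two tracks come within radius_m, else None.

    Assumption: both move in a straight line at constant velocity.
    """
    steps = int(round(horizon_s / step_s))
    for k in range(steps + 1):
        t = k * step_s
        dx = (vehicle.x + vehicle.vx * t) - (other.x + other.vx * t)
        dy = (vehicle.y + vehicle.vy * t) - (other.y + other.vy * t)
        if dx * dx + dy * dy <= radius_m * radius_m:
            return t
    return None


def voice_warnings(vehicle, others):
    """Build one warning sentence per object predicted to cross the path."""
    warnings = []
    for other in others:
        if predict_conflict(vehicle, other) is not None:
            warnings.append(f"Caution: {other.label} crossing the road")
    return warnings


# Toy scene: the car drives along the x-axis at 10 m/s while a pedestrian
# walks toward its lane; a parked car stays well clear of the path.
car = Track("vehicle", 0.0, 0.0, 10.0, 0.0)
scene = [
    Track("pedestrian", 20.0, -3.0, 0.0, 1.5),
    Track("parked car", 50.0, 10.0, 0.0, 0.0),
]
print(voice_warnings(car, scene))
```

A fixed-step check is used here for readability; a closed-form closest-approach calculation would work equally well for constant-velocity tracks.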

Deep neural networks have brought significant progress in object recognition, video description, natural language generation, and spoken dialogue, enabling machines to understand their surroundings better and to interact with humans more naturally and intuitively. Scene-Aware Interaction technology is expected to have a wide range of applications, including human-machine interfaces for in-vehicle infotainment systems, interaction with robots in building and factory automation systems, systems that monitor people's health and well-being, surveillance systems that explain complex scenes to humans, systems that encourage social distancing, and systems that support touchless operation of equipment in public places.