Japan's National Institute of Advanced Industrial Science and Technology (AIST) has developed a system that automatically determines who speaks when in a small meeting, records what each participant says and produces proceedings that visualize the structure of the meeting.
The system includes a camera and a microphone to record images and sounds for analysis. There is no need to have each speaker wear a microphone. Users can easily search and view scenes by using keywords because a meeting is recorded as multimedia content with tag information indicating who said what and when.
AIST expects the system to be used, for example, to record group interviews for market research.
The new system consists of the following components.
(1) An input device combining an 8-directional microphone array and an omni-directional camera
(2) Software to localize and separate a sound source, recognize different voices and so on
(3) The "MArcBrowser" to browse the multimedia proceedings
Recording eight channels of audio signals with the microphone array, the system identifies the direction of the sound source at each point in time using a sound source localization technology. The system also detects "speech events" to determine "who said what and when" by grouping speech that arrives from the same direction (clustering) and identifying the person who is delivering it.
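The clustering step can be illustrated with a minimal Python sketch. It is not AIST's implementation: the input format (time and azimuth pairs for frames in which speech was detected), the names `detect_speech_events`, `tol_deg` and `max_gap`, and the greedy nearest-direction rule are all assumptions made for illustration. It assumes each participant stays in roughly the same direction from the array.

```python
from dataclasses import dataclass


@dataclass
class SpeechEvent:
    speaker: int   # cluster index standing in for one participant
    start: float   # seconds
    end: float     # seconds


def angular_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def detect_speech_events(frames, tol_deg=20.0, max_gap=1.0):
    """Group localized frames into speech events.

    frames: list of (time_sec, azimuth_deg), sorted by time (assumed format).
    tol_deg: how far a frame may deviate from a known speaker direction.
    max_gap: longest pause, in seconds, still counted as the same event.
    """
    centers = []   # one representative direction per speaker
    events = []
    for t, az in frames:
        # Assign the frame to the nearest known direction within tolerance.
        best = None
        for i, c in enumerate(centers):
            d = angular_diff(az, c)
            if d <= tol_deg and (best is None or d < angular_diff(az, centers[best])):
                best = i
        if best is None:
            # No known direction is close enough: treat it as a new speaker.
            centers.append(az)
            best = len(centers) - 1
        # Extend the most recent event if the same speaker continues.
        last = events[-1] if events else None
        if last is not None and last.speaker == best and t - last.end <= max_gap:
            last.end = t
        else:
            events.append(SpeechEvent(best, t, t))
    return events
```

Because the sketch only extends the most recent event, overlapping speech from two directions would be split into many short events. The actual system handles such overlap with its sound separation stage.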
At the same time, the system relates the speech events to the all-around panorama video shot with the omni-directional camera. Then it removes unnecessary sounds such as the other attendees' responses and the echo in the room overlapping the speech of the targeted person, using a sound separation technology.
Furthermore, the system derives keywords from the speech by voice recognition. It structurally organizes the content of the meeting using those keywords as tag information for searching speech events. The data that have gone through all these processes are stored as multimedia content.
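How keywords might serve as tag information can be sketched as follows. This is an illustrative assumption, not the method AIST describes: each speech event is assumed to carry a recognized transcript in a `text` field, keywords are simply the most frequent non-stopword terms, and the names `build_tag_index`, `search` and `STOPWORDS` are hypothetical.

```python
from collections import Counter, defaultdict

# A tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "we", "in", "for", "it"}


def tokenize(text):
    """Lowercase the text, strip simple punctuation and drop stopwords."""
    words = (w.strip(".,!?\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]


def build_tag_index(events, top_n=5):
    """events: list of dicts with 'speaker', 'start', 'end', 'text' (assumed).

    Returns (keywords, index), where index maps each keyword to the
    positions of the speech events whose transcript contains it.
    """
    counts = Counter()
    for ev in events:
        counts.update(tokenize(ev["text"]))
    keywords = [w for w, _ in counts.most_common(top_n)]

    index = defaultdict(list)
    for i, ev in enumerate(events):
        for w in set(tokenize(ev["text"])):
            if w in keywords:
                index[w].append(i)
    return keywords, dict(index)


def search(events, index, keyword):
    """Return (speaker, start_time) for every event tagged with keyword."""
    return [(events[i]["speaker"], events[i]["start"])
            for i in index.get(keyword.lower(), [])]
```

With an index of this kind, looking up a keyword returns who said it and when, which is the information a browser needs to jump to the matching scene.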
MArcBrowser is roughly composed of the following three windows.
(a) Panorama video shot with the omni-directional camera
(b) "Speech event map," a time-series graph in which each speaker's speech events are positioned
(c) List of extracted tag information
When browsing meeting records with MArcBrowser, window (a) always shows a video zoomed in on the face of the current speaker, while window (b) indicates which speech event is being played. Window (c) displays a list of keywords that occur frequently in the speech. Clicking a keyword shows visually how that keyword is distributed across the meeting.
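The idea of a speech event map with keyword highlighting can be approximated in text form. The sketch below is only an analogy for what MArcBrowser draws graphically: the function name `speech_event_map`, the event dictionary layout, the one-second bins and the simple substring match are all assumptions.

```python
import math


def speech_event_map(events, keyword, duration, bin_sec=1.0):
    """Return {speaker: row}, with one character per time bin.

    '.' marks silence, '-' marks speech, and '*' marks speech whose
    transcript contains the keyword (simple substring match).
    events: list of dicts with 'speaker', 'start', 'end', 'text' (assumed).
    """
    n_bins = math.ceil(duration / bin_sec)
    rows = {}
    for ev in events:
        row = rows.setdefault(ev["speaker"], ["."] * n_bins)
        mark = "*" if keyword.lower() in ev["text"].lower() else "-"
        first = int(ev["start"] // bin_sec)
        last = min(int(ev["end"] // bin_sec), n_bins - 1)
        for b in range(first, last + 1):
            # A keyword hit takes precedence over plain speech in the same bin.
            if row[b] != "*":
                row[b] = mark
    return {spk: "".join(r) for spk, r in rows.items()}


if __name__ == "__main__":
    demo = [
        {"speaker": "A", "start": 0.0, "end": 2.5, "text": "The camera is new."},
        {"speaker": "B", "start": 3.0, "end": 4.5, "text": "I agree."},
        {"speaker": "A", "start": 6.0, "end": 7.0, "text": "Price is high."},
    ]
    for speaker, row in speech_event_map(demo, "camera", duration=8).items():
        print(speaker, row)
```

Each row is one speaker's timeline, so the positions of the `*` characters show at a glance where in the meeting the keyword came up and who raised it.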
AIST will conduct a verification test using the prototype MArc system. The institute said it aims to license and commercialize the system after improving the technology in light of the test results. AIST will also continue its efforts to enhance the accuracy of voice recognition and speech search by combining the technology with its other elemental technologies.
AIST plans to present the results of this development at AIST OpenLab, which will take place at AIST Tsukuba Central from Oct 20 to 21, 2008.

Nikkei Electronics Asia magazine is available each month free of charge to engineers, managers and other qualified readers.