Hi Dr E,
I too share your confusion about Cerence.
However, there are some points which mitigate my concerns somewhat.
A few weeks ago, Mercedes started talking about MBUX not needing the "Hey Mercedes" wake-up phrase when there is only one person in the car, the corollary being that it is still needed when there are two or more people.
Another thing is that I've looked at Cerence's patents, and while they discuss the use of NNs, they do not describe or claim any NN circuitry.
As you say, Mercedes found Akida to be 5 to 10 times better than other systems for "Hey Mercedes". They also used "Hey Mercedes" as an example of what Akida could do, and appeared to make reference to plural uses of Akida.
On top of that, Mercedes also stated their desire to standardize on the chips they use. Akida is sensor agnostic.
Then there's the Valeo Scala 3 lidar due out shortly, which I think may contain Akida, leaving aside Luminar with their foveated lidar, who have stated that they expect to expand their cooperation with Mercedes from mid-decade. MB used Scala 2 to obtain Level 3 ADAS certification (sub-60 kph), while Scala 3 is rated to 160 kph.
Luminar, like Cerence, talks about using AI, but does not describe how it is constructed.
Standardizing on Akida would improve the efficiency of the MB design office, as their engineers would all be singing from the same hymn sheet in close harmony.
This patent application shows an acoustic classifier 152, a function which could be performed by Akida.
The combined classifier may then use the three classifier inputs for context discrimination, deciding whether the speaker is talking to the car or just in conversation.
US2022343906A1 FLEXIBLE-FORMAT VOICE COMMAND
[0042] As introduced above, the reasoner 150 processes both text output 115 and the audio signal 105. The audio signal 105 is processed by an acoustic classifier 152. In some implementations, this classifier is a machine learning classifier that is configured with data (i.e., from configuration data 160) that was trained on examples of system-directed and of non-system directed utterance by an offline training system 180. In some examples, the machine-learning component of the acoustic classifier 152 receives a fixed-length representation of the utterance (or at least the part of the utterance received to that point) and outputs a score (e.g., probability, log likelihood, etc.) that represents a confidence that the utterance is a command. For example, the machine-learning component can be a deep neural network. Note that such processing does not in general depend on any particular words in the input, and may instead be based on features such as duration, amplitude, or pitch variation (e.g., rising or falling pitch). In some implementations, the machine-learning component processes a sequence, for example, processing a sequence of signal processing features (e.g., corresponding to fixed-length frames) that represent time-local characteristics of the signal, such as amplitude, spectral, and/or pitch, and the machine-learning component processes the sequence to provide the output score. For example, the machine learning component can implement a convolutional or recurrent neural network.
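Just to make [0042] concrete, here's a rough Python sketch of the sort of acoustic classifier it describes: a small recurrent net over per-frame features (amplitude, pitch, etc.) that outputs a single confidence score that the utterance is a command. The model shape, feature count and all names are my own assumptions for illustration, not anything from Cerence's actual design.

```python
# Illustrative sketch only - a small GRU over per-frame acoustic features
# that scores "system-directed command" vs. "passenger conversation".
import torch
import torch.nn as nn

class AcousticCommandClassifier(nn.Module):
    def __init__(self, n_features: int = 8, hidden: int = 32):
        super().__init__()
        # GRU consumes a sequence of time-local frame features
        # (e.g., amplitude, spectral summary, pitch), per the patent text.
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)   # single logit: command vs. conversation

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, n_frames, n_features)
        _, last_hidden = self.rnn(frames)
        logit = self.head(last_hidden[-1])
        return torch.sigmoid(logit)        # confidence the utterance is system-directed

# Example: score a 2-second utterance chopped into 100 frames of 8 features each
clf = AcousticCommandClassifier()
score = clf(torch.randn(1, 100, 8))
print(float(score))                        # arbitrary value for an untrained model
```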
[0051] In situations in which the reasoner 150 determines that an utterance is a system-directed command directed to a particular assistant, it sends a reasoner output 155 to one of the assistants 140A-Z with which the system 100 is configured. As an example, assistant 140A includes a natural language understanding (NLU) 120, whose output representing the meaning or intent of the command is passed to a command processor 130, which acts on the determined meaning or intent.
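And a toy illustration of the reasoner step in [0051], assuming (my guess, not stated in the patent) that acoustic, text and context scores are simply fused with weights and a threshold before the utterance is handed to an assistant. The weights, threshold and dispatch interface are all hypothetical.

```python
# Hypothetical fusion of three classifier scores into a routing decision.
def is_system_directed(acoustic: float, text: float, context: float,
                       weights=(0.4, 0.4, 0.2), threshold: float = 0.6) -> bool:
    fused = sum(w * s for w, s in zip(weights, (acoustic, text, context)))
    return fused >= threshold

def route_utterance(utterance_text: str, scores: dict, assistants: dict) -> str:
    """Hand the utterance to an assistant only when the fused confidence clears the threshold."""
    if is_system_directed(scores["acoustic"], scores["text"], scores["context"]):
        return assistants["default"](utterance_text)
    return "ignored: treated as passenger conversation"

# Example use with a trivial stand-in assistant
assistants = {"default": lambda text: f"assistant handling: {text!r}"}
print(route_utterance("set temperature to 21 degrees",
                      {"acoustic": 0.8, "text": 0.7, "context": 0.5},
                      assistants))
```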
[0052] Various technical approaches may be used in the NLU component, including deterministic or probabilistic parsing according to a grammar provided from the configuration data 160, or machine-learning based mapping of the text output 115 to a representation of meaning, for example, using neural networks configured to classify the text output and/or identify particular words as providing variable values (e.g., “slot” values) for identified commands. The NLU component 120 may provide an indication of a general class of commands (e.g., a “skill”) or a specific command (e.g., an “intent”), as well as values of variables associated with the command. The configuration of the assistant 140A may use configuration data that is determined using a training procedure and stored with other configuration data in the configuration data storage 160.
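For [0052], a minimal sketch of the kind of neural NLU it describes: one head classifying the overall intent ("skill"/"intent") and one tagging each token with a slot value. The intents, slots and dimensions are made up purely for illustration, not taken from the patent.

```python
# Illustrative intent + slot NLU - not the patent's actual implementation.
import torch
import torch.nn as nn

INTENTS = ["set_temperature", "navigate", "play_media"]      # hypothetical "skills"/"intents"
SLOTS = ["O", "temperature_value", "destination", "media_title"]  # hypothetical slot labels

class TinyNLU(nn.Module):
    def __init__(self, vocab_size: int = 1000, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.intent_head = nn.Linear(dim, len(INTENTS))   # whole-utterance intent class
        self.slot_head = nn.Linear(dim, len(SLOTS))       # per-token slot label

    def forward(self, token_ids: torch.Tensor):
        # token_ids: (batch, n_tokens)
        emb = self.embed(token_ids)
        intent_logits = self.intent_head(emb.mean(dim=1))  # pool over tokens for intent
        slot_logits = self.slot_head(emb)                  # one label per token for slots
        return intent_logits, slot_logits

nlu = TinyNLU()
intent_logits, slot_logits = nlu(torch.randint(0, 1000, (1, 6)))
print(intent_logits.argmax(-1), slot_logits.argmax(-1))    # untrained, so arbitrary labels
```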
I'm guessing that Akida 2 could be used in NLU 120.