HopalongPetrovski
I'm Spartacus!
Yikes! Yes - it's above my pay grade and very heavy going.
In speech recognition, the speech must in some cases first be converted to text, although it is also possible to work with phonemes.
The processor needs to understand the nature of the words:
Nouns
Verbs
Adjectives
Adverbs
Prepositions
Conjunctions
Articles
... then to comprehend the context.
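As a rough illustration of that word-category step (my own toy example, not anything from an actual ASR pipeline), NLTK's off-the-shelf tagger labels each token with a part of speech. The classic "Time flies like an arrow" shows why context matters - "flies" could be a noun or a verb:

```python
import nltk

# One-off downloads of the tokenizer and tagger models.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("Time flies like an arrow")
print(nltk.pos_tag(tokens))  # each token gets a Penn Treebank tag (noun, verb, ...)
```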
One problem is discovering how far back you need to go to find the correct context.
In written language, a lot of context would be found in a single sentence, and a paragraph would capture a lot more. In a book, it may be necessary to recall a whole chapter to descry the context.
In normal speech, the context should be close at hand (or ear), unless it is a familiar term.
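A crude way to picture the problem (purely my own sketch): a fixed look-back window either reaches the word that settles the context or it doesn't, and you never know in advance how wide it needs to be.

```python
def context_window(tokens, i, width):
    """Return the `width` tokens preceding position i - a fixed look-back."""
    return tokens[max(0, i - width):i]

tokens = "the keys to the cabinet in the hallway are missing".split()
i = tokens.index("are")

# Agreeing "are" with "keys" needs a look-back of at least 7 tokens here;
# a window of 3 only sees "in the hallway" and loses the real subject.
print(context_window(tokens, i, 3))   # ['in', 'the', 'hallway']
print(context_window(tokens, i, 7))   # ['keys', 'to', 'the', 'cabinet', 'in', 'the', 'hallway']
```

Attention mechanisms sidestep the fixed width by letting the model score every earlier position and decide for itself what matters.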
LSTMs, Transformers, and Attention, not to mention LLMs, have come along in quick succession.
This 2021 paper gives an inkling of the complexity:
Thank you for Attention: A survey on Attention-based Artificial Neural Networks for Automatic Speech Recognition
Priyabrata Karmakar, Shyh Wei Teng, Guojun Lu
https://arxiv.org/pdf/2102.07259.pdf
There have been several attempts to find the optimal Attention mechanism (a toy sketch of the first two entries follows the table):
TABLE I: Different types of attention mechanism for ASR

Global/Soft [10]: At each decoder time step, all encoder hidden states are attended.
Local/Hard [23]: At each decoder time step, a set of encoder hidden states (within a window) are attended.
Content-based [24]: Attention calculated only using the content information of the encoder hidden states.
Location-based [25]: Attention calculation depends only on the decoder states and not on the encoder hidden states.
Hybrid [11]: Attention calculated using both content and location information.
Self [20]: Attention calculated over different positions (or tokens) of a sequence itself.
2D [26]: Attention calculated over both time and frequency domains.
Hard monotonic [27]: At each decoder time step, only one encoder hidden state is attended.
Monotonic chunkwise [28]: At each decoder time step, a chunk of encoder states (prior to and including the hidden state identified by the hard monotonic attention) are attended.
Adaptive monotonic chunkwise [29]: At each decoder time step, the chunk of encoder hidden states to be attended is computed adaptively.
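To make the first two entries concrete, here is a toy NumPy sketch of a single decoder step (my own illustration, not code from the paper): global/soft attention weighs every encoder hidden state, while a local/windowed variant masks everything outside a small window.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, K, V, mask=None):
    """Scaled dot-product attention for one decoder step:
    score each encoder hidden state against the query, then mix."""
    scores = K @ q / np.sqrt(q.shape[-1])      # one score per encoder state
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked states get ~zero weight
    w = softmax(scores)
    return w @ V, w

T, d = 10, 8                       # 10 encoder time steps, 8-dim hidden states
rng = np.random.default_rng(0)
K = V = rng.normal(size=(T, d))    # encoder hidden states
q = rng.normal(size=d)             # current decoder state

# Global/soft attention: every encoder state is visible.
_, w_global = attention(q, K, V)

# Local/windowed attention: only steps 3..7 are visible.
window = np.zeros(T, dtype=bool)
window[3:8] = True
_, w_local = attention(q, K, V, mask=window)

print(np.round(w_global, 2))       # weights spread over all 10 steps
print(np.round(w_local, 2))        # weights confined to steps 3..7
```

The difference between the two rows is just that mask; the monotonic and chunkwise variants further constrain where the window is allowed to sit.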
This is a diagram of the configuration of a transformer-based encoder/decoder:
[Attachment 57212: transformer encoder/decoder architecture diagram]
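For scale, a bare-bones version of that encoder/decoder stack can be spun up with PyTorch's built-in nn.Transformer. The dimensions below are arbitrary placeholders, not anything from the paper; a real ASR model would feed acoustic features (e.g. log-mel frames) to the encoder and text tokens to the decoder.

```python
import torch
import torch.nn as nn

d_model = 256                                   # placeholder feature width
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.randn(1, 100, d_model)  # 100 encoder frames (the "audio" side)
tgt = torch.randn(1, 20, d_model)   # 20 decoder steps (the "text" side)

out = model(src, tgt)               # decoder cross-attends to the encoder output
print(out.shape)                    # torch.Size([1, 20, 256])
```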
Unfortunately, the paper does not cover TeNNs.
It's getting to where there's only one chair left. Who will find a seat when the music stops?
I'd better rephrase my previous admission to "I think I have a slight grasp of the basic concept as an abstract ideation" and thank God herself I am pretty.