14.18 LSTM
- long short-term memory (LSTM) is an RNN architecture that is widely used in Deep Learning.
- it is a variation of the RNN that solves issues like vanishing and exploding gradients
- LSTM excels at capturing long-term dependencies, making it ideal for sequence prediction tasks
- LSTM uses feedback connections, allowing it to process entire sequences of data, not just individual data points
- it is effective for understanding and predicting patterns in sequential data like time series, text, and speech
- LSTM recurrent units try to remember all the past knowledge that the network has seen so far and to forget irrelevant data
Purpose
- it is literally made to be a better RNN:
- RNNs cannot be parallelized
- context is only computed from history
- there is no distinction between short-term and long-term memory; memory is just memory
- training is tricky
- they suffer from vanishing and exploding gradients (see the sketch after this list)
- they cannot process very long sequences effectively
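To see why plain RNN gradients vanish: during backpropagation through time, the gradient is multiplied by the recurrent Jacobian once per timestep, so it shrinks (or explodes) geometrically with sequence length. A minimal NumPy sketch with made-up weight values:

```python
import numpy as np

# Recurrent weight whose largest singular value is < 1: gradients shrink
# geometrically as they are multiplied once per timestep during BPTT.
W = np.array([[0.5, 0.1],
              [0.0, 0.4]])

grad = np.ones(2)            # gradient arriving at the last timestep
for t in range(50):          # backpropagate through 50 timesteps
    grad = W.T @ grad        # one Jacobian multiplication per step

print(np.linalg.norm(grad))  # ~1e-15: effectively zero (vanished)
```

With a weight whose largest singular value is above 1, the same loop blows up instead, which is the exploding-gradient case.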
- LSTM was developed to handle long-range dependencies better than RNNs and to solve the gradient issues. LSTMs do this by introducing activation function layers called "gates", each serving a different purpose
LSTM Architecture:
- at a high level, an LSTM works like an RNN cell, but with its internal functioning divided into three parts:
- Choosing whether information from the previous timestamp is to be remembered or forgotten
- Learning new information from the current input
- Passing the updated information to the next timestamp
- the LSTM architecture has a chain structure that contains four neural network layers and different memory blocks called cells
- the information retained by these cells is manipulated using gates
- there are three gates (sketched in code after this list):
- forget gate: determines what information is to be removed
- input gate: adds useful information to the current cell state
- output gate: controls what information is output from the memory cell and extracts useful information from the current cell state
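The gate updates fit in a few lines of code. Below is a minimal NumPy sketch of a single LSTM timestep; the parameter layout (weights W and biases b keyed by gate name) is an illustrative assumption, not a reference implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM timestep. x: input; h_prev, c_prev: previous hidden/cell state."""
    z = np.concatenate([h_prev, x])   # combine previous hidden state and input
    f = sigmoid(W["f"] @ z + b["f"])  # forget gate: what to drop from c_prev
    i = sigmoid(W["i"] @ z + b["i"])  # input gate: what new info to admit
    g = np.tanh(W["c"] @ z + b["c"])  # candidate values for the cell state
    o = sigmoid(W["o"] @ z + b["o"])  # output gate: what to expose as h
    c = f * c_prev + i * g            # updated long-term (cell) state
    h = o * np.tanh(c)                # updated short-term (hidden) state
    return h, c

# Tiny demo with random parameters: hidden size 3, input size 2.
rng = np.random.default_rng(0)
H, X = 3, 2
W = {k: rng.standard_normal((H, H + X)) for k in "fico"}
b = {k: np.zeros(H) for k in "fico"}
h, c = np.zeros(H), np.zeros(H)
for x in rng.standard_normal((5, X)):  # run 5 timesteps
    h, c = lstm_step(x, h, c, W, b)
print(h, c)
```

Note how the cell state update `c = f * c_prev + i * g` is additive rather than a repeated matrix multiplication, which is what lets gradients flow across many timesteps without vanishing.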
- Bidirectional LSTM: a variation of the standard LSTM that processes sequential data in both forward and backward directions
- it is actually made of 2 LSTMs, one that runs forward over the sequence and one that runs backward, with their outputs concatenated
- they achieve state-of-the-art performance in tasks like machine translation, speech recognition, and text summarization
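In frameworks that provide an LSTM layer this is usually a one-flag change. A minimal sketch assuming PyTorch is available (sizes are arbitrary, chosen only to show the doubled output width):

```python
import torch
import torch.nn as nn

# Bidirectional LSTM: two LSTMs run over the sequence, one forward and
# one backward, and their per-timestep outputs are concatenated.
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True,
               bidirectional=True)

x = torch.randn(4, 10, 8)  # (batch, seq_len, features)
out, (h_n, c_n) = lstm(x)
print(out.shape)           # torch.Size([4, 10, 32]) -- 2 * hidden_size
print(h_n.shape)           # torch.Size([2, 4, 16]) -- one final state per direction
```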
Applications of LSTM:
- Language modeling: learns dependencies between words to generate coherent, grammatically correct sentences
- Speech recognition: used for transcribing speech to text and interpreting spoken commands
- Time series forecasting: used to predict stock prices, weather, and energy consumption (see the toy sketch after this list)
- Anomaly detection: used for detecting fraud or network intrusions
- Recommender systems: used in tasks like suggesting movies, music, and books by learning user behavior patterns
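As a toy illustration of the time series use case, here is a sketch that fits an LSTM to predict the next value of a sine wave, assuming PyTorch is available; the sine wave, window size, and model sizes are arbitrary stand-ins for real data such as prices or load measurements:

```python
import torch
import torch.nn as nn

# Toy forecasting task: predict the next value of a sine wave from the
# previous 20 values.
t = torch.linspace(0, 60, 1200)
series = torch.sin(t)
window = 20
X = torch.stack([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]

class Forecaster(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        out, _ = self.lstm(x)         # out: (batch, seq_len, hidden)
        return self.head(out[:, -1])  # predict from the final timestep

model = Forecaster()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

X = X.unsqueeze(-1)  # (N, window, 1): one feature per timestep
y = y.unsqueeze(-1)  # (N, 1)
for epoch in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
print(loss.item())   # MSE should fall well below the signal variance (~0.5)
```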