Recurrent Neural Networks
Time-Series, Document, and Audio Processing
Sequential Data
Sequential data is data obtained in an ordered series:
\[ X_{(0)}\rightarrow X_{(1)}\rightarrow X_{(2)}\rightarrow X_{(3)}\rightarrow \cdots\rightarrow X_{(J-1)}\rightarrow X_{(J)} \]
A stochastic process is a collection of random variables indexed by a parameter, here the position \(j\) in the sequence. Sequential data can be thought of as a realization of a stochastic process.
The generation of a variable \(X_{(j)}\) may or may not depend on the previous values.
Examples of sequential data include:
Documents and Books
Temperature
Stock Prices
Speech/Recordings
Handwriting
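As a minimal illustration of dependence on previous values, the base R sketch below simulates a first-order autoregressive (AR(1)) process, in which each \(X_{(j)}\) depends linearly on \(X_{(j-1)}\) (the coefficient 0.8 and the series length are arbitrary choices):
# simulate an AR(1) process: each value depends on the previous one
set.seed(1)
J <- 100
x <- numeric(J)
for (j in 2:J) x[j] <- 0.8 * x[j - 1] + rnorm(1)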
Recurrent Neural Networks
Recurrent neural networks (RNNs) are designed to analyze sequential input data.
An RNN accounts for the position of each data point in the sequence as well as its distance from the other data points.
Using the data sequence, we can predict an outcome \(Y\).
\[ \boldsymbol X = (\boldsymbol x_0, \boldsymbol x_1, \boldsymbol x_2, \cdots, \boldsymbol x_{J-1}, \boldsymbol x_J) \] where
\[ \boldsymbol x_{j} = (x_{j1}, x_{j2}, \cdots, x_{jK}) \]
\[ h_{j} = f(\boldsymbol\beta_{hx}\boldsymbol x_{j} + \boldsymbol\beta_{hh}h_{j-1} + b_h) \]
\[ y_{j} = g(\boldsymbol\beta_{hy}h_{j} + b_y) \]
where \(f\) and \(g\) are activation functions, \(h_j\) is the hidden state at position \(j\), and the weights \(\boldsymbol\beta_{hx}\), \(\boldsymbol\beta_{hh}\), and \(\boldsymbol\beta_{hy}\) are shared across all positions.
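As a minimal sketch of this recurrence in base R (the dimensions, names such as beta_hx, and the choices \(f = \tanh\) and \(g = \) sigmoid are illustrative assumptions):
set.seed(1)
K <- 4; H <- 3; J <- 10
x <- matrix(rnorm(J * K), nrow = J)    # J inputs, each with K features
beta_hx <- matrix(rnorm(H * K), H, K)  # input-to-hidden weights
beta_hh <- matrix(rnorm(H * H), H, H)  # hidden-to-hidden weights (shared across j)
beta_hy <- matrix(rnorm(H), 1, H)      # hidden-to-output weights
b_h <- rnorm(H); b_y <- rnorm(1)
h <- rep(0, H)                         # initial hidden state
for (j in 1:J) {
  h <- tanh(beta_hx %*% x[j, ] + beta_hh %*% h + b_h)  # f = tanh
}
y <- plogis(beta_hy %*% h + b_y)       # g = sigmoid, for a binary outcome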
Document Classification
Document classification is the process of assigning documents to different categories.
A corpus refers to a large and structured collection of texts, typically stored electronically, and used for linguistic analysis, natural language processing (NLP), and other computational linguistics tasks.
The texts in a corpus can vary widely in nature, ranging from written documents, such as books, articles, and websites, to transcribed speech and social media posts.
One-hot encoding converts categorical variables into a binary format where each category is represented by a binary vector. In this representation, each category corresponds to a unique index in the vector, and only one element in the vector is 1 (hot), while all other elements are 0 (cold).
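A minimal base R sketch (the categories are made up; model.matrix builds one indicator column per factor level):
# one-hot encode a categorical variable: one column per level,
# exactly one 1 per row
categories <- factor(c("cat", "dog", "bird", "dog"))
model.matrix(~ categories - 1)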
The Bag of Words model is a simple and fundamental technique in natural language processing (NLP). It represents text data as a collection of words, disregarding grammar and word order. Each document is represented by a vector where each dimension corresponds to a unique word in the entire corpus, and the value indicates the frequency of that word in the document. This model is widely used for tasks like text classification and sentiment analysis.
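For example, a term-frequency (bag-of-words) matrix for a toy two-document corpus can be built in base R (the documents are made up for illustration):
docs <- c("the cat sat on the mat", "the dog sat")
tokens <- strsplit(docs, " ")
vocab <- sort(unique(unlist(tokens)))
# one row per document, one column per vocabulary word, entries are counts
bow <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
bow
#>      cat dog mat on sat the
#> [1,]   1   0   1  1   1   2
#> [2,]   0   1   0  0   1   1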
A recurrent neural network can be used to account for the sequential order of the words.
Time Series
A time series is a sequence of data points collected or recorded at successive, evenly spaced intervals of time. These data points can represent any variable that is observed or measured over time, such as temperature readings, stock prices, sales figures, or sensor data.
\[ X_{(0)}\rightarrow X_{(1)}\rightarrow X_{(2)}\rightarrow X_{(3)}\rightarrow \cdots\rightarrow X_{(J-1)}\rightarrow X_{(J)} \]
A recurrent neural network can be used to account for the sequential order of the measurements.
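Before fitting such a model, the series is typically reshaped into lagged inputs. A minimal base R sketch using the built-in LakeHuron series (the lag length L = 5 is an arbitrary choice):
series <- as.numeric(LakeHuron)
L <- 5
lagged <- embed(series, L + 1)  # row t holds x_t, x_{t-1}, ..., x_{t-L}
x <- lagged[, (L + 1):2]        # predictors in time order: x_{t-L}, ..., x_{t-1}
y <- lagged[, 1]                # target: x_t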
R Code: Document Classification
library(torch)
library(torchdatasets)  # provides imdb_dataset()
library(luz)            # provides setup() and fit()
library(zeallot)        # provides the %<-% multi-assignment operator used below
# initial load of the IMDB reviews, keeping the 10,000 most frequent words
# (assumed here; it mirrors the call used later with maxlen)
imdb_train <- imdb_dataset(root = ".", split = "train", num_words = 10000)
imdb_test <- imdb_dataset(root = ".", split = "test", num_words = 10000)
word_index <- imdb_train$vocabulary
# map a sequence of token indices back to words; the first three indices are
# reserved for the padding, start, and out-of-vocabulary tokens
decode_review <- function(text, word_index) {
  word <- c("<PAD>", "<START>", "<UNK>", names(word_index))
  paste(word[text], collapse = " ")
}
decode_review(imdb_train[1]$x[1:12], word_index)
#> [1] "<START> you know you are in trouble watching a comedy when the"
library(Matrix)
# build a sparse document-by-token indicator matrix:
# entry (i, j) is set if token j appears in document i
one_hot <- function(sequences, dimension) {
  seqlen <- sapply(sequences, length)
  n <- length(seqlen)
  rowind <- rep(1:n, seqlen)
  colind <- unlist(sequences)
  sparseMatrix(i = rowind, j = colind,
               dims = c(n, dimension))
}
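A quick toy check of the helper (two made-up sequences over a vocabulary of 5 tokens):
one_hot(list(c(1, 3), c(2, 2, 5)), 5)  # a 2 x 5 sparse pattern matrix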
# collect all values into lists of sequences (x) and labels (y)
train <- seq_along(imdb_train) |>
  lapply(function(i) imdb_train[i]) |>
  purrr::transpose()
test <- seq_along(imdb_test) |>
  lapply(function(i) imdb_test[i]) |>
  purrr::transpose()
# num_words + padding + start + oov token = 10000 + 3
x_train_1h <- one_hot(train$x, 10000 + 3)
x_test_1h <- one_hot(test$x, 10000 + 3)
dim(x_train_1h)
#> [1] 25000 10003
# fraction of nonzero entries (Matrix::nnzero): the matrix is very sparse
nnzero(x_train_1h) / (25000 * 10003)
#> [1] 0.01316756
# two-hidden-layer feed-forward network on the one-hot features
model <- nn_module(
  initialize = function(input_size = 10000 + 3) {
    self$dense1 <- nn_linear(input_size, 16)
    self$relu <- nn_relu()
    self$dense2 <- nn_linear(16, 16)
    self$output <- nn_linear(16, 1)
  },
  forward = function(x) {
    x |>
      self$dense1() |>
      self$relu() |>
      self$dense2() |>
      self$relu() |>
      self$output() |>
      torch_flatten(start_dim = 1)
  }
)
# validation split: 2,000 reviews held out (itrain/ival were not defined
# in the excerpt; this split is an assumption)
set.seed(1)
ival <- sample(seq_along(train$y), 2000)
itrain <- setdiff(seq_along(train$y), ival)
fitted <- model |>
  setup(
    # loss, optimizer, and metric are assumptions consistent with the
    # single-logit output: binary cross-entropy trained with RMSprop
    loss = nn_bce_with_logits_loss(),
    optimizer = optim_rmsprop,
    metrics = list(luz_metric_binary_accuracy_with_logits())
  ) |>
  fit(
    # we transform the training and validation data into torch tensors
    list(
      torch_tensor(as.matrix(x_train_1h[itrain, ]), dtype = torch_float()),
      torch_tensor(unlist(train$y[itrain]))
    ),
    valid_data = list(
      torch_tensor(as.matrix(x_train_1h[ival, ]), dtype = torch_float()),
      torch_tensor(unlist(train$y[ival]))
    ),
    dataloader_options = list(batch_size = 512),
    epochs = 5
  )
maxlen <- 500
num_words <- 10000
# reload the data with a maximum review length of 500 words
imdb_train <- imdb_dataset(root = ".", split = "train",
                           num_words = num_words, maxlen = maxlen)
imdb_test <- imdb_dataset(root = ".", split = "test",
                          num_words = num_words, maxlen = maxlen)
vocab <- c(rep(NA, imdb_train$index_from - 1), imdb_train$get_vocabulary())
tail(names(vocab)[imdb_train[1]$x])
#> [1] "compensate" "you"        "the"        "rental"     ""
#> [6] "d"
# embed each token, run an LSTM, and predict from the last time step
model <- nn_module(
  initialize = function() {
    self$embedding <- nn_embedding(10000 + 3, 32)
    self$lstm <- nn_lstm(input_size = 32, hidden_size = 32, batch_first = TRUE)
    self$dense <- nn_linear(32, 1)
  },
  forward = function(x) {
    # %<-% (from zeallot) destructures the LSTM's (output, (hn, cn)) return value
    c(output, c(hn, cn)) %<-% (x |>
      self$embedding() |>
      self$lstm())
    output[, -1, ] |>  # last time step: torch tensors use Python-style negative indexing
      self$dense() |>
      torch_flatten(start_dim = 1)
  }
)
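A quick way to sanity-check the architecture is a forward pass on a dummy batch (the batch size and random token indices below are illustrative):
m <- model()
# 2 fake reviews of 500 token indices each
x_dummy <- torch_randint(1, 10000 + 3, size = c(2, 500), dtype = torch_long())
m(x_dummy)  # a tensor of 2 logits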
R Code: Time Series
# simple RNN for time-series data: 3 input features per time step
# (e.g., lagged predictors), 12 hidden units
model <- nn_module(
  initialize = function() {
    self$rnn <- nn_rnn(3, 12, batch_first = TRUE)
    self$dense <- nn_linear(12, 1)
    self$dropout <- nn_dropout(0.2)
  },
  forward = function(x) {
    # keep the RNN's output sequence; the remaining return values are
    # discarded via ...
    c(output, ...) %<-% (x |>
      self$rnn())
    output[, -1, ] |>  # last time step
      self$dropout() |>
      self$dense() |>
      torch_flatten(start_dim = 1)
  }
)
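As above, a dummy forward pass confirms the shapes (a batch of 4 sequences, 24 time steps, 3 features; all sizes are illustrative):
m <- model()
x_dummy <- torch_randn(4, 24, 3)
m(x_dummy)  # a tensor of 4 predictions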