Unsupervised Machine Learning
Topic Modeling with Taylor Swift
The Missing Statistics Semester
Supervised learning uses both features and labels:
\[ \boldsymbol X = (X_1, X_2, \cdots, X_n)\mrTr \] where each observation has \(p\) features,
\[ X_i = (x_{i1}, x_{i2}, \cdots, x_{ip}), \] and labels
\[ \boldsymbol Y = (Y_1, Y_2, \cdots, Y_n)\mrTr \]
In unsupervised learning, we are given only the features:
\[ \boldsymbol X = (X_1, X_2, \cdots, X_n)\mrTr \] where
\[ X_i = (x_{i1}, x_{i2}, \cdots, x_{ip}) \]
The task: group the data into \(K\) categories.
The following techniques help us uncover structure and natural groupings in data:
Principal Component Analysis (PCA) is a dimensionality reduction technique used to simplify complex datasets while retaining as much of the original variability as possible. It accomplishes this by transforming the original variables into a new set of orthogonal variables called principal components. PCA is widely used in data analysis, visualization, and machine learning for tasks such as feature extraction, data compression, and noise reduction.
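A minimal sketch with base R's prcomp(), using the built-in iris measurements as illustrative data (not from these slides):

# PCA on the four iris measurements; centering and scaling puts
# all variables on a comparable footing before the rotation.
pca_fit <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca_fit)       # proportion of variance explained per component
head(pca_fit$x[, 1:2]) # scores on the first two principal components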
K-Means clustering is one of the most popular unsupervised machine learning algorithms used for partitioning a dataset into a predetermined number of clusters. It aims to group similar data points together and discover underlying patterns or structures within the data. K-Means is simple, efficient, and widely applicable in various domains, including data analysis, image processing, and customer segmentation.
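A minimal sketch with base R's kmeans(), again on the illustrative iris data:

set.seed(123)                       # k-means depends on random starts
km_fit <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
table(km_fit$cluster, iris$Species) # compare clusters to the known species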
Hierarchical clustering is a method used to cluster data into a hierarchy of clusters. Unlike K-Means, which requires specifying the number of clusters upfront, hierarchical clustering builds a tree-like structure (dendrogram) that reflects the relationships between data points at different levels of granularity. Hierarchical clustering can be divided into two main types: agglomerative and divisive.
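A minimal sketch of agglomerative clustering with base R's hclust():

d <- dist(scale(iris[, 1:4]))            # pairwise Euclidean distances
hc_fit <- hclust(d, method = "complete") # agglomerative, complete linkage
plot(hc_fit)                             # dendrogram
clusters <- cutree(hc_fit, k = 3)        # cut the tree into 3 groups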
Mixture models for clustering, most commonly Gaussian Mixture Models (GMMs), are probabilistic models that describe the distribution of data as a mixture of several Gaussian distributions. Unlike K-Means or hierarchical clustering, which assign data points to discrete clusters, GMMs represent each cluster as a probability distribution over the entire feature space. This allows for more flexible modeling of complex data distributions and enables soft assignment of data points to clusters based on their probabilities.
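A minimal sketch using the mclust package, one common GMM implementation (an assumption; these slides do not use mclust elsewhere):

library(mclust)

gmm_fit <- Mclust(iris[, 1:4], G = 3) # fit a 3-component Gaussian mixture
head(gmm_fit$z)                       # soft assignments: P(cluster | point)
head(gmm_fit$classification)          # hard labels via the highest probability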
Topic modeling is a statistical technique used to identify latent topics or themes within a collection of text documents. It aims to uncover the underlying structure of the text data by automatically clustering documents into topics based on the distribution of words across documents. Topic modeling is widely used in natural language processing (NLP) and text mining for tasks such as document clustering, information retrieval, and content analysis.
Latent Dirichlet Allocation (LDA) is a probabilistic model used for topic modeling in natural language processing (NLP). LDA assumes that each document in the corpus is generated by a probabilistic process involving a mixture of topics. It posits that documents exhibit multiple topics, and each word within a document is associated with one of these topics.
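As a quick illustration, separate from the Taylor Swift analysis below, the topicmodels package bundles an Associated Press corpus and one widely used LDA implementation (using it here is an assumption, not part of these slides):

library(topicmodels)

data("AssociatedPress", package = "topicmodels")
ap_lda <- LDA(AssociatedPress, k = 2, control = list(seed = 1234))
terms(ap_lda, 5) # top 5 words per topic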
The Structural Topic Model (STM) is an extension of the Latent Dirichlet Allocation (LDA) model that incorporates document metadata and covariates to capture the structural aspects of text data. Unlike LDA, which assumes that topics are generated independently of document metadata, STM allows for the incorporation of metadata or covariates associated with each document. Covariates could include document-level characteristics such as authorship, publication year, geographic location, or any other relevant metadata.
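A minimal sketch with the stm package's bundled gadarian survey data, letting topic prevalence vary with document-level covariates (K = 3 is an arbitrary choice here):

library(stm)

# Preprocess the bundled open-ended survey responses and their metadata
processed <- textProcessor(gadarian$open.ended.response, metadata = gadarian)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

# Let topic prevalence vary with treatment condition and party ID
stm_fit <- stm(out$documents, out$vocab, K = 3,
               prevalence = ~ treatment + s(pid_rep),
               data = out$meta, verbose = FALSE)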
library(taylor)    # taylor_album_songs: lyrics data
library(tidyverse)
library(tidytext)

tidy_taylor <-
  taylor_album_songs |>
  unnest(lyrics) |>
  unnest_tokens(word, lyric)
tidy_taylor |>
  anti_join(get_stopwords()) |>
  count(track_name, word, sort = TRUE) |>
  head(4)
#> # A tibble: 4 × 3
#>   track_name                          word      n
#>   <chr>                               <chr> <int>
#> 1 Red (Taylor's Version)              red     107
#> 2 I Did Something Bad                 di       81
#> 3 I Wish You Would (Taylor's Version) wish     81
#> 4 Shake It Off (Taylor's Version)     shake    70
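The summary that follows comes from an STM fit whose code is missing from these slides; a plausible reconstruction, inferring the object names lyrics_sparse and topic_model from the later code (the seed and verbose settings are assumptions):

library(stm)

# Cast the word counts into a sparse document-term matrix; stop words
# are kept, consistent with the top words in the summary below.
lyrics_sparse <- tidy_taylor |>
  count(track_name, word) |>
  cast_sparse(track_name, word, n)

set.seed(123) # assumed seed for reproducibility
topic_model <- stm(lyrics_sparse, K = 8, verbose = FALSE)
summary(topic_model)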
#> A topic model with 8 topics, 209 documents and a 909 word dictionary.
#> Topic 1 Top Words:
#> Highest Prob: i, was, you, the, it, like, and
#> FREX: red, isn't, snow, beach, him, was, too
#> Lift: between, hair, prayer, rare, sacred, stairs, wind
#> Score: red, snow, beach, him, was, isn't, there
#> Topic 2 Top Words:
#> Highest Prob: i, you, the, and, me, wanna, what
#> FREX: shake, wanna, wish, would, mm, bye, game
#> Lift: team, stephen, hide, fancy, tear, game, bye
#> Score: shake, wanna, wish, mm, off, fake, hung
#> Topic 3 Top Words:
#> Highest Prob: you, i, and, the, me, to, my
#> FREX: fly, left, losing, jump, go, someday, belong
#> Lift: shoulda, okay, ours, superstar, slope, lately, start
#> Score: la, times, fly, mean, bet, losing, smile
#> Topic 4 Top Words:
#> Highest Prob: the, i, we, in, you, and, of
#> FREX: woods, clear, huh, mine, car, getaway, walk
#> Lift: ready, shimmer, walk, checkin, mailbox, ridin, huh
#> Score: clear, woods, yet, daylight, out, walk, street
#> Topic 5 Top Words:
#> Highest Prob: oh, you, and, the, this, i, is
#> FREX: trouble, oh, rains, this, grow, asking, last
#> Lift: promises, sing, these, lovin, rest, usin, flew
#> Score: oh, last, trouble, asking, grow, rains, being
#> Topic 6 Top Words:
#> Highest Prob: you, the, ooh, i, and, ah, to
#> FREX: ha, starlight, ah, ooh, twenty, thing, whoa
#> Lift: bought, count, keeping, everyone's, humming, kitchen, push
#> Score: ooh, ha, ah, dorothea, starlight, twenty, you'll
#> Topic 7 Top Words:
#> Highest Prob: you, it, a, i, and, the, we
#> FREX: di, karma, blood, beautiful, wonderland, call, we've
#> Lift: deep, worship, sad, turns, felt, why's, boyfriend
#> Score: di, blood, karma, call, we've, hey, da
#> Topic 8 Top Words:
#> Highest Prob: you, i, to, the, me, been, and
#> FREX: york, welcome, mr, been, stay, i've, would've
#> Lift: guiding, caught, both, quite, beat, bright, closure
#> Score: york, welcome, stay, mr, would've, new, soundtrack
lyrics_gamma <- tidy(
  topic_model,
  matrix = "gamma",
  document_names = rownames(lyrics_sparse)
)
lyrics_gamma |>
  left_join(
    taylor_album_songs |>
      select(album_name, document = track_name) |>
      mutate(album_name = fct_inorder(album_name))
  ) |>
  mutate(topic = factor(topic)) |>
  ggplot(aes(gamma, topic, fill = topic)) +
  geom_boxplot(alpha = 0.7, show.legend = FALSE) +
  facet_wrap(vars(album_name)) +
  labs(x = expression(gamma))
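The table below looks like tidied output from stm::estimateEffect(), which regresses topic prevalence on document metadata; a plausible reconstruction (the metadata construction and the 0.1 p-value cutoff are assumptions inferred from the printed rows):

effects <- estimateEffect(
  1:8 ~ album_name,
  topic_model,
  metadata = taylor_album_songs |>
    select(track_name, album_name) |>
    arrange(track_name) # assumed to match the row order of lyrics_sparse
)

tidy(effects) |>
  filter(term != "(Intercept)", p.value < 0.1)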
#> # A tibble: 8 × 3
#>   topic term                                   p.value
#>   <int> <chr>                                    <dbl>
#> 1     1 album_nameMidnights                    0.0313
#> 2     2 album_nameMidnights                    0.0781
#> 3     3 album_nameFearless (Taylor's Version)  0.0205
#> 4     3 album_namefolklore                     0.00472
#> 5     3 album_nameSpeak Now (Taylor's Version) 0.0242
#> 6     3 album_nameTaylor Swift                 0.0289
#> 7     7 album_nameFearless (Taylor's Version)  0.0475
#> 8     7 album_nameSpeak Now (Taylor's Version) 0.0441
Here is a list of resources for exploring topics not typically covered in formal coursework.
Adapted from https://missing.csail.mit.edu/