Unsupervised Machine Learning

Unsupervised Machine Learning

  • Unsupervised Machine Learning

  • Topic Modeling with Taylor Swift

  • The Missing Statistics Semester

Supervised Machine Learning

Given:

\[ \boldsymbol X = (X_1, X_2, \cdots, X_n)^\top \] where

\[ X_i = (x_{i1}, x_{i2}, \cdots, x_{ip}) \] and

\[ \boldsymbol Y = (Y_1, Y_2, \cdots, Y_n)^\top \]

Learn a function \(f\) such that \(Y_i \approx f(X_i)\).

Unsupervised Machine Learning

Given:

\[ \boldsymbol X = (X_1, X_2, \cdots, X_n)^\top \] where

\[ X_i = (x_{i1}, x_{i2}, \cdots, x_{ip}) \]

Group the data into \(K\) categories.

Unsupervised Machine Learning

We can uncover natural groupings and structure in data with the following techniques:

  • Principal Component Analysis
  • K-Means Clustering
  • Hierarchical Clustering
  • Mixture Models

Principal Component Analysis

Principal Component Analysis (PCA) is a dimensionality reduction technique used to simplify complex datasets while retaining as much of the original variability as possible. It accomplishes this by transforming the original variables into a new set of orthogonal variables called principal components. PCA is widely used in data analysis, visualization, and machine learning for tasks such as feature extraction, data compression, and noise reduction.
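A minimal sketch with base R's prcomp(), using the built-in iris measurements as stand-in data (not part of these slides):

pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)  # standardize, then rotate
summary(pca)        # proportion of variance explained by each component
head(pca$x[, 1:2])  # scores on the first two principal components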

K-Means Clustering

K-Means clustering is one of the most popular unsupervised machine learning algorithms used for partitioning a dataset into a predetermined number of clusters. It aims to group similar data points together and discover underlying patterns or structures within the data. K-Means is simple, efficient, and widely applicable in various domains, including data analysis, image processing, and customer segmentation.
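A minimal sketch with base R's kmeans() (iris again as illustrative data; choosing K = 3 here is an assumption, not something derived from the data):

set.seed(123)                                # k-means starts from random centers
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)
km$size                                      # points per cluster
table(km$cluster, iris$Species)              # compare clusters to known labels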

Hierarchical Clustering

Hierarchical clustering is a method used to cluster data into a hierarchy of clusters. Unlike K-Means, which requires specifying the number of clusters upfront, hierarchical clustering builds a tree-like structure (dendrogram) that reflects the relationships between data points at different levels of granularity. Hierarchical clustering can be divided into two main types: agglomerative and divisive.
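A minimal agglomerative sketch with base R; complete linkage is just one of several linkage choices:

d  <- dist(scale(iris[, 1:4]))        # pairwise Euclidean distances
hc <- hclust(d, method = "complete")  # merge closest clusters bottom-up
plot(hc)                              # dendrogram
cutree(hc, k = 3)                     # cut the tree into 3 clusters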

Mixture Models

Mixture models for clustering, most commonly Gaussian Mixture Models (GMMs), are probabilistic models that describe the data as a mixture of several Gaussian distributions. Unlike K-Means or hierarchical clustering, which assign each data point to a single discrete cluster, GMMs represent each cluster as a probability distribution over the feature space. This allows more flexible modeling of complex data distributions and enables soft assignment of data points to clusters based on their membership probabilities.
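A minimal sketch assuming the mclust package, which is not used elsewhere in these slides:

library(mclust)                    # assumption: mclust is installed
gmm <- Mclust(iris[, 1:4], G = 3)  # fit a 3-component Gaussian mixture
summary(gmm)                       # model type and cluster sizes
head(gmm$z)                        # soft assignments: P(cluster | point)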

Topic Modeling with Taylor Swift

  • Unsupervised Machine Learning

  • Topic Modeling with Taylor Swift

  • The Missing Statistics Semester

Topic Modeling

Topic modeling is a statistical technique used to identify latent topics or themes within a collection of text documents. It aims to uncover the underlying structure of the text data by automatically clustering documents into topics based on the distribution of words across documents. Topic modeling is widely used in natural language processing (NLP) and text mining for tasks such as document clustering, information retrieval, and content analysis.

Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a probabilistic model used for topic modeling in natural language processing (NLP). LDA assumes that each document in the corpus is generated by a probabilistic process involving a mixture of topics. It posits that documents exhibit multiple topics, and each word within a document is associated with one of these topics.
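A minimal sketch assuming the topicmodels package and its bundled AssociatedPress document-term matrix (neither appears elsewhere in these slides):

library(topicmodels)                        # assumption: topicmodels is installed
data("AssociatedPress")                     # DTM bundled with the package
lda <- LDA(AssociatedPress[1:50, ], k = 4,  # small subset keeps the fit quick
           control = list(seed = 123))
terms(lda, 5)                               # top 5 words per topic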

Structural Topic Model

The Structural Topic Model (STM) is an extension of the Latent Dirichlet Allocation (LDA) model that incorporates document metadata and covariates to capture the structural aspects of text data. Unlike LDA, which assumes that topics are generated independently of document metadata, STM allows for the incorporation of metadata or covariates associated with each document. Covariates could include document-level characteristics such as authorship, publication year, geographic location, or any other relevant metadata.
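A hedged sketch of the distinguishing feature, the prevalence covariate argument, following the stm package's own vignette example (the gadarian survey data and its treatment and pid_rep columns ship with stm); the Taylor Swift example below instead fits the model without covariates and estimates effects afterwards:

library(stm)
# gadarian: open-ended survey responses bundled with stm
processed <- textProcessor(gadarian$open.ended.response, metadata = gadarian)
prepped   <- prepDocuments(processed$documents, processed$vocab, processed$meta)
fit <- stm(prepped$documents, prepped$vocab, K = 3,
           prevalence = ~ treatment + s(pid_rep),  # document-level covariates
           data = prepped$meta, verbose = FALSE)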

Text Mining Resource

R Packages

library(tidyverse)  # data wrangling and plotting
library(taylor)     # Taylor Swift song and lyric data
library(tidytext)   # tidy tokenization and sparse casting
library(stm)        # structural topic models

Data Cleaning

tidy_taylor <-
  taylor_album_songs |>
  unnest(lyrics) |>           # one row per lyric line
  unnest_tokens(word, lyric)  # one row per word

tidy_taylor |>
  anti_join(get_stopwords()) |>            # drop stop words for this preview
  count(track_name, word, sort = TRUE) |>
  head(4)
#> # A tibble: 4 × 3
#>   track_name                          word      n
#>   <chr>                               <chr> <int>
#> 1 Red (Taylor's Version)              red     107
#> 2 I Did Something Bad                 di       81
#> 3 I Wish You Would (Taylor's Version) wish     81
#> 4 Shake It Off (Taylor's Version)     shake    70
lyrics_sparse <-
  tidy_taylor |>
  count(track_name, word) |>
  filter(n > 3) |>                   # keep words appearing > 3 times in a track
  cast_sparse(track_name, word, n)   # track-by-word sparse count matrix

Topic Modeling

set.seed(123)  # for reproducibility
topic_model <- stm(lyrics_sparse, K = 8, verbose = FALSE)  # fit an 8-topic model

Summary

summary(topic_model)
#> A topic model with 8 topics, 209 documents and a 909 word dictionary.
#> Topic 1 Top Words:
#>       Highest Prob: i, was, you, the, it, like, and 
#>       FREX: red, isn't, snow, beach, him, was, too 
#>       Lift: between, hair, prayer, rare, sacred, stairs, wind 
#>       Score: red, snow, beach, him, was, isn't, there 
#> Topic 2 Top Words:
#>       Highest Prob: i, you, the, and, me, wanna, what 
#>       FREX: shake, wanna, wish, would, mm, bye, game 
#>       Lift: team, stephen, hide, fancy, tear, game, bye 
#>       Score: shake, wanna, wish, mm, off, fake, hung 
#> Topic 3 Top Words:
#>       Highest Prob: you, i, and, the, me, to, my 
#>       FREX: fly, left, losing, jump, go, someday, belong 
#>       Lift: shoulda, okay, ours, superstar, slope, lately, start 
#>       Score: la, times, fly, mean, bet, losing, smile 
#> Topic 4 Top Words:
#>       Highest Prob: the, i, we, in, you, and, of 
#>       FREX: woods, clear, huh, mine, car, getaway, walk 
#>       Lift: ready, shimmer, walk, checkin, mailbox, ridin, huh 
#>       Score: clear, woods, yet, daylight, out, walk, street 
#> Topic 5 Top Words:
#>       Highest Prob: oh, you, and, the, this, i, is 
#>       FREX: trouble, oh, rains, this, grow, asking, last 
#>       Lift: promises, sing, these, lovin, rest, usin, flew 
#>       Score: oh, last, trouble, asking, grow, rains, being 
#> Topic 6 Top Words:
#>       Highest Prob: you, the, ooh, i, and, ah, to 
#>       FREX: ha, starlight, ah, ooh, twenty, thing, whoa 
#>       Lift: bought, count, keeping, everyone's, humming, kitchen, push 
#>       Score: ooh, ha, ah, dorothea, starlight, twenty, you'll 
#> Topic 7 Top Words:
#>       Highest Prob: you, it, a, i, and, the, we 
#>       FREX: di, karma, blood, beautiful, wonderland, call, we've 
#>       Lift: deep, worship, sad, turns, felt, why's, boyfriend 
#>       Score: di, blood, karma, call, we've, hey, da 
#> Topic 8 Top Words:
#>       Highest Prob: you, i, to, the, me, been, and 
#>       FREX: york, welcome, mr, been, stay, i've, would've 
#>       Lift: guiding, caught, both, quite, beat, bright, closure 
#>       Score: york, welcome, stay, mr, would've, new, soundtrack

Plot

lyrics_gamma <- tidy(
  topic_model,
  matrix = "gamma",                         # per-document topic proportions
  document_names = rownames(lyrics_sparse)
)

lyrics_gamma |>
  left_join(
    taylor_album_songs |>
      select(album_name, document = track_name) |>
      mutate(album_name = fct_inorder(album_name))  # albums in release order
  ) |>
  mutate(topic = factor(topic)) |>
  ggplot(aes(gamma, topic, fill = topic)) +
  geom_boxplot(alpha = 0.7, show.legend = FALSE) +
  facet_wrap(vars(album_name)) +
  labs(x = expression(gamma))

Significant Effects

set.seed(909)

effects <-
  estimateEffect(
    1:8 ~ album_name,  # regress all 8 topic proportions on album
    topic_model,
    # metadata, arranged to match the row order of lyrics_sparse
    taylor_album_songs |> distinct(track_name, album_name) |> arrange(track_name)
  )


tidy(effects) |>  
  filter(term != "(Intercept)", p.value < 0.1) |> 
  select(topic, term, p.value)
#> # A tibble: 8 × 3
#>   topic term                                   p.value
#>   <int> <chr>                                    <dbl>
#> 1     1 album_nameMidnights                    0.0313 
#> 2     2 album_nameMidnights                    0.0781 
#> 3     3 album_nameFearless (Taylor's Version)  0.0205 
#> 4     3 album_namefolklore                     0.00472
#> 5     3 album_nameSpeak Now (Taylor's Version) 0.0242 
#> 6     3 album_nameTaylor Swift                 0.0289 
#> 7     7 album_nameFearless (Taylor's Version)  0.0475 
#> 8     7 album_nameSpeak Now (Taylor's Version) 0.0441

Topic 3

tidy(topic_model, matrix = "lift") |>  # terms ranked by lift within each topic
  filter(topic == 3)
#> # A tibble: 909 × 2
#>    topic term     
#>    <int> <chr>    
#>  1     3 shoulda  
#>  2     3 okay     
#>  3     3 ours     
#>  4     3 superstar
#>  5     3 slope    
#>  6     3 lately   
#>  7     3 start    
#>  8     3 under    
#>  9     3 peace    
#> 10     3 lover    
#> # ℹ 899 more rows

The Missing Statistics Semester

  • Unsupervised Machine Learning

  • Topic Modeling with Taylor Swift

  • The Missing Statistics Semester

The Missing Statistics Semester

Here is a list of resources covering topics you may not have encountered in your coursework.

Adapted from https://missing.csail.mit.edu/

Introduction to Statistics

Statistical Computing

  • Computational Statistics (2009, Springer; Download from CSUCI Library)
  • Basic Elements of Computational Statistics (2017, Springer; Download from CSUCI)
  • Optimization (2013, Springer; Download from CSUCI)

Regression

  • Beyond Linear Regression
  • Regression Modeling Strategies (Download from CSUCI Library)
  • Linear Models
  • Generalized Linear Models With Examples in R (Download from CSUCI Library)
  • Linear and Generalized Linear Mixed Models and Their Applications (2nd Edition) (Download from CSUCI Library)
  • Vector Generalized Linear and Additive Models; Yee (Download from CSUCI Library)

Other Statistics Resources

R Programming

Python Programming

SQL

Shell-Terminal

Git

Markdown

Dashboards

Other Programming