Posts

Showing posts from April, 2016

Document Clustering

via: http://brandonrose.org/clustering#K-means-clustering

Document Clustering with Python

In this guide, I will explain how to cluster a set of documents using Python. My motivating example is to identify the latent structures within the synopses of the top 100 films of all time (per an IMDB list). See the original post for a more detailed discussion on the example.

This guide covers:

- tokenizing and stemming each synopsis
- transforming the corpus into vector space using tf-idf
- calculating cosine distance between each document as a measure of similarity
- clustering the documents using the k-means algorithm
- using multidimensional scaling to reduce dimensionality within the corpus
- plotting the clustering output using matplotlib and mpld3
- conducting a hierarchical clustering on the corpus using Ward clustering
- plotting a Ward dendrogram
- topic modeling using Latent Dirichlet Allocation (LDA)

Note that my github repo for the whole project is availab
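
To make the first few steps of the outline concrete, here is a minimal sketch of tokenizing, tf-idf weighting, and cosine distance using only the Python standard library. It is not the code from the original post: the helper names (`tokenize`, `tfidf_vectors`, `cosine_distance`), the plain `tf * log(N / df)` weighting, and the three toy synopses are all assumptions made for illustration, and stemming is left out entirely. In practice a library would normally handle these steps.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (no stemming)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors; 1.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


# Toy synopses, invented for this sketch (not the IMDB data).
synopses = [
    "A mafia family fights to keep power in New York.",
    "A crime family struggles for power after the war.",
    "A farm girl travels to a magical land with her dog.",
]
vectors = tfidf_vectors(synopses)
# Pairwise distance matrix: the input a clustering step would work from.
dist = [[cosine_distance(u, v) for v in vectors] for u in vectors]
```

With these toy inputs, the two crime synopses end up closer to each other than either is to the third, which is the property the clustering steps rely on. Terms that appear in every document (here, "a") get a weight of zero under this weighting, so they do not contribute to similarity.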