Tip Top Code

Posts

Showing posts from April, 2016

Document Clustering

April 20, 2016

via: http://brandonrose.org/clustering#K-means-clustering Document Clustering with Python In this guide, I will explain how to cluster a set of documents using Python. My motivating example is to identify the latent structures within the synopses of the top 100 films of all time (per an IMDB list). See the original post for a more detailed discussion on the example. This guide covers: tokenizing and stemming each synopsis transforming the corpus into vector space using tf-idf calculating cosine distance between each document as a measure of similarity clustering the documents using the k-means algorithm using multidimensional scaling to reduce dimensionality within the corpus plotting the clustering output using matplotlib and mpld3 conducting a hierarchical clustering on the corpus using Ward clustering plotting a Ward dendrogram topic modeling using Latent Dirichlet Allocation (LDA) ...