Showing posts from April, 2016

Document Clustering


Document Clustering with Python

In this guide, I will explain how to cluster a set of documents using Python. My motivating example is to identify the latent structures within the synopses of the top 100 films of all time (per an IMDB list). See the original post for a more detailed discussion on the example.

This guide covers:

- tokenizing and stemming each synopsis
- transforming the corpus into vector space using tf-idf
- calculating cosine distance between each document as a measure of similarity
- clustering the documents using the k-means algorithm
- using multidimensional scaling to reduce dimensionality within the corpus
- plotting the clustering output using matplotlib and mpld3
- conducting a hierarchical clustering on the corpus using Ward clustering
- plotting a Ward dendrogram
- topic modeling using Latent Dirichlet Allocation (LDA)

Note that my github repo for the whole project is available. The 'cluster_analysis' workbook is fully…
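
To make the tf-idf and cosine distance steps from the list above a little more concrete, here is a minimal sketch using only the Python standard library. It is not the code from the project: the three toy synopses, the whitespace tokenizer (no stemming), the plain log(N/df) idf weighting, and the helper names tokenize, tfidf_vectors, and cosine_distance are all assumptions made for illustration. A real pipeline would more likely rely on a library implementation.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase, split on whitespace, keep purely alphabetic tokens.
    # Stemming is deliberately left out of this sketch.
    return [tok for tok in text.lower().split() if tok.isalpha()]


def tfidf_vectors(docs):
    # Returns one sparse vector (dict of term -> weight) per document.
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            # Relative term frequency times unsmoothed idf (an assumed variant).
            term: (count / len(toks)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_distance(vec_a, vec_b):
    # 1 - cosine similarity; defined as 1.0 when either vector is all zeros.
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical toy synopses, invented for this example.
synopses = [
    "a mob boss hands the family empire to his son",
    "the son of a mob boss takes over the family business",
    "a farm girl and her dog travel to a magical land",
]

vectors = tfidf_vectors(synopses)
# Pairwise distance matrix: distances[i][j] compares synopsis i with synopsis j.
distances = [[cosine_distance(a, b) for b in vectors] for a in vectors]

for row in distances:
    print([round(value, 3) for value in row])
```

A matrix like distances is the kind of input that the later steps (k-means, multidimensional scaling, Ward clustering) would work from. With these toy synopses, the two mob-family texts end up closer to each other than either is to the third.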