Document Clustering
via: Document Clustering with Python

In this guide, I will explain how to cluster a set of documents using Python. My motivating example is to identify the latent structures within the synopses of the top 100 films of all time (per an IMDB list). See the original post for a more detailed discussion of the example.

This guide covers:

- tokenizing and stemming each synopsis
- transforming the corpus into vector space using tf-idf
- calculating the cosine distance between each pair of documents as a measure of similarity
- clustering the documents using the k-means algorithm
- using multidimensional scaling to reduce dimensionality within the corpus
- plotting the clustering output using matplotlib and mpld3
- conducting a hierarchical clustering on the corpus using Ward clustering
- plotting a Ward dendrogram
- topic modeling using Latent Dirichlet Allocation (LDA)
...
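To make the middle steps of that list concrete (tf-idf, cosine distance, k-means), here is a minimal sketch that uses only the Python standard library. It is not the code from the original post: the four-line toy corpus, the function names, and the smoothed idf formula are all assumptions chosen for illustration, tokenizing is a plain regex split with no stemming, and k-means is a bare Lloyd's loop with a deterministic starting point. In practice a library would normally handle each of these steps.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus standing in for the film synopses.
synopses = [
    "mafia family fights rival gangsters for power",
    "gangsters and a mafia boss struggle for power in the city",
    "a soldier survives the war while the army retreats",
    "the army sends a young soldier into the war",
]

def tokenize(text):
    """Lowercase and keep runs of letters (no stemming in this sketch)."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    """Return one unit-length tf-idf vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed idf (one common variant, assumed here).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors

def cosine_distance(u, v):
    """1 - cosine similarity; the inputs are already unit length."""
    return 1.0 - sum(a * b for a, b in zip(u, v))

def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(vectors, k, iterations=20):
    """Lloyd's algorithm with a deterministic farthest-point start."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Next seed: the vector farthest from its nearest centroid.
        centroids.append(max(
            vectors,
            key=lambda v: min(squared_euclidean(v, c) for c in centroids),
        ))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        labels = [
            min(range(k), key=lambda j: squared_euclidean(v, centroids[j]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

vectors = tfidf_vectors(synopses)
# Pairwise cosine distance matrix: 0 means identical direction, 1 means no shared terms.
dist = [[cosine_distance(u, v) for v in vectors] for u in vectors]
labels = kmeans(vectors, k=2)
print(labels)
```

On this toy corpus the two crime synopses should share one label and the two war synopses the other, because the pairs have almost no vocabulary in common. The farthest-point start is only there to make the run repeatable; a random start with several restarts is the more usual choice.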