Jaccard Similarity vs Cosine Similarity

https://datascience.stackexchange.com/questions/5121/applications-and-differences-for-jaccard-similarity-and-cosine-similarity

Jaccard Similarity is given by

s_{i j} = \frac{p}{p + q + r}

where,

p = # of attributes 1 for both i and j
q = # of attributes 1 for i and 0 for j
r = # of attributes 0 for i and 1 for j
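
As a minimal sketch of the formula above, the counts p, q and r can be tallied directly from two equal-length 0/1 attribute vectors. The function name jaccard_similarity and the sample vectors are illustrative, not from the original answer, and returning 0.0 for two all-zero vectors is an assumption (conventions differ).

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two equal-length 0/1 attribute vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    p = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # 1 for both
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # 1 for i, 0 for j
    r = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # 0 for i, 1 for j
    denom = p + q + r
    # Assumption: two all-zero vectors get similarity 0.0.
    return p / denom if denom else 0.0


i = [1, 1, 0, 1, 0]
j = [1, 0, 1, 1, 0]
print(jaccard_similarity(i, j))  # p=2, q=1, r=1, so 2 / 4 = 0.5
```

Attributes that are 0 for both objects never enter the calculation, which is why the fifth position above has no effect on the result.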

Cosine similarity, on the other hand, is given by

\frac{A \cdot B}{‖ A ‖ ‖ B ‖}

where A and B are object vectors.
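
A minimal sketch of this formula in plain Python follows. The function name cosine_similarity and the sample vectors are illustrative, and returning 0.0 when either vector has zero length is an assumption, since the ratio is undefined in that case.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption: a zero vector is given similarity 0.0 (the ratio is undefined).
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 0], [0, 1]))        # orthogonal vectors: 0.0
print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # parallel vectors: close to 1
```

Unlike the Jaccard formula, this works on any real-valued vectors (for example term counts), not just 0/1 attributes, and it ignores vector magnitude.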

Simply put, for binary attribute vectors, cosine similarity divides the number of common attributes by the geometric mean of the number of attributes each object has. Jaccard similarity divides the number of common attributes by the number of attributes that exist in at least one of the two objects.
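
To make the comparison concrete, both measures can be written in terms of the counts p, q and r defined earlier, assuming A and B are 0/1 vectors. In that case A \cdot B = p, ‖ A ‖ = \sqrt{p + q} and ‖ B ‖ = \sqrt{p + r}, so

\cos(A, B) = \frac{p}{\sqrt{(p + q)(p + r)}} \quad \text{versus} \quad s_{i j} = \frac{p}{p + q + r}

For example, with p = 2, q = 1 and r = 1, the cosine similarity is 2/3 ≈ 0.67 while the Jaccard similarity is 2/4 = 0.5.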

There are many other measures of similarity, each with its own eccentricities. When deciding which one to use, try to think of a few representative cases and work out which index would give the most usable results for your objective.

The cosine index could be used to identify plagiarism, but it will not be a good index for identifying mirror sites on the internet. The Jaccard index, by contrast, will be a good index for identifying mirror sites, but not so great at catching copy-pasted plagiarism within a larger document.
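
The following toy sketch illustrates the copy-paste case, assuming documents are reduced to sets of distinct words (a simplification; real systems typically use term weights or shingles). The data and the helper names jaccard and cosine_binary are hypothetical.

```python
import math


def jaccard(a: set, b: set) -> float:
    """Shared items divided by items present in at least one set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine_binary(a: set, b: set) -> float:
    """Cosine similarity of the 0/1 membership vectors of two sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


# Hypothetical data: a 4-word passage copied into a 16-word document.
passage = {"w0", "w1", "w2", "w3"}
document = passage | {f"x{k}" for k in range(12)}

print(jaccard(passage, document))        # 4 / 16 = 0.25
print(cosine_binary(passage, document))  # 4 / sqrt(4 * 16) = 0.5
```

In this setup the Jaccard score falls in proportion to the size of the surrounding document, while the cosine score falls only with its square root, so the copied passage stays more visible under cosine. Two exact mirrors score 1.0 under both measures.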

When applying these indices, you must think about your problem thoroughly and figure out how to define similarity. Once you have a definition in mind, you can go about shopping for an index.

Edit: Earlier, I had an example included in this answer, which was ultimately incorrect. Thanks to the several users who have pointed that out, I have removed the erroneous example.
