Cosine similarity


Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle in the interval (0, π]. It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors oriented at 90° relative to each other have a similarity of 0, and two vectors diametrically opposed have a similarity of −1, independent of their magnitude. The cosine similarity is particularly used in positive space, where the outcome is neatly bounded in $[0, 1]$. The name derives from the term "direction cosine": in this case, unit vectors are maximally "similar" if they're parallel and maximally "dissimilar" if they're orthogonal (perpendicular). This is analogous to the cosine, which is unity (maximum value) when the segments subtend a zero angle and zero (uncorrelated) when the segments are perpendicular.

These bounds apply for any number of dimensions, and the cosine similarity is most commonly used in high-dimensional positive spaces. For example, in information retrieval and text mining, each term is notionally assigned a different dimension and a document is characterised by a vector where the value in each dimension corresponds to the number of times the term appears in the document. Cosine similarity then gives a useful measure of how similar two documents are likely to be in terms of their subject matter.[1]

The technique is also used to measure cohesion within clusters in the field of data mining.[2]

The term cosine distance is often used for the complement in positive space, that is: $D_C(A, B) := 1 - S_C(A, B)$, where $D_C$ is the cosine distance and $S_C$ is the cosine similarity. It is important to note, however, that this is not a proper distance metric as it does not satisfy the triangle inequality (or, more formally, the Schwarz inequality) and it violates the coincidence axiom; to repair the triangle inequality property while maintaining the same ordering, it is necessary to convert to angular distance (see below).

One of the reasons cosine similarity is so popular is the efficiency of its evaluation, especially for sparse vectors, as only the non-zero dimensions need to be considered.
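
As an informal illustration of this sparse evaluation, a minimal Python sketch follows; the function name and data layout are hypothetical, not part of any standard definition:

    import math

    def sparse_cosine_similarity(a, b):
        # Sparse vectors are stored as {dimension: value} dictionaries,
        # so only the non-zero dimensions are ever touched.
        small, large = (a, b) if len(a) <= len(b) else (b, a)
        dot = sum(v * large.get(k, 0.0) for k, v in small.items())
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b)

    # Two sparse term-count vectors that share only some dimensions.
    doc1 = {"cat": 2, "sat": 1, "mat": 1}
    doc2 = {"cat": 1, "dog": 3, "mat": 1}
    print(sparse_cosine_similarity(doc1, doc2))  # ~0.37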

Other names for cosine similarity include coefficient of proportionality, Orchini similarity, and Tucker coefficient of congruence; Ochiai similarity (see below) is cosine similarity as applied to binary data.

Definition

The cosine of two non-zero vectors can be derived by using the Euclidean dot product formula:

$\mathbf{A} \cdot \mathbf{B} = \left\| \mathbf{A} \right\| \left\| \mathbf{B} \right\| \cos\theta$

Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitude as

$\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}},$

where $A_i$ and $B_i$ are components of vector $\mathbf{A}$ and $\mathbf{B}$ respectively.

The resulting similarity ranges from −1 meaning exactly opposite, to 1 meaning exactly the same, with 0 indicating orthogonality or decorrelation, while in-between values indicate intermediate similarity or dissimilarity.
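
A brief sketch of this definition in Python, assuming NumPy (the helper name is illustrative):

    import numpy as np

    def cosine_similarity(A, B):
        # cos(theta) = (A . B) / (||A|| ||B||), per the formula above.
        return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

    A = np.array([1.0, 2.0, 3.0])
    print(cosine_similarity(A, 2 * A))  #  1.0: same orientation
    print(cosine_similarity(A, -A))     # -1.0: exactly opposite
    print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0: orthogonal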

For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. Cosine similarity can be seen as a method of normalizing document length during comparison.

In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf–idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°.

If the attribute vectors are normalized by subtracting the vector means (e.g., $A - \bar{A}$), the measure is called the centered cosine similarity and is equivalent to the Pearson correlation coefficient. For an example of centering: if $A = [A_1, A_2]^T$, then $\bar{A} = \left[ \frac{A_1 + A_2}{2}, \frac{A_1 + A_2}{2} \right]^T$, so $A - \bar{A} = \left[ \frac{A_1 - A_2}{2}, \frac{A_2 - A_1}{2} \right]^T$.
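
A brief numerical check of this equivalence, assuming NumPy (the function name is illustrative):

    import numpy as np

    def centered_cosine(A, B):
        # Cosine similarity of the mean-centered vectors.
        Ac, Bc = A - A.mean(), B - B.mean()
        return np.dot(Ac, Bc) / (np.linalg.norm(Ac) * np.linalg.norm(Bc))

    A = np.array([1.0, 3.0, 5.0, 7.0])
    B = np.array([2.0, 3.0, 9.0, 8.0])
    print(centered_cosine(A, B))    # ~0.882
    print(np.corrcoef(A, B)[0, 1])  # identical: the Pearson correlation coefficient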

Angular distance and similarity

The term "cosine similarity" is sometimes used to refer to a different definition of similarity provided below. However the most common use of "cosine similarity" is as defined above and the similarity and distance metrics defined below are referred to as "angular similarity" and "angular distance" respectively. The normalized angle between the vectors is a formal distance metric and can be calculated from the similarity score defined above [citation needed]. This angular distance metric can then be used to compute a similarity function bounded between 0 and 1, inclusive.

When the vector elements may be positive or negative:

$\text{angular distance} = \frac{\cos^{-1}(\text{cosine similarity})}{\pi}$

$\text{angular similarity} = 1 - \text{angular distance}$

Or, if the vector elements are always positive:

$\text{angular distance} = \frac{2 \cdot \cos^{-1}(\text{cosine similarity})}{\pi}$

$\text{angular similarity} = 1 - \text{angular distance}$
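
These two conversions can be sketched in Python as follows (the flag name positive_space is hypothetical):

    import math

    def angular_distance(cos_sim, positive_space=False):
        # Normalized angle between the vectors; a proper distance metric.
        factor = 2.0 if positive_space else 1.0
        return factor * math.acos(cos_sim) / math.pi

    def angular_similarity(cos_sim, positive_space=False):
        return 1.0 - angular_distance(cos_sim, positive_space)

    print(angular_distance(1.0))        # 0.0: identical directions
    print(angular_distance(0.0))        # 0.5 when elements may be negative...
    print(angular_distance(0.0, True))  # ...but 1.0 in positive space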

Although the term "cosine similarity" has been used for this angular distance, the cosine of the angle is used only as a convenient mechanism for calculating the angle itself and is no part of the meaning. The advantage of the angular similarity coefficient is that, when used as a difference coefficient (by subtracting it from 1), the resulting function is a proper distance metric, which is not the case for the first meaning. However, for most uses this is not an important property. For any use where only the relative ordering of similarity or distance within a set of vectors is important, which function is used is immaterial, as the resulting order will be unaffected by the choice.

Confusion with "Tanimoto" coefficient

The cosine similarity may be easily confused with the Tanimoto metric, a specialised similarity coefficient with a similar algebraic form:

$T(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|^2 + \|\mathbf{B}\|^2 - \mathbf{A} \cdot \mathbf{B}}$

In fact, this algebraic form was first defined by Tanimoto as a mechanism for calculating the Jaccard coefficient in the case where the sets being compared are represented as bit vectors. While the formula extends to vectors in general, it has quite different properties from cosine similarity and bears little relation other than its superficial appearance.
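
A short sketch of the relation to the Jaccard coefficient on bit vectors, assuming NumPy (the function name is illustrative):

    import numpy as np

    def tanimoto(A, B):
        # T(A, B) = (A . B) / (||A||^2 + ||B||^2 - A . B)
        dot = np.dot(A, B)
        return dot / (np.dot(A, A) + np.dot(B, B) - dot)

    # On bit vectors the Tanimoto coefficient equals the Jaccard index.
    A = np.array([1, 1, 0, 1])  # the set {0, 1, 3}
    B = np.array([0, 1, 1, 1])  # the set {1, 2, 3}
    print(tanimoto(A, B))       # 0.5 = |{1, 3}| / |{0, 1, 2, 3}|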

Ochiai coefficient

In biology, there is a similar concept known as the Ochiai coefficient (also known as the Ochiai-Barkman or Otsuka-Ochiai coefficient), which can be represented as:[3][4]

$K = \frac{n(A \cap B)}{\sqrt{n(A) \times n(B)}}$

Here, $A$ and $B$ are sets, and $n(A)$ is the number of elements in $A$. If sets are represented as bit vectors, the Ochiai coefficient can be seen to be the same as the cosine similarity.
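
For instance, a minimal Python sketch (the function name is illustrative):

    import math

    def ochiai(A, B):
        # K = n(A intersect B) / sqrt(n(A) * n(B)) for sets A and B.
        return len(A & B) / math.sqrt(len(A) * len(B))

    A = {"cat", "sat", "mat"}
    B = {"cat", "dog", "mat"}
    print(ochiai(A, B))  # 2/3: identical to the cosine similarity of the
                         # corresponding bit vectors [1,1,1,0] and [1,0,1,1]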

Properties

Cosine similarity is related to Euclidean distance as follows. Denote Euclidean distance by the usual $\|A - B\|$, and observe that

$\|A - B\|^2 = (A - B) \cdot (A - B) = \|A\|^2 + \|B\|^2 - 2 (A \cdot B)$

by expansion. When A and B are normalized to unit length, $\|A\|^2 = \|B\|^2 = 1$, so this expression is equal to

$2 (1 - \cos(A, B)).$

This Euclidean distance (squared here) is called the chord distance, because it is the length of the chord on the unit circle; it is the Euclidean distance between the vectors after they have been normalized to unit sum of squared values.
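
A quick numerical verification of the identity above, assuming NumPy:

    import numpy as np

    A = np.array([3.0, 4.0])
    B = np.array([5.0, 12.0])
    A, B = A / np.linalg.norm(A), B / np.linalg.norm(B)  # unit length

    lhs = np.sum((A - B) ** 2)        # squared Euclidean (chord) distance
    rhs = 2.0 * (1.0 - np.dot(A, B))  # 2 (1 - cos(A, B))
    print(np.isclose(lhs, rhs))       # True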

Null distribution: For data which can be negative as well as positive, the null distribution for cosine similarity is the distribution of the dot product of two independent random unit vectors. This distribution has a mean of zero and a variance of $1/n$ (where $n$ is the number of dimensions), and although the distribution is bounded between −1 and +1, as $n$ grows large the distribution is increasingly well-approximated by the normal distribution.[5][6] For other types of data, such as bitstreams which take only the values 0 or 1, the null distribution takes a different form and may have a nonzero mean.[7]
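
A small simulation of this null distribution, assuming NumPy (the dimension count and sample size below are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    n, trials = 100, 50_000

    # Random unit vectors: normalize standard Gaussian samples.
    u = rng.standard_normal((trials, n))
    v = rng.standard_normal((trials, n))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    v /= np.linalg.norm(v, axis=1, keepdims=True)

    sims = np.sum(u * v, axis=1)  # cosine similarity = dot product of unit vectors
    print(sims.mean())            # ~0
    print(sims.var(), 1.0 / n)    # both ~0.01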

Soft cosine measure

A soft cosine (or "soft" similarity) between two vectors considers similarities between pairs of features.[8] The traditional cosine similarity considers the vector space model (VSM) features as independent or completely different, while the soft cosine measure proposes considering the similarity of features in VSM, which helps generalize the concept of cosine (and soft cosine) as well as the idea of (soft) similarity.

For example, in the field of natural language processing (NLP) the similarity among features is quite intuitive. Features such as words, n-grams, or syntactic n-grams[9] can be quite similar, though formally they are considered as different features in the VSM. For example, the words “play” and “game” are different words and thus mapped to different dimensions in VSM; yet it is obvious that they are semantically related. In the case of n-grams or syntactic n-grams, Levenshtein distance can be applied (in fact, Levenshtein distance can be applied to words as well).

For calculating soft cosine, the matrix s is used to indicate similarity between features. It can be calculated through Levenshtein distance, WordNet similarity, or other similarity measures. This matrix then weights every pair of coordinates in both the numerator and the denominator of the cosine formula, as shown below.

Given two N-dimension vectors $a$ and $b$, the soft cosine similarity is calculated as follows:

$\text{soft\_cosine}(a, b) = \frac{\sum_{i,j}^{N} s_{ij} a_i b_j}{\sqrt{\sum_{i,j}^{N} s_{ij} a_i a_j} \, \sqrt{\sum_{i,j}^{N} s_{ij} b_i b_j}},$

where $s_{ij} = \text{similarity}(\text{feature}_i, \text{feature}_j)$.

If there is no similarity between features ($s_{ii} = 1$, $s_{ij} = 0$ for $i \neq j$), the given equation is equivalent to the conventional cosine similarity formula.
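
A minimal Python sketch of this formula, assuming NumPy (the similarity matrix below is invented for illustration):

    import numpy as np

    def soft_cosine(a, b, S):
        # Soft cosine with feature-similarity matrix S (the s_ij above).
        return (a @ S @ b) / (np.sqrt(a @ S @ a) * np.sqrt(b @ S @ b))

    # Features: "play", "game", "weather"; the first two are related.
    S = np.array([[1.0, 0.7, 0.0],
                  [0.7, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
    a = np.array([1.0, 0.0, 0.0])        # a document using only "play"
    b = np.array([0.0, 1.0, 0.0])        # a document using only "game"
    print(soft_cosine(a, b, S))          # 0.7: related despite no shared feature
    print(soft_cosine(a, b, np.eye(3)))  # 0.0: reduces to ordinary cosine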

The time complexity of this measure is quadratic, which still makes it applicable to real-world tasks, and the complexity can be reduced to subquadratic.[citation needed]

References

  1. ^ Singhal, Amit (2001). "Modern Information Retrieval: A Brief Overview". Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 24 (4): 35–43.
  2. ^ P.-N. Tan, M. Steinbach & V. Kumar, Introduction to Data Mining, Addison-Wesley (2005), ISBN 0-321-32136-7, chapter 8; page 500.
  3. ^ Ochiai, A. (1957). "Zoogeographical studies on the soleoid fishes found in Japan and its neighbouring regions. II". Bulletin of the Japanese Society of Scientific Fisheries. 22 (9): 526–530.
  4. ^ Barkman, J. J. (1958). Phytosociology and Ecology of Cryptogamic Epiphytes, Including a Taxonomic Survey and Description of Their Vegetation Units in Europe. Assen: Van Gorcum. 628 pp.
  5. ^ Spruill, Marcus C. (2007). "Asymptotic distribution of coordinates on high dimensional spheres". Electronic Communications In Probability. 12: 234–247. doi:10.1214/ECP.v12-1294.
  6. ^ "CrossValidated: Distribution of dot products between two random unit vectors in RD"
  7. ^ Graham L. Giller (2012). "The Statistical Properties of Random Bitstreams and the Sampling Distribution of Cosine Similarity". Giller Investments Research Notes (20121024/1). doi:10.2139/ssrn.2167044.
  8. ^ Sidorov, Grigori; Gelbukh, Alexander; Gómez-Adorno, Helena; Pinto, David. "Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model". Computación y Sistemas. 18 (3): 491–504. doi:10.13053/CyS-18-3-2043. Retrieved 7 October 2014.
  9. ^ Sidorov, Grigori; Velasquez, Francisco; Stamatatos, Efstathios; Gelbukh, Alexander; Chanona-Hernández, Liliana. Syntactic Dependency-based N-grams as Classification Features. LNAI 7630. pp. 1–11. ISBN 978-3-642-37798-3. Retrieved 7 October 2014.