An Overview of Latent Semantic Indexing

Latent semantic indexing (LSI) is a technique that
projects queries and documents into a space with
latent semantic dimensions.

In the latent semantic space, a query and a document
can be similar even if they share no terms, as long as
their terms are semantically similar.
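
For contrast, here is a minimal sketch of what happens
in the original term space, where similarity depends on
word overlap alone. The query and document strings are
made-up examples, and cosine similarity over raw term
counts is assumed as the overlap measure.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical query and document that are related in meaning
# but have no term in common.
query = Counter("car engine".split())
doc = Counter("automobile motor".split())

overlap_score = cosine(query, doc)
print(overlap_score)
```

Because the two texts share no terms, the dot product is
empty and the score is zero, however closely related
the words are in meaning.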

LSI is thus a similarity metric that offers an
alternative to word overlap measures. The latent space
has fewer dimensions than the original space, so LSI is
also a method for dimensionality reduction.

This reduction takes a set of objects that exist in a
high-dimensional space and represents them in a
lower-dimensional space instead.

The objects are often represented in a two- or
three-dimensional space purely for visualization. LSI
itself is the application of a mathematical technique
called singular value decomposition (SVD) to a
term-by-document matrix.
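
As an illustration, the following sketch builds a tiny
term-by-document matrix, truncates its singular value
decomposition to two dimensions, and compares a query
with a document that shares none of its terms. The
vocabulary, the documents, and the helper names
(to_vector, project, cosine) are invented for this
example. Scaling the projection by the inverse singular
values is one common convention for folding in queries,
assumed here rather than prescribed.

```python
import numpy as np

# Toy vocabulary and corpus (hypothetical data).
terms = ["car", "automobile", "engine", "motor", "banana", "fruit"]
docs = [
    "car engine",
    "automobile motor",
    "car automobile",
    "engine motor",
    "banana fruit",
    "banana banana fruit",
]

def to_vector(text):
    """Count how often each vocabulary term occurs in the text."""
    words = text.split()
    return np.array([words.count(t) for t in terms], dtype=float)

# Term-by-document matrix: one row per term, one column per document.
A = np.column_stack([to_vector(d) for d in docs])

# Singular value decomposition, keeping the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k = U[:, :k], s[:k]

def project(vec):
    """Fold a term-space vector into the k-dimensional latent space."""
    return (U_k.T @ vec) / s_k

def cosine(a, b):
    """Cosine similarity, defined as 0.0 for zero-length vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

query = to_vector("car engine")
related = to_vector("automobile motor")   # no term shared with the query
unrelated = to_vector("banana fruit")

term_space = cosine(query, related)
latent_related = cosine(project(query), project(related))
latent_unrelated = cosine(project(query), project(unrelated))

print(term_space, latent_related, latent_unrelated)
```

In this toy corpus, "car" and "automobile" co-occur in
one document and "engine" and "motor" in another, which
is what lets the latent dimensions tie the query to the
document even though their term-space similarity is
zero.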

The projection into the LSI space is chosen so that
the representations in the original space are changed
as little as possible, as measured by the sum of the
squares of the differences.

There are many possible mappings from a
high-dimensional space to a low-dimensional one.

LSI chooses the mapping that is optimal in the sense
that it minimizes this sum-of-squares distance.
Choosing the number of dimensions is a problem in its
own right.
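
The sense in which the mapping is optimal can be checked
numerically. The sketch below uses a random matrix as a
stand-in for real term-by-document data, which is an
assumption made only for illustration. It compares the
sum of squared differences left by the truncated
decomposition with the error left by a projection onto
an arbitrary subspace of the same dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((8, 6))  # stand-in for a term-by-document matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2

# Rank-k reconstruction from the k largest singular values.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
svd_error = float(np.sum((A - A_k) ** 2))

# The error equals the sum of the squared discarded singular values.
discarded = float(np.sum(s[k:] ** 2))

# An alternative rank-k mapping: project onto a random
# k-dimensional subspace with an orthonormal basis Q.
Q, _ = np.linalg.qr(rng.standard_normal((8, k)))
A_rand = Q @ (Q.T @ A)
random_error = float(np.sum((A - A_rand) ** 2))

print(svd_error, discarded, random_error)
```

The truncated decomposition never does worse than the
random projection. The same quantities can be
recomputed for several values of k to see how much of
the matrix each extra dimension accounts for.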

Reducing the number of dimensions can remove much of
the noise, but keeping too few dimensions may lose
important information. LSI performance improves
considerably after ten to twenty dimensions and peaks
at seventy to one hundred dimensions.

Performance then slowly begins to decline again. A
similar pattern of performance is observed with other
datasets as well.