Latent semantic indexing (LSI) is a technique that projects
queries and documents into a space with latent semantic
dimensions.

In the latent semantic space, a query and a document can
be similar even if they share no terms, as long as their
terms are semantically similar.
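
To make this concrete, here is a minimal sketch in Python with NumPy. The vocabulary, the toy counts, the choice of two latent dimensions, and the helper names cosine and to_latent are all assumptions made for illustration; they are not taken from any particular LSI implementation. The query contains only "car", the document contains only "automobile" and "engine", and the two are linked through other documents that use "engine".

```python
import numpy as np

# Toy term-by-document matrix (rows: terms, columns: documents).
# All counts are invented for illustration.
A = np.array([
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 0],  # petal
], dtype=float)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Thin SVD; keep the k strongest latent dimensions (k = 2 is an assumption).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k = U[:, :k]

def to_latent(term_vector):
    """Project a vector over terms into the k-dimensional latent space."""
    return U_k.T @ term_vector

query = np.array([1, 0, 0, 0, 0], dtype=float)  # contains only "car"
doc_auto = A[:, 1]      # contains "automobile" and "engine", but not "car"
doc_flower = A[:, 2]    # contains "flower" and "petal"

raw_sim = cosine(query, doc_auto)                              # term overlap
lsi_sim = cosine(to_latent(query), to_latent(doc_auto))        # latent space
lsi_sim_flower = cosine(to_latent(query), to_latent(doc_flower))

print(raw_sim, lsi_sim, lsi_sim_flower)
```

In the original term space the query and the "automobile" document have no overlap, so their cosine is zero. In the two-dimensional latent space they fall on the same dimension, while the "flower" document stays on the other one.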

LSI can be viewed as a similarity metric that is an
alternative to word overlap measures. The latent semantic
space has fewer dimensions than the original space, so LSI
is also a method for dimensionality reduction.

Dimensionality reduction takes a set of objects that exist
in a high-dimensional space and represents them in a
lower-dimensional space instead.

The objects are often represented in a two- or
three-dimensional space purely for the purpose of
visualization.

LSI is the application of a mathematical technique known
as singular value decomposition (SVD) to a
term-by-document matrix.
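
The decomposition step can be sketched as follows, assuming Python with NumPy. The matrix values and the variable names doc_coords and term_coords are hypothetical, and k = 2 is chosen only because two dimensions are convenient to plot.

```python
import numpy as np

# Toy term-by-document count matrix (rows: terms, columns: documents).
# The values are invented for illustration only.
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

# Thin SVD: A = U @ diag(s) @ Vt, with singular values in s sorted descending.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k largest singular values to obtain the reduced space.
k = 2
doc_coords = (np.diag(s[:k]) @ Vt[:k, :]).T   # one k-dimensional row per document
term_coords = U[:, :k] * s[:k]                # one k-dimensional row per term

print(doc_coords.shape)   # (4, 2)
print(term_coords.shape)  # (5, 2)
```

Each row of doc_coords is a point that could be placed on a two-dimensional scatter plot. Scaling the coordinates by the singular values is one common convention, not the only one.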

The projection into the LSI space is chosen so that the
representations in the original space are changed as
little as possible, where the change is measured by the
sum of the squares of the differences.
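
Written out, the criterion is the squared Frobenius norm of the difference between the original matrix and its low-rank replacement. The symbols below are assumptions introduced here for illustration: A is the t-by-d term-by-document matrix, X ranges over matrices of rank at most k, and U, Sigma, V come from the SVD of A. That the truncated SVD attains the minimum is the Eckart-Young theorem.

```latex
% Objective: change the original representation as little as possible.
\hat{A} \;=\; \operatorname*{arg\,min}_{X \,:\, \operatorname{rank}(X) \le k}
  \; \lVert A - X \rVert_F^2 ,
\qquad
\lVert A - X \rVert_F^2 \;=\; \sum_{i=1}^{t} \sum_{j=1}^{d} \left( a_{ij} - x_{ij} \right)^2 .

% With A = U \Sigma V^{\top}, a minimizer keeps the k largest singular values:
\hat{A} \;=\; U_k \Sigma_k V_k^{\top},
\qquad
\lVert A - \hat{A} \rVert_F^2 \;=\; \sum_{i > k} \sigma_i^2 .
```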

Many different mappings from high-dimensional to
low-dimensional spaces are possible.

LSI chooses the mapping that is optimal in the sense that
it minimizes the distance between the original and the
reduced representations. Choosing the number of dimensions
is a problem in its own right.
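
A small numerical check of this optimality claim, assuming Python with NumPy. The random matrix stands in for a real term-by-document matrix, and the competing mapping (projection onto a random k-dimensional subspace) is just one arbitrary alternative picked for comparison. The helper squared_error is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((8, 6))      # stand-in for a term-by-document matrix
k = 2

# Rank-k mapping chosen by the SVD.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_svd = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# An arbitrary alternative: project onto a random k-dimensional subspace.
Q, _ = np.linalg.qr(rng.standard_normal((A.shape[0], k)))
A_rand = Q @ (Q.T @ A)

def squared_error(X):
    """Sum of squared differences between A and its approximation X."""
    return float(np.sum((A - X) ** 2))

svd_err = squared_error(A_svd)
rand_err = squared_error(A_rand)
print(svd_err, rand_err, svd_err <= rand_err)
```

No rank-k approximation can have a smaller squared error than the truncated SVD, so the comparison holds for any choice of alternative mapping, not just this random one.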

Reducing the number of dimensions can remove much of the
noise, but keeping too few dimensions may lose important
information. LSI performance improves considerably after
ten to twenty dimensions and peaks at seventy to one
hundred dimensions.

Beyond that peak, performance slowly begins to diminish
again. This pattern of performance is observed with other
datasets as well.
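
One side of this trade-off can be inspected directly: how much of the original matrix is retained as the number of dimensions grows. The sketch below assumes Python with NumPy and uses a random matrix as a stand-in, so it does not reproduce the retrieval results described above. It measures reconstruction only, not retrieval performance, and the list of k values is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((30, 20))   # random stand-in; real collections behave differently

# Singular values only; their squares sum to the squared Frobenius norm of A.
s = np.linalg.svd(A, compute_uv=False)
total = np.sum(s ** 2)

ks = [1, 2, 5, 10, 20]
retained = [float(np.sum(s[:k] ** 2) / total) for k in ks]

for k, share in zip(ks, retained):
    print(f"k={k:2d}  retained share of squared norm: {share:.3f}")
```

The retained share can only grow with k, which is why this measure alone cannot pick the number of dimensions. The peak described above shows up only when retrieval quality is evaluated on held-out queries.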