4

I am not sure if this is the right website to ask this question, but I can't figure out where else to get the answer, so please don't be mad :-)

As my bachelor thesis/project, I am trying to construct a small-scale search engine using the TF-IDF measure. At the moment I am working on the construction of my index. Usually a term-document matrix is used, where the rows represent terms (words) and the columns represent documents (web pages). My question is: why is this better than using a document-term matrix (which would just be the transposition of the term-document one)?

It seems to me that adding a new page to this matrix is the more common operation, and adding a new row is slightly easier than adding a new column (see the sketch below). I would be glad if you could suggest any article on this topic, because I have found none so far.
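To make the comparison concrete, here is a rough sketch of the two layouts as I picture them (I'm using scipy sparse matrices just as an example; the vocabulary and counts are invented):

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack

vocab = ["search", "engine", "index", "matrix"]  # the terms of the index

# Document-term layout: one row per document (page), one column per term.
doc_term = csr_matrix(np.array([
    [1, 1, 0, 0],   # page 1
    [0, 1, 2, 1],   # page 2
], dtype=float))

# Adding a newly crawled page = stacking one new row.
new_page = csr_matrix(np.array([[0, 0, 1, 3]], dtype=float))
doc_term = vstack([doc_term, new_page], format="csr")

# The term-document layout is just the transpose; adding the same page
# there means appending a column instead of a row.
term_doc = doc_term.T.tocsc()

print(doc_term.shape)   # (3, 4): 3 pages, 4 terms
print(term_doc.shape)   # (4, 3): 4 terms, 3 pages
```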

Thank you all for your answers in advance!

Smajl
    If the mods think it appropriate, they might migrate your question to Cross Validated or Stack Overflow. – russellpierce Feb 15 '13 at 09:46
  • A term document matrix is just the transpose of a document term matrix. Or at least that's how most text mining software that I'm familiar with deals with it. – Brandon Bertelsen May 28 '16 at 04:42

1 Answer

3

I think the answer here is going to be convention. The term-document matrix method I'm familiar with is called LSA (Latent Semantic Analysis). The data-reduction technique it uses, singular value decomposition, reduces the number of columns (documents) but keeps the number of rows (words). In the early stages of thinking about these things, the identity of the document was far less important than the identities of the words.
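For what it's worth, here is a tiny sketch of that reduction step using plain numpy (the term-document matrix below is invented, and the choice of k is arbitrary):

```python
import numpy as np

# A made-up term-document matrix: rows = words, columns = documents.
X = np.array([
    [2., 0., 1., 0.],
    [1., 1., 0., 0.],
    [0., 3., 1., 1.],
    [0., 0., 2., 1.],
    [1., 0., 0., 2.],
])  # 5 words x 4 documents

k = 2  # number of latent dimensions to keep
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# In the rank-k approximation, every word keeps its own row, but it is now
# described by k latent coordinates instead of raw per-document counts.
word_coords = U[:, :k] * s[:k]    # 5 words x k dimensions
doc_coords = Vt[:k, :].T * s[:k]  # 4 documents x k dimensions

print(word_coords.shape, doc_coords.shape)  # (5, 2) (4, 2)
```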