
I'd like to measure how much information a document $D$ contains.

Clearly, the New York Times published yesterday contains more information than the diary entry I wrote on the same day, but I do not know how to quantify that difference.

I can think of at least two candidates: information entropy and tf-idf.

In general, information entropy $H$ is the standard quantity for measuring information, so at first glance it seems suitable for measuring the information in a document as well.

For instance, let's compare two documents: $D_1 = \text{"Tom loves Mary. Tom loves Mary."}$ and $D_2 = \text{"Tom loves Mary. Jack loves Jane."}$. In this case, $D_1$ clearly contains less information than $D_2$, and $H(D_1) < H(D_2)$ holds.
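For concreteness, here is a minimal sketch of the empirical word-level entropy I have in mind (my own illustration, not a standard definition); it gives $H(D_1) = \log_2 3 \approx 1.58$ bits per word and $H(D_2) \approx 2.25$ bits per word:

```python
# Empirical word-level Shannon entropy of a document (a toy sketch, not a
# standard measure): treat the word frequencies as a probability distribution.
from collections import Counter
from math import log2

def empirical_entropy(text):
    """Shannon entropy of the empirical word distribution, in bits per word."""
    words = text.lower().replace(".", "").split()
    counts = Counter(words)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

d1 = "Tom loves Mary. Tom loves Mary."
d2 = "Tom loves Mary. Jack loves Jane."
print(empirical_entropy(d1))  # log2(3) ≈ 1.585 bits/word (3 equally frequent words)
print(empirical_entropy(d2))  # ≈ 2.252 bits/word (5 distinct words, "loves" appearing twice)
```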

Tf-idf is the second option. With tf-idf, rarer words are regarded as more informative. This also sounds reasonable; in fact, tf-idf is used to measure importance in automatic document summarization.
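To make that concrete, here is a toy sketch of tf-idf weighting (my own illustration; in practice one would use a library implementation):

```python
# A toy tf-idf sketch: words that are rare across the corpus get larger weights.
from collections import Counter
from math import log

def tfidf_weights(doc, corpus):
    """tf-idf weight of each word in `doc`, relative to the documents in `corpus`."""
    tokenized = [d.lower().split() for d in corpus]
    words = doc.lower().split()
    tf = Counter(words)
    weights = {}
    for word, count in tf.items():
        df = sum(1 for d in tokenized if word in d)   # number of documents containing the word
        idf = log(len(corpus) / df)                   # rarer across the corpus -> larger idf
        weights[word] = (count / len(words)) * idf    # tf * idf
    return weights

corpus = ["tom loves mary", "jack loves jane", "tom loves jane"]
print(tfidf_weights("tom loves mary", corpus))
# "mary" (in 1 of 3 documents) gets the largest weight;
# "loves" (in every document) gets weight 0.
```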

My questions are:

  • Is there a standard way to measure the information content of a document?
  • Are information entropy and tf-idf used for this purpose? Why or why not?

Update on Aug 17

Thanks to several kind comments, I have managed to clarify what my question really is. I'd like to know a formal (mathematical) definition of what "information" or "informative" means in the NYT-and-diary comparison.

Intuitively, the New York Times is more helpful than my diary for getting valuable information. In fact, many people pay $3 for the NYT, and most people would not pay anything for my diary.

However, I cannot explain formally (mathematically) why the NYT is more informative.

Hence, my question comes down to how to formally define "information" or "informative" in the NYT-vs-diary case. Questions in the comments like "what do you mean by information?" are exactly what I'd like to know :)

Light Yagmi

1 Answer


Assuming that your data comes from a Markovian source, you can estimate the entropy of the source using an optimal compression algorithm such as Lempel–Ziv, whose theoretical version (without limiting the table size) is known to asymptotically converge to the entropy. That is, if the entropy of the source (suitably defined) is $H$, then the expected compressed size of $n$ samples is roughly $nH$. Definitions and proofs appear in Cover and Thomas, Elements of Information Theory, Chapter 13 of the 2nd edition.

The entropy of the source is not the quantity computed in your example. It doesn't make sense to calculate the entropy of a single output – entropy is a function of a random variable (or a distribution), and the entropy of a constant random variable is zero. Instead, we consider your source text as a random variable, and our goal is to estimate the entropy of that random variable from a single sample.

If the source has no memory – that is, the individual symbols of the text are independent – then a good estimate of the entropy of the source is the empirical single-character entropy, which is the function that appears in your question. But general sources – such as natural-language text – do have memory: if you just sample a random list of characters according to their distribution in the English language, you will get gibberish (a nice example of this appears in Shannon's 1948 paper, p. 7). This is why we need to use a more sophisticated estimator, such as the Lempel–Ziv algorithm.
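As a quick illustration of that last point, here is a small sketch (my own, with a short placeholder standing in for a real English sample): drawing characters independently from their empirical frequencies produces gibberish, which is exactly the structure a single-character estimate cannot capture.

```python
import random
from collections import Counter

# Any English text will do; this short placeholder is just for illustration.
text = "the quick brown fox jumps over the lazy dog and the dog barks back"

counts = Counter(text)
chars = list(counts)
weights = [counts[c] for c in chars]

# First-order approximation: each character is drawn independently,
# with the right marginal frequencies but no memory of its neighbours.
sample = "".join(random.choices(chars, weights=weights, k=60))
print(sample)  # gibberish, despite the correct single-character statistics
```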

In practice, you can approximate the information content by just compressing the text using an off-the-shelf program.
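For example, a rough sketch along those lines (the file name is just a hypothetical stand-in for whatever document you want to measure) could compare bz2-compressed sizes, in bits per character:

```python
import bz2

def compressed_bits_per_char(text):
    """Size of the bz2-compressed text, in bits per input character."""
    data = text.encode("utf-8")
    return 8 * len(bz2.compress(data)) / len(text)

repetitive = "Tom loves Mary. " * 100          # highly repetitive text
varied = open("news_article.txt").read()       # hypothetical file with a news article
print(compressed_bits_per_char(repetitive))    # small: the repetition compresses away
print(compressed_bits_per_char(varied))        # larger: varied text compresses less
```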

Yuval Filmus