
What is the best way to summarize a long text that exceeds the 4096-token limit (a podcast transcript, for example)? As I understand it, I need to split the text into chunks, summarize each chunk, and then concatenate the results and summarize those. Is there already a popular open-source script to do that?

Do I understand correctly that GPT-3 is the best model for this? I've seen some articles about extractive summarization using BERT, but the results were pretty low quality.

Poma

1 Answer


Is there already a popular open-source script to do that?

The Python library GPT Index (MIT license) can summarize a large document or collection of documents with GPT-3.

From the documentation:

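from gpt_index import GPTTreeIndex
# `documents` is assumed to be a list of documents already loaded with GPT Index
index = GPTTreeIndex(documents)
response = index.query("<summarization_query>", mode="summarize")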

The “default” mode for a tree-based query is traversing from the top of the graph down to leaf nodes. For summarization purposes we will want to use mode="summarize".

A summarization query could look like one of the following:

  • “What is a summary of this collection of text?”
  • “Give me a summary of person X’s experience with the company.”

The documentation includes a notebook with complete examples: https://github.com/jerryjliu/gpt_index/blob/main/examples/paul_graham_essay/TestEssay.ipynb
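To make the tree idea concrete, here is a minimal, library-independent sketch of bottom-up tree summarization. It is not GPT Index's actual implementation: `build_summary_tree`, `num_children`, and the stand-in summarizer `first_words` are hypothetical names chosen for illustration, and in practice `summarize` would wrap a call to GPT-3 or another model.

```python
from typing import Callable, List


def build_summary_tree(chunks: List[str],
                       summarize: Callable[[str], str],
                       num_children: int = 4) -> str:
    """Summarize bottom-up: merge groups of nodes until one root summary remains."""
    if not chunks:
        raise ValueError("need at least one chunk")
    if num_children < 2:
        raise ValueError("num_children must be at least 2")

    # Leaf level: one summary per input chunk.
    level = [summarize(chunk) for chunk in chunks]

    # Each pass groups up to `num_children` summaries and summarizes the group,
    # so the number of nodes shrinks until only the root is left.
    while len(level) > 1:
        level = [
            summarize("\n".join(level[i:i + num_children]))
            for i in range(0, len(level), num_children)
        ]
    return level[0]


def first_words(text: str, n: int = 5) -> str:
    """Stand-in summarizer for demonstration only: keeps the first n words."""
    return " ".join(text.split()[:n])
```

Because every call to `summarize` sees at most `num_children` short summaries, each prompt can stay within the model's context limit as long as the individual chunks and summaries are short enough.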


Another Python library, LangChain (https://github.com/hwchase17/langchain, MIT license), offers a summarization chain. From the documentation:

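from langchain.chains.summarize import load_summarize_chain
# `llm` (a LangChain LLM wrapper) and `docs` (a list of documents) are assumed to be defined already
chain = load_summarize_chain(llm, chain_type="map_reduce")
chain.run(docs)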
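For intuition, the map-reduce approach (which is also what the question describes) can be sketched without any library. This is only an illustration under stated assumptions, not LangChain's implementation: `split_into_chunks` and `map_reduce_summarize` are hypothetical helpers, whitespace-separated words are used as a crude proxy for tokens, and `summarize` stands for whatever function calls the model.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int = 2000) -> List[str]:
    """Split text into chunks of at most max_words whitespace-separated words.

    Word count is only a rough proxy for the model's token count.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def map_reduce_summarize(text: str,
                         summarize: Callable[[str], str],
                         max_words: int = 2000) -> str:
    """Summarize each chunk (map), then summarize the joined summaries (reduce)."""
    chunks = split_into_chunks(text, max_words)
    if not chunks:
        return ""
    if len(chunks) == 1:
        return summarize(chunks[0])

    partial_summaries = [summarize(chunk) for chunk in chunks]  # map step
    combined = "\n".join(partial_summaries)
    # Reduce step. If `combined` is itself longer than the context limit,
    # the same procedure would have to be applied to it again.
    return summarize(combined)
```

A real token counter for the target model would give tighter chunk sizes than counting words.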

FYI {1,2} are two great papers looking at GPT-3 performance for summarization, but they only looked at short texts.

Update 2023-02-23: the next version of GPT may allow 32k tokens.


Update 2023-11-15: Interesting leaderboard for summarization of relatively short documents: https://github.com/vectara/hallucination-leaderboard



{2} compared human vs. LLM summarization.


References:

Franck Dernoncourt