This is the second blog post in the series “Revolutionizing Sentiment Analysis with GPT-4”. As its name would imply, this series dives into how we’ve pushed the bleeding edge of what’s possible with GPT-4 to create an AI that transforms the previously manual process of qualitative data analysis.
Summarization is a fundamental task in generative NLP. Coherent, truthful summarization is held up as a key performance metric for large language models and wider AI technology. For processing large corpora of user feedback data, summarization seems an obvious tool. However, it is more limited than it appears at first glance. If the goal is to take hundreds of thousands of data points and boil them down to a report for a human to read in less than an hour, summarization is not enough. Summarization, as a task, possesses critical weaknesses that can distort data and lead to flawed decision-making downstream.
Accurately modeling large corpora of text requires a fundamentally different approach, what I broadly refer to as analysis. By exploring several key concepts, I hope to show you that summarization and analysis are fundamentally different machine learning tasks, requiring different bodies of training data, and ultimately different models. Virtually no open source or out of the box ML solutions provide a readymade toolbox for feedback analysis.
Compression vs Addition
In pure information terms, summarization and analysis behave very differently. Summarization is a form of intelligent compression. When dealing with language models, we see this clearly illustrated by token counts.
A summary will always contain less information than the original input content. A model like Davinci is so large and well-trained that it can summarize at a wide range of compression ratios. Different summarization datasets have different implied compression ratios. If you explore smaller models on the Hugging Face model hub, you can get a feel for these differences.
Here are two examples, both with a Pegasus model fine-tuned on a single summarization dataset:
Pegasus fine-tuned on xsum - An input of:
The tower is 324 meters (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 meters (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 meters. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 meters (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct.
is compressed to a summarization of:
The Eiffel Tower is a landmark in Paris, France.
Now take the same model, and instead of training on the aggressive, high-compression Xsum dataset, train it on the more verbose cnn_dailymail dataset. In this scenario, the same input above is compressed less, into a longer summarization like this:
The tower is 324 meters (1,063 ft) tall, about the same height as an 81-story building, and the tallest structure in Paris. Its base is square, measuring 125 meters (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world.
The Xsum version compressed a 169-token input into a 13-token output, a 13-to-1 compression. The Pegasus model trained instead on the cnn_dailymail news dataset compresses those 169 tokens into a 73-token summary, only a 2.3-to-1 compression.
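The arithmetic behind these ratios is simple enough to sketch in a few lines:

```python
def compression_ratio(input_tokens: int, output_tokens: int) -> float:
    """Ratio of input size to output size; higher means a more aggressive summary."""
    if output_tokens <= 0:
        raise ValueError("output must contain at least one token")
    return input_tokens / output_tokens

# The two Pegasus fine-tunes discussed above:
print(round(compression_ratio(169, 13), 1))  # xsum-style -> 13.0
print(round(compression_ratio(169, 73), 1))  # cnn_dailymail-style -> 2.3
```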
It’s safe to say OpenAI’s most capable GPT-3 model, Davinci, has seen datasets at both of these ratios and many, many more. Davinci can reliably output summaries as aggressive as compressing 4,000 tokens down to 1 token (essentially a title), and as light as removing a single token from a 2,000-token input to render a 1,999-token output.
These compression characteristics can be controlled by simple instructions (exploiting Davinci’s instruction tuning), by few-shot examples written into the prompt, or by actually fine-tuning on a specific style of summary.
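The first two control mechanisms can be sketched as simple prompt builders. These function names and templates are illustrative, not Viable's actual prompts:

```python
def instruction_prompt(text: str, style: str = "one sentence") -> str:
    # Instruction-tuned models respond to plain-language length
    # constraints placed ahead of the input text.
    return f"Summarize the following text in {style}:\n\n{text}\n\nSummary:"

def few_shot_prompt(examples: list[tuple[str, str]], text: str) -> str:
    # Few-shot: each (input, summary) pair demonstrates the target
    # compression ratio and style for the model to imitate.
    shots = "\n\n".join(f"Text: {i}\nSummary: {s}" for i, s in examples)
    return f"{shots}\n\nText: {text}\nSummary:"
```

The third mechanism, fine-tuning, bakes the same demonstrations into the model weights instead of the prompt.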
From as low as 1.1-to-1 to a ratio as extreme as 4,000-to-1, summarization is inherently a compression of information. The versatility of a model like Davinci can obscure the fact that all these compression ratios and styles are really the same task, expressed through different datasets. It’s worth briefly looking at how common summarization datasets are designed. Each link will take you directly to the dataset's viewer on the Hugging Face Hub.
Xsum - Inputs are news articles and news article snippets. The outputs are one-sentence blurbs as you would see on the BBC News website, next to each article's lead image.
The CNN / DailyMail - Inputs are news articles. Outputs are a larger style of blurb, which can be observed on the CNN and DailyMail websites.
Samsum - Inputs are text message conversations. Outputs are hand-built, one or two sentence descriptions of the conversation in the third person.
Amazon US reviews - Inputs are the bodies of Amazon product reviews. Outputs are one sentence headlines that accompany those reviews, written by the authors of the reviews.
Reddit - Inputs are the bodies of reddit posts from various subreddits. Outputs are the post titles, written by the same poster.
Billsum - Here we see a dataset with multiple combinations of possible input and output. Available inputs are snippets of the bodies of congressional bills, or long summaries of those bill bodies. Available outputs are the summaries (bill bodies as input), or the titles of the bills (summaries as input). It’s not uncommon to see datasets that are pivotable like this, with multiple possible permutations of possible inputs and outputs.
Big Patent - Inputs are the sections of patents, usually elaborating in detail the various claims of the patent. Outputs are the high level abstracts of the patents that outline the concept more globally.
Scientific Papers - Here’s another multi-part dataset. Available inputs are the body of an entire section of a scientific paper, or the entire abstract of the paper. Available outputs are the abstract (with the body as input) or the section title (with the abstract as input).
Pubmed Summarization - Inputs are the article bodies of the pubmed pages. Outputs are the abstracts.
Summary From Feedback - Last, but perhaps most interestingly, we have OpenAI's own summary dataset. This includes input texts from other datasets, some of which are included here, but the outputs are Davinci summaries that have been ranked highly by a reward model. This mechanism of one model ranking the outputs of another model is the same technology used recently to great effect with ChatGPT.
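The "pivotable" datasets above, like Billsum, can be flattened into multiple training pairs from a single document. A minimal sketch, using hypothetical field names (`body`, `summary`, `title`) rather than the datasets' real schemas:

```python
def pivot_record(record: dict) -> list[tuple[str, str]]:
    """Flatten one pivotable document into multiple (input, output)
    summarization pairs at different compression ratios."""
    return [
        (record["body"], record["summary"]),   # long text -> medium summary
        (record["summary"], record["title"]),  # medium summary -> short title
    ]

bill = {"body": "Full bill text...", "summary": "A long summary.", "title": "Short title"}
pairs = pivot_record(bill)  # two training examples from one document
```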
To varying degrees of complexity, all these datasets attempt to harvest latent summary logic between two different sections of a larger document, be it a body and title, a section and an abstract, or a dialogue and description. This is the raw material from which essentially all summarization capability emerges. Understanding this demystifies what a summary is. It is not a complete modeling of a text, it is not an interpretation of a text, or a “deep reading.” A summary is a compression built mostly from harvesting organic inputs and outputs from vast corpora of noisy internet data. The upshot is that little new information enters the system. Summarization is not additive because the underlying task is principally concerned only with fitting the information in a larger body of text down to a smaller target. Summarization does not concern itself with any wider context outside that of the input text. To a model performing summarization, the world is shrunk down to the text itself, logic be damned.
But analysis is inherently additive.
If the summarization task is just compression, what is the analysis task?
Analysis is about bringing context from outside the text to bear upon it, in order to interpret it.
This doesn’t mean that analysis must be more verbose than the input text, but it does mean that new information that is not found in the input text must be added to the output, and thus the system as a whole.
Here’s a highly minimalistic example, taking the same Eiffel Tower blurb:
The tower is 324 meters (1,063 ft) tall, about the same height as an 81-story building, and the tallest structure in Paris. Its base is square, measuring 125 meters (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 meters. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 meters (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct.
Now let’s run one of the simplest analyses possible, an analysis for word count.
In the preceding blurb about the Eiffel Tower, there are 132 words, and 743 characters.
The original Eiffel Tower blurb does not include explicit information about its own word count and character count. The word and character counts are computed from the text and then added as novel information to the output. Even at this robotic level of analysis, new information has been introduced to the system of input and output, from external computational context.
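This kind of computational analysis is trivial to express in code; a minimal sketch:

```python
def count_analysis(text: str) -> str:
    """Produce a word-count analysis: information computed externally
    and added to the system, not present in the input text itself."""
    words = len(text.split())  # whitespace-delimited word count
    chars = len(text)
    return f"In the preceding blurb, there are {words} words, and {chars} characters."

print(count_analysis("The Eiffel Tower is a landmark in Paris, France."))
# -> In the preceding blurb, there are 9 words, and 48 characters.
```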
Because summarization is inescapably compressive, there is actually more net information in this combination of input text + simple word count than in the input + cnn_dailymail summarization output from earlier. A summarization “system” can only ever be an input and an imperfect facsimile of that input as output; analysis, by contrast, is additive, from something as minimal as a word count to something as complete as an essay.
Let’s try a historical timeline analysis of the Eiffel Tower blurb.
Beginning with the Great Pyramid in around 2560 BC, the timeline of tallest structures proceeds to Lincoln Cathedral, England (1311), to St. Mary's Church, Germany (1549), to Beauvais Cathedral, France (1569), to St. Mary's Church, Germany (1576), and in turn to Strasbourg Cathedral, France (1647). Upon its completion in 1884, the Washington Monument overtakes this long line of religious structures, standing 169 meters tall (554 feet). By 1889 it would be upstaged by the Eiffel Tower. The tower is 324 meters (1,063 ft) tall, about the same height as an 81-story building, and still the tallest structure in Paris. Its base is square, measuring 125 meters (410 ft) on each side. It was the first structure to reach a height of 300 meters. The Eiffel Tower would hold the record for 41 years until the Chrysler Building in New York City was finished in 1930. Only one year later, the Chrysler Building would be topped by the Empire State Building. Interestingly, due to the addition of a broadcasting aerial at the top of the tower in 1957, the Eiffel Tower became taller than the Chrysler Building by 5.2 meters (17 ft). The Empire State Building would hold its record for 36 years until beaten in 1967 by the Ostankino Tower in Moscow. In turn, Ostankino would be passed by the CN Tower in Toronto, Canada (1975). In 2004 the Eiffel Tower was overtaken as the tallest structure in France by the Millau Viaduct. Globally, the record for tallest structure would be won from the CN Tower, by the Burj Khalifa in Dubai (2007), the current record holder.
We can see in executing this analysis that 192 tokens of external information (the record holders and dates not present in the original blurb) have been added to the output – this is a 2-to-1 additive expansion of the input. This analysis retains all the information of the original text while placing it in a vast historical context, providing the reader with numerous insights, all rendered in chronological order.
While the first supremely simplistic example of analysis, the word count, added new information by executing a calculation on the text itself, the historical analysis requires the synthesis of large amounts of external world knowledge.
Logical Synthesis of World Knowledge
Comprehension requires context. Few corpora are complete enough to be fully understood without additional information about the people, places, things, terms, and ideas present in the text. This is doubly true for customer feedback. Academic texts, essays, papers, and Wikipedia entries are composed for coherence; context has been injected to maximize it. In these texts, we can fool ourselves into thinking a summary is performing reasoning, when it is really just compressing a high-quality input into a high-quality miniaturization.
Customer feedback represents a much more realistic domain, where imperfect organic data exists isolated from the context needed to best understand it. Such an applied domain is also where the inadequacies of the summarization task are easiest to demonstrate.
Below are 5 texts that have been clustered together from app reviews for the Pokemon Go game using an embedding-driven clustering algorithm. They concern a feature within the game. Observe that the name, Pokemon Go, appears nowhere in the actual text.
1. Why change something that was working great? If your goal is to have no one playing the game then your on the right path! Changing the incense effectiveness is terrible for everyone but especially the disabled. I guess you don't care.
2. Massively needed incense hurts us rural players a ton. Predatory to host a paid event where the main reward were incense to then nerf it in to uselessness 12 hours later. Shame on you.
3. The devs previously stated that they would give players a months notice before changing incense effectiveness. They went back on their promise and gave us less than a day before implementing said change. Why would they do this? Do they not care about rural players? (The answer is that of course they don't and never have.) The devs don't care about their player base. Watch and see, they'll nerf raids too!
4. They decided to completely nerf a PREMIUM PAID ITEM. Give us back useful incense! You show time and time again how little you care about your player base, despite the fact that these are the people who pay your paycheck. START ACTUALLY LISTENING TO WHAT THE PLAYERS WANT.
5. Boycott this one please everyone disabled players have been seriously impacted by the changes to incense the change in distance for gym and pokestop was a slap in the face but this incense change ruins the game for those of us that can't regularly or routinely get out there.
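The embedding-driven clustering that grouped these five reviews is not shown; as a rough sketch of the general idea, assuming embeddings have already been computed for each text (the greedy centroid grouping below is illustrative, not Viable's actual algorithm):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster(embeddings: list[np.ndarray], threshold: float = 0.8) -> list[list[int]]:
    """Greedy single-pass clustering: attach each vector to the first
    cluster whose centroid is within the similarity threshold,
    otherwise start a new cluster. Returns lists of text indices."""
    clusters: list[list[int]] = []
    for i, v in enumerate(embeddings):
        for c in clusters:
            centroid = np.mean([embeddings[j] for j in c], axis=0)
            if cosine(v, centroid) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Texts whose embeddings point in nearly the same direction — like five complaints about the same incense nerf — land in the same cluster even when they share few surface words.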
Now imagine a thousand such texts, and it’s easy to see the need for NLP when trying to comprehend such data en masse. Reading these individual reviews we are sorely deprived of the wider context of what all this means.
Let’s see how various models handle this tiny cluster out of the box. To keep the comparison fair between models, Davinci 003 will be prompted only with a simple zero-shot prompt like this:
Concisely summarize the following CLUSTER OF RELATED TEXTS:
CLUSTER OF RELATED TEXTS:
(the texts above)
DAVINCI 003 ZERO-SHOT SUMMARY:
The developers of a game have recently made changes to the effectiveness of incense, a premium paid item, without prior warning, despite promising to give a certain amount of days' notice. This change has negatively impacted players, especially those who are disabled, rural players, and those who rely on using incense to play the game. Players are frustrated and feel that the developers are not listening to their feedback.
Why change something that was working great?<n>If your goal is to have no one playing the game then your on the right path!<n>Massively needed incense hurts us rural players a ton.<n>The devs previously stated that they would give players a months notice before changing incense effectiveness.<n>They went back on their promise and gave us less than a day before implementing said change.
The developers changed the incense effectiveness and nerfed a pre-paid item without giving players a proper notice. The change hurts rural players the most. The developers don't care about their player base. They should listen to what the players want and change the game as they see fit.
Nerfing incense is a terrible idea and a huge slap in the face to the game's player base.
We can see that Davinci 003 handily outperforms the smaller models, capturing the key point of how the feature change affects disabled and rural players. All the other summaries ignore this crucial part of the data. However, even the Davinci summary leaves key questions about the data unanswered: What game is this? What exactly is incense, and why would nerfing it hit disabled and rural players hardest?
A natural response to these questions would be to say “well, the model must not know these facts.” Let’s query Davinci 003 directly for some of the relevant information.
PROMPT - If I say "Boycott this one please everyone disabled players have been seriously impacted by the changes to incense the change in distance for gyms", about a popular video game, what video game am I most likely to be playing?
It is difficult to determine the specific video game you are playing without more information. However, it is likely to be a popular mobile game with an augmented reality (AR) component, such as Pokémon Go or Harry Potter: Wizards Unite.
PROMPT - What is a pokestop?
A Pokestop is a location in the augmented reality game Pokemon Go where players can collect items such as Poke Balls, Potions, and Eggs. Pokestops are usually located at public places such as monuments, statues, and other landmarks.
PROMPT - In the context of Pokemon, what is "incense"?
Incense is an item in the Pokémon games that can be used to attract certain kinds of Pokémon. When used, it will cause a wild Pokémon of a certain species to appear more often than normal.
PROMPT - In the context of Pokemon Go, what is "incense"?
Incense is an item in Pokemon Go that attracts wild Pokemon to your location for a limited amount of time. When used, it emits a scent that lures wild Pokemon to your location for 30 minutes.
PROMPT - If you're disabled or living in a highly rural setting how would incense be useful when playing Pokemon Go?
Incense can be used to attract Pokemon to a specific location, which can be especially helpful for those who are disabled or living in a highly rural setting. Incense can be used to draw Pokemon to a player's location, allowing them to catch Pokemon without having to travel long distances. Additionally, incense can be used to attract rarer Pokemon, which can be beneficial for those who are unable to travel to more populated areas.
Davinci 003 knows all the specific terms in the text, and can even reason as to why the nerfing of the incense feature would negatively affect disabled and rural players. What this clearly shows us is that summarization and analysis are two entirely different tasks in ML terms. This failure to synthesize the needed world knowledge to create a better modeling of the data is the smoking gun for summarization’s flaws. To do more than simply compress input information, the model must be trained or prompted to synthesize world knowledge and draw logical conclusions from it.
This requires different data, different prompts, different pipelines, and potent mechanisms of quality control.
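One way to picture such a pipeline is to make knowledge synthesis an explicit first step before the analysis itself. A sketch only, with the model call stubbed out as a hypothetical `complete` function rather than any real API:

```python
def analyze_cluster(texts: list[str], complete) -> str:
    """Two-stage analysis: first ask the model to surface the entities
    and jargon in the cluster, then feed those definitions back in as
    explicit context for the final analysis pass."""
    joined = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(texts))
    # Stage 1: synthesize world knowledge about the terms in the texts.
    context = complete(
        "List and define the product, features, and jargon referenced "
        f"in these texts:\n{joined}"
    )
    # Stage 2: analyze the cluster with that knowledge made explicit.
    return complete(
        f"CONTEXT:\n{context}\n\n"
        "Using the context above, analyze the following CLUSTER OF RELATED "
        f"TEXTS, explaining the underlying causes of the feedback:\n{joined}"
    )
```

Quality control then operates on each stage independently, which is part of why analysis demands a more elaborate pipeline than a single summarization call.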
Let’s contrast the context-blind summary with a context-aware analysis that explains the texts in a more complete way.
The developers of the augmented reality video game Pokemon Go have recently made changes to the effectiveness of incense, a premium paid item that attracts pokemon to the location of the player for a limited time. They made this change without prior warning to the players, despite promising to give a certain amount of days' notice. This change has negatively impacted players, especially those who are disabled, because instead of being forced to walk around their neighborhood to find pokemon, incense allows them to capture pokemon from their homes. Similarly, rural players are affected because some pokemon only spawn naturally in a more populated area, incense allows them to attract these pokemon right to their homes as well. Players are frustrated and feel that the developers are not listening to their feedback.
This is a very simple analysis, far more basic than our historical timeline analysis for the Eiffel Tower text. It simply writes the unwritten, but obvious, implications of the cluster of texts. If educated humans were to be tasked with “summarizing” the cluster of texts, it’s very likely their summary would look like this topic analysis, unpacking the obvious underlying reasons for the existence of the texts themselves.
In the third and final post in this series, we’ll explore what it takes to get a large language model to reliably perform feedback analysis on larger and more complex clusters of texts. It will become clear that to fully model textual data in a faithful way requires a concert of difficult ML tasks all working together to properly apply knowledge, execute computation, and interpret proportionality within large heterogeneous corpora.
Want to see how our AI can analyze your qualitative data?
Sebastian Barbera is Viable’s Head of AI. Sebastian is an autodidact, having taught himself machine learning engineering and AI. He is an inventor with multiple software and hardware patents, and helped build Viable’s AI from Simple Unenriched Summarization to Multidimensional Analysis in a span of 17 months. Prior to Viable, Sebastian founded and ran his own company. He is based in New York City.