Abstract
Recent advancements in language processing have demonstrated the advanced capabilities of language models. Particularly noteworthy is the heightened prowess of pre-trained large language models in tackling tasks that were considered highly challenging only a few years ago, such as the abstractive summarization of dialogs. One approach to generating summaries involves engineering prompt templates. The simplest option is a static prompt, but it can lead to unreliable outcomes for different classes of dialogs. We implemented a scoring system to enhance the performance of few-shot training. It constructs finely tuned prompts composed of the dialog samples with the highest scores. The scoring process is grounded in a set of heuristics that specifically assess the structure and content of the dialogs. The use of the scoring system resulted in enhanced ROUGE scores and positive evaluations from human assessors. These promising results were consistently validated across all three large-scale datasets used in the testing phase.
Introduction
Language models have evolved significantly in recent years. Although task-specific models have proven to be highly effective in excelling at single tasks (Chai & Li, 2019; Klosowski, 2018), large language models (LLMs) have demonstrated the ability to handle a wide variety of NLP tasks without requiring supervised learning. Colossal models such as GPT-3 (Brown, 2020) and GPT-4 (OpenAI, 2024), built on the Transformer architecture (Radford & Narasimhan, 2018) with self-attention mechanisms (Guo, 2019; Vaswani, 2017), have already been widely adopted by thousands of developers.
One of the key reasons for their success is their ability to perform few-shot learning without weight updates (Brown, 2020), enabling the rapid development of applications across various domains, including classification, semantic search, content generation, and summarization. A challenging problem that can be addressed using state-of-the-art NLP technologies is dialog summarization. In this study, we will tackle the abstractive summarization, a generic and informative single-document summarization (Gupta & Gupta, 2019). For this purpose, we propose a prompt engineering approach based on a set of heuristics.
To take advantage of LLMs, it is crucial to craft effective prompts. This necessity has led to the development of the field of prompt engineering (Chen et al., 2023a). Among the various prompting strategies, the most straightforward are known as vanilla prompts. These prompts involve adding specific instructions, such as appending phrases like
Enhancing prompts can be achieved through a technique known as few-shot prompting, in which a small number of solved examples of the task are included in the prompt.
Improving the performance of few-shot training is a problem of interest because there can be instabilities in LLM performance due to the way the prompt is chosen (Lu, 2021). A solution based on contextual calibration has been proposed for tasks such as text classification, fact retrieval or information extraction (Zhao, 2021). Prompt-based tuning has been shown to increase language model performance (Gao et al., 2021; Hu, 2021). It is known that GPT models are multitask learners (Radford et al., 2019), but different tasks demand different prompt tuning approaches.
The goal of this work is to find the best way of building the prompts for dialog summarization in a few-shot training regime by selecting the most effective training examples. Therefore, our main contribution is the design of a set of heuristic functions that optimize the choice of the samples included in the prompt. This kind of approach, known as in-context learning, has been proven to be successful in enhancing the performance of the model (Liu et al., 2021; Wang et al., 2024).
We propose a simple but efficient content-, size-, and interlocutor-based scoring system (CSIS) based on the following heuristics:
In the next section, we discuss related work. The heuristics presented above are implemented according to the methods presented in Section 3. The experimental set-up is described in Section 4. In Section 5, we present our results. The performance evaluation is based on the ROUGE scores (Lin, 2004). A human evaluation is conducted to cross-check the ROUGE scoring. The conclusions are stated in the last section.
Related Work
Research on dialog summarization began to gain popularity a few years ago (Feng et al., 2021c). There was an impressive increase in datasets for the summarization of chat conversations (Chen et al., 2021; Gliwa et al., 2019; Mehnaz, 2021; Zhu et al., 2021) and several models were developed (Chen & Yang, 2021; Chien-Sheng, 2021; Dong et al., 2019; Fabbri et al., 2021a; Feng, 2021a, 2021b; Jingqing, 2020; Mike, 2020; Shashi, 2021; You & Ko, 2023; Zhang, 2020; Zhao et al., 2020). Some models, such as the one proposed by Chen et al. (2023b), had undergone pre-training to function in a few-shot setting, albeit requiring subsequent training that entails costs and resource investments. However, several models could also run in a zero-shot setting. Zhao et al. (2022) proposed a model designed for zero-shot summarization, utilizing domain-oriented prefixes.
Another research direction in this field focuses on retrieval-augmented generation (RAG) models, introduced in Lewis et al. (2020). These models combine pre-trained parametric memory with non-parametric memory accessed through dense vector retrieval, improving factuality and diversity in language generation for knowledge-intensive tasks. Based on RAG, a dialog modeling solution was proposed by Kumari et al. (2023). The authors used instruction prompts and retrieval-based context augmentation to adapt LLMs for extended conversations, without fine-tuning, enhancing dialog generation by efficiently incorporating relevant context from past interactions.
Prompt-tuning techniques based on negation understanding and name substitution for dialogs with multiple participants were investigated by Khalifa et al. (2021). Li et al. (2022) introduced prompt learning for dialog summarization, and another group later proposed a recursive prompting method involving consecutive calls of the model (Adams et al., 2023). These methods distinguish themselves by sidestepping model fine-tuning, keeping the model's weights frozen.
The study described by Tang et al. (2023) introduced a controlled dialog summarization framework leveraging TF-IDF-based entity selection (Qaiser & Ali, 2018; Sammut & Webb, 2010), length constraints, and personal named entity planning to align outputs with user-provided signals. Their approach incorporated these signals into the prompt to constrain the output of the model. Another work focused on selecting relevant utterances by inserting special tokens (Italiani et al., 2024). For this purpose, they fine-tuned another language model to assess the relevance of each utterance and determine whether or not it should appear in the generated summary. Summarization of long dialogs was studied as well by Zhang et al. (2021). Lengthy dialogs often exceed the input limits of language models. To address this issue, the authors proposed retrieving parts of these dialogs to create shorter input prompts, reducing their length to 10% of the original. They used algorithms based on TF-IDF and BM25 (Robertson et al., 1995) to shorten the dialogs. A similar approach was analyzed in another work (Zhong et al., 2021), where the authors proposed the Locator model, which retrieves relevant sequences from long meeting dialogs.
Methods
To make the best picks of dialogs for few-shot training, we consider the following features: content, token count, and attendance. Each of them is characterized by a corresponding score (
Content
The content evaluation is based on the TF-IDF approach, chosen for its simplicity and adaptability across different datasets. Different variants of this approach can be further developed depending on the context. To define the tokens used in the TF-IDF computation, we employed the BERT tokenizer (Devlin et al., 2019) from the Transformers library (Wolf, 2020). This choice was primarily driven by implementation convenience and seamless integration into our existing pipeline. The choice of tokenizer does not significantly impact the model's performance in this context, as the TF-IDF metric tends to produce similar results regardless of the tokenizer used. Thus, we find the token distribution for each dialog and compute the TF-IDF weights,
The content score,
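As a concrete illustration of the content evaluation just described, the sketch below implements plain TF-IDF weighting with cosine similarity between dialogs. It uses a whitespace split in place of the BERT tokenizer, and the function names are ours, not the paper's:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute sparse TF-IDF weight vectors for a list of documents.

    Tokenization here is a simple lowercase whitespace split; the paper
    uses the BERT tokenizer, but as noted above the TF-IDF metric tends
    to behave similarly across tokenizers."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency of each token
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    idf = {t: math.log(n / df[t]) for t in df}
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: (tf[t] / len(toks)) * idf[t] for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

In this sketch, the content score of a candidate dialog would be its cosine similarity to the input dialog's TF-IDF vector; a dialog sharing more distinctive terms with the input scores higher.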
Token Count
Let us consider a dataset of dialogs,
The last condition increases the chances of choosing shorter conversations if there are multiple conversations with similar content. There is no reason to allow an increase in ATC if the content score is the same. If
One reaches the maximum score by picking a conversation with an equal token count. If such a sample cannot be found, we try to find another sample with a small error. Short conversations are advantageous for creating prompts with a low ATC. To prioritize shorter dialogs while still allowing longer ones to be selected occasionally, we require an asymmetrical curve that adjusts selection probabilities dynamically. For this purpose, we propose the asymmetrical double sigmoid (
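One common way to realize an asymmetrical double sigmoid is as the product of a rising and a falling logistic curve with independent slope parameters; the form and parameter names below are illustrative assumptions, not necessarily the exact function and values used in the paper:

```python
import math

def ads(x, center, width, s_left, s_right):
    """Asymmetrical double sigmoid: product of a rising and a falling sigmoid.

    `center` is the target token count (maximum score region), `width`
    controls the plateau, and the two independent slope parameters
    (s_left, s_right) make the curve asymmetric, so dialogs on one side
    of the target decay faster than those on the other. All parameter
    names and this exact form are assumptions for illustration."""
    rise = 1.0 / (1.0 + math.exp(-(x - center + width / 2.0) / s_left))
    fall = 1.0 - 1.0 / (1.0 + math.exp(-(x - center - width / 2.0) / s_right))
    return rise * fall
```

With unequal slopes, a dialog a fixed distance below the target token count receives a different score than one the same distance above it, which is exactly the asymmetry needed to bias selection toward one side.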
The Free Parameters of the Asymmetrical Double Sigmoid Function Used in Finding the Length Similarity Coefficient.
The number of participants attending the conversation is also a key factor. As described in Bokaei (2016), one obtains better results when the summarization system takes into account the number of persons involved in the discussion. For instance, conversations between two people are typically easier to summarize than those involving more participants. In this case, we should prioritize prompt examples with only two interlocutors, to maintain the same level of coherence as stated in the third heuristic. We have designed a function that sets a maximum score when the two dialogs have the same number of participants. The score decreases based on the relative difference in the number of participants,
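A minimal sketch of such an attendance score follows, assuming an exponential decay in the relative difference of participant counts; the decay form and the `alpha` parameter are our illustration, not the paper's exact formula:

```python
import math

def attendance_score(n_candidate, n_input, alpha=1.0):
    """Score a candidate dialog by how closely its number of participants
    matches the input dialog's.

    Maximum (1.0) for an exact match; decays exponentially with the
    relative difference. `alpha` (an assumed parameter) controls how
    sharply mismatched participant counts are penalized."""
    rel_diff = abs(n_candidate - n_input) / n_input
    return math.exp(-alpha * rel_diff)
```

For a two-person input dialog, this assigns candidates with two interlocutors the top score and progressively down-weights dialogs with more participants, matching the prioritization described above.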
Weights
The final score is given by the weighted sum of the three scores,
We perform tests for several weights while keeping the same GPT-3 configuration. By configuration, we refer only to the engine, the number of shots, and the temperature value, set to 0. The temperature controls how randomly the engine chooses the tokens in the generated summary: a low temperature implies picking the most probable tokens predicted by the model, whereas a higher temperature produces larger fluctuations in the results and in the ROUGE scores we obtain. In our experiments, we use zero temperature to avoid fluctuations.
We used the GPT-3 API provided by OpenAI (Brown, 2020) to access the GPT-3 engines (text-curie-001, curie, curie-instruct-beta, ada, babbage).
Datasets
We briefly describe the datasets on which we performed the experiments. All of them are commonly used for training dialog summarization models:
We perform preliminary processing for the dialogs used as shots in the prompt—the selection pool (SP). We construct the term frequency matrix and compute the inverse document frequency coefficients. Additionally, we retrieve the token count and the number of people participating in the conversation for each dialog in SP. In our experiments, the SP comprises between 12,000 and 14,000 dialogs, depending on the dataset. We perform the tests using input dialogs from the testing part of each dataset. The data reduction of the SP is performed only once before testing. For each SP dialog, we compute and save the above-mentioned features.
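The per-dialog feature extraction described above can be sketched as follows; the "Name: utterance" transcript format and the whitespace token count are simplifying assumptions (the paper counts tokens with the BERT tokenizer):

```python
def dialog_features(dialog_text):
    """Extract the per-dialog features cached for each selection-pool entry:
    token count and number of distinct speakers.

    Assumes one 'Speaker: utterance' line per turn and a whitespace token
    count; both are simplifications of the paper's actual pipeline."""
    speakers = set()
    tokens = 0
    for line in dialog_text.strip().splitlines():
        speaker, sep, utterance = line.partition(":")
        if sep and utterance:
            speakers.add(speaker.strip())
        tokens += len(line.split())
    return {"token_count": tokens, "num_speakers": len(speakers)}
```

Because these features are computed once per SP dialog and stored, scoring a new input dialog against the entire pool requires no re-tokenization at query time.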
Prompt Engineering
For each input dialog, CSIS assigns a score by comparing its features to each entry in the SP. Then, we retrieve the content and the summary for the highest k-scored dialogs, where
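Assuming the three component scores have already been computed for each SP entry against the input dialog, the ranking-and-retrieval step can be sketched as below; the entry keys and weight tuple are illustrative names, not the paper's:

```python
import heapq

def select_shots(pool, weights, k):
    """Rank the selection pool by the weighted CSIS score and return the
    top-k entries (each carrying its dialog text and reference summary).

    `pool` is a list of dicts with precomputed component scores relative
    to the current input dialog; key and parameter names are assumptions."""
    w_content, w_size, w_attendance = weights
    scored = []
    for entry in pool:
        score = (w_content * entry["content_score"]
                 + w_size * entry["size_score"]
                 + w_attendance * entry["attendance_score"])
        scored.append((score, entry))
    # nlargest with an explicit key avoids comparing the dict payloads
    top = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [entry for _, entry in top]
```

The returned entries supply the (dialog, summary) pairs that become the k shots of the prompt.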

A diagram that illustrates the overall scoring process and prompt generation.
The prompt includes the following components: an instruction hint, the selected samples, the input dialog, and the delimiters. We provide an example of a two-shot prompt in Appendix A.
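A minimal sketch of how these components might be assembled is given below; the exact instruction wording and delimiter tokens are assumptions, and the paper's actual two-shot prompt appears in Appendix A:

```python
def build_prompt(instruction, shots, input_dialog, delimiter="\n###\n"):
    """Assemble a few-shot prompt: instruction hint, then each selected
    (dialog, summary) pair, then the input dialog awaiting its summary.

    The 'Dialog:'/'Summary:' labels and the '###' delimiter are
    illustrative choices, not necessarily those used in the paper."""
    parts = [instruction]
    for dialog, summary in shots:
        parts.append(f"Dialog:\n{dialog}\nSummary:\n{summary}")
    # The input dialog ends with an open 'Summary:' for the model to complete
    parts.append(f"Dialog:\n{input_dialog}\nSummary:")
    return delimiter.join(parts)
```

Ending the prompt with an open "Summary:" label cues a completion-style model such as GPT-3 to generate the summary of the final dialog.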
The function that computes the size coefficient significantly contributes to cost minimization. Similar to the standard deviation or the full width at half maximum (FWHM), an interval can be defined within which dialog picks are scored highly. The asymmetrical double sigmoid proposed in the previous section (ads broad) exhibits a larger FWHM of 85.6 tokens. During our empirical study, we performed preliminary experiments on the SAMSum Corpus dataset to analyze different functions for obtaining the length similarity coefficient. In Table 2, we compare our proposed function with the following options:
- Random choice (no cost constraints)
- A normal distribution with
- An asymmetrical double sigmoid with an FWHM of 61.8 tokens (ads narrow)
- A piecewise function for which
The results are derived by averaging the corresponding ATC and ROUGE scores of the outputs of 100 two-shot prompts using CSIS with
Comparison of Different Functions for Computing the Length Similarity Coefficient on the SAMSum Corpus Dataset.
We run several experiments with GPT-3 text-curie-001 for 50 tests each to identify the best set of weights. During these experiments, CSIS uses the ads-broad for the token count function and the exponential function for attendance, as described in the previous section. We cannot decide on a single configuration of the weights because the tests are not numerous enough to draw statistically significant conclusions. We show the experimental results in Appendix B. Further experiments are performed on a smaller set of weights, by running 200 tests on each configuration. The results are presented in Table 3. In the last row, we show the scores obtained by using random prompts (no feature scoring is used). An increase in ROUGE scores is observed when using CSIS for all three datasets. The effect of CSIS is most evident in the datasets containing longer dialogs (DSd and MSd; the token count distribution can be seen in Figure 2). The ATCs of the datasets,
Experimental Results—Weights Configuration 200 Tests.
Note. SCd = SAMSum Corpus; DSd = DialogSum; MSd = MediaSum.

Datasets statistics on token counts.
Performance Evaluation of CSIS
Experimenting on Different Datasets
We experiment with different numbers of prompt-tuning shots. As expected, the performance increases almost every time more tuning samples are provided in the prompt. The left subfigure of Figure 3 shows the ROUGE-1 score as a function of the number of shots. We repeated the experiment for each dataset. CSIS results consistently outperform their “random picks” counterparts for all datasets and shot configurations, on average by 1.1 (SCd), 1.7 (DSd), and 7.3 (MSd) ROUGE-1 score units, demonstrating the effectiveness of the proposed scoring system in selecting better prompts. For MSd, while the absolute ROUGE-1 scores are lower compared to the other datasets, the use of CSIS results in a more significant relative improvement over random picks, particularly at three shots. MSd has a significantly larger mean token count (

The left subfigure shows the content-, size-, and interlocutor-based scoring system (CSIS) performance evaluations with respect to the baseline of randomly generated prompts. The weight configurations are those highlighted in Table 3. The right subfigure presents the results obtained for different GPT-3 engines. In both cases, the evaluation is given by ROUGE-1 scores (scaled between 0 and 1).
We evaluate the performance of the ada, babbage, curie, and curie-instruct-beta engines of GPT-3 with and without CSIS. As expected and shown in the right subfigure of Figure 3, the performance decreases as we move from larger engines (curie-instruct-beta) to smaller ones (ada), the ROUGE-1 score dropping by 12–18 units. Despite the performance drop with smaller engines, the use of CSIS consistently enhances results across all engines, showing its effects in both resource-constrained and larger-scale scenarios. The best improvement, of 2.1 ROUGE-1 score units on average, is seen for the ada model, whereas for the curie-instruct-beta model we notice an improvement of 1.1 ROUGE-1 score units on average.
Comparison to Fine-Tuned Models
We fine-tune a GPT-3 curie model for four epochs to compare its performance to that of the proposed prompt-tuning method. Different SPs between 50 and 2000 SCd samples are used for training the fine-tuned model. The results illustrated in Figure 4 show that fine-tuning outperforms our prompt-tuning method by 0.07 ROUGE-1 score units on average. Additionally, fine-tuning can demand more computational resources. In four epochs, training requires spending between 45 k and 1600 k tokens depending on the SP size (the training set), whereas prompt-tuning spends only 221 tokens per query on average. Thus, approximately 400 queries based on prompt tuning can run using the same number of tokens elapsed in training the model on 100 samples. However, fine-tuning can be a good investment if the fine-tuned model is used over a long period in a stable environment (meaning that the SP samples remain efficient as training examples after a period of time). Choosing between prompt-tuning and fine-tuning is a decision that should be analyzed depending on the targeted performance, the costs, and the expected number of queries the model will process.

Comparison between prompt-tuning and fine-tuning performances with respect to the selection pool size.
In addition, using the same SPs, we fine-tune a T5 model (Raffel et al., 2020) with 60 M parameters (t5-small), a BART model (Lewis et al., 2019) with 406 M parameters (bart-large-cnn), and a GPT-2 model (Radford et al., 2019) with 124 M parameters. We test the fine-tuned models on the same subset of 100 samples. In the case of the T5 (t5-small) and BART (bart-large-cnn) models, one can observe performance comparable to that of our approach.
The SP for each dataset remains unchanged during the experiments. However, to investigate whether CSIS results depend on the SP, we perform a series of different experiments that also include a significant amount of foreign data.
We mix SCd and DSd data to create a heterogeneous SP. The new SP consists of 54% SCd samples (14,732) and 46% DSd samples (12,460). We run the model again for 100 tests with a curie-instruct-beta engine and the weights
The ROUGE Scores, Average Scores Provided by CSIS (
), and ATC for Different Selection Pools and Numbers of Shots used in the Training.
Note. ATC = average token count; CSIS = content-, size-, and interlocutor-based scoring; SP = selection pool; SCd = SAMSum Corpus; DSd = DialogSum.
Software applications can be developed using the CSIS approach we propose. To do so, one must consider how to define the SP. Different categories of users may use distinct SPs. Depending on the users' behavior, the SP will also have to change over time. Consequently, a solution based on a dynamic SP can be implemented. User feedback can also be used to update the SP in two ways: (i) by providing better summaries and (ii) by setting up several preferences. Methods for gathering data to improve or form new SPs could be a continuation of this work.
We remove one component of CSIS at a time and perform a series of tests varying the weights of the remaining features. These experiments are meant to prove the relevance of each feature considered by CSIS. We plot the ROUGE-1 score difference between a reference score and the new score,

Results of the content-, size-, and interlocutor-based scoring system (CSIS) ablation study: the ROUGE score difference (
To examine whether the models can leverage information from all dialogs provided in our prompts, we conducted an experiment analyzing how the removal of parts of the few-shot example dialogs affects summarization quality. Specifically, we randomly removed parts of the input dialogs and evaluated the performance using ROUGE-1 F1 score differences between the incomplete prompt and the full original prompt,
We show the results in Table 5. The tests are grouped by intervals of token amounts removed. Performance generally declines for SCd and DSd when data is omitted, highlighting the importance of providing complete and well-structured few-shot examples for LLMs to achieve optimal summarization quality. On the other hand, we do not notice a substantial difference for MSd. One plausible explanation is that most of the dialogs in this dataset are much longer than in the other datasets, so removing a part of the prompt equivalent to 250 tokens does not have a significant impact.
Impact of Utterances Removal on ROUGE-1 F1 Scores Across Different Datasets (SCd, MSd, DSd). The Table Shows the Number of Tokens Removed, the Corresponding Changes in ROUGE-1 F1 Scores, and the Number of Tests for Each Interval of Removed Tokens.
Note. SCd = SAMSum Corpus; DSd = DialogSum; MSd = MediaSum.
We consider a human evaluation necessary because the main known problem of dialog summarization is that the summary may provide incorrect references (Chen & Yang, 2020; Feng et al., 2021c) that distort the information from the original dialog, a phenomenon also known as hallucination (Yichong, 2021; Zheng, 2020). These hallucinations cannot be identified through ROUGE scores, so it is imperative to have a second evaluation to validate the ROUGE results. The human evaluation is based on the following four criteria: coherence, which relates to the overall quality of the sentences and how well the summary is structured; consistency, which measures the factual information transfer; fluency, which scores the quality of individual sentences (i.e., grammar, formatting, and so forth); and relevance, which shows how important the selected information is, with an excess of information being penalized. These criteria have been used in other studies as well, e.g., Fabbri (2021b). We ask the annotators to rate different summaries for 100 dialogs on a scale from 1 to 5. The evaluation is blind, including the reference summary from the SAMSum dataset and the summaries generated by GPT-3's curie-instruct-beta (with two shots,
Example of Summary Scored low by ROUGE Metrics, but Evaluated as very Good by Human Judgment.
Human Evaluation Results.
The main reason for conducting a human evaluation is that ROUGE scores are not always reliable. We calculate the Pearson coefficient to evaluate the correlation between human judgment and ROUGE scores (Table 8). The scores indicate a weak correlation between the evaluation results. There are many examples, such as the one provided in Table 6, where very good summaries are rated with low ROUGE scores, namely at least 10 ROUGE-1 score units below the average. This is due to ROUGE's reliance on exact word matching, which can penalize synonyms or paraphrases that are semantically similar but not an exact match. This limitation is well documented in the literature (Nguyen et al., 2024; Schluter, 2017).
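To illustrate the exact-match limitation, a bare-bones ROUGE-1 F1 sketch (plain unigram overlap, without the stemming and preprocessing of full ROUGE implementations) shows how a faithful paraphrase can score poorly:

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap ROUGE-1 F1 between two strings.

    A minimal re-implementation for illustration: lowercase whitespace
    tokenization, no stemming or stopword handling."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, against the reference "tom will pick up anna at eight", the paraphrase "tom is giving anna a ride this evening" shares only two unigrams and scores well below 0.3 despite conveying the same fact, which is exactly the failure mode the human evaluation is meant to catch.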
Pearson Correlation Coefficients Between Human Evaluation Results and ROUGE Metrics for Two set-ups: GPT-3 with content-, size-, and interlocutor-based scoring system (CSIS) and Without it (Random Prompts).
In this study, we investigated a prompt-based approach to improve the abstractive summarization of dialogs. We showed that:
- CSIS consistently outperforms random prompts across different scenarios, achieving an average increase in ROUGE-1 scores of
- Fine-tuning the GPT-3 curie model on SCd yields better results than prompt-tuning with GPT-3 curie using SCd as SP (by
- The SP choice plays a role in the model's performance. Mixing datasets from different sources (e.g., SCd and DSd) results in slightly better performance in one-shot settings. This improvement can be attributed to the increased diversity of samples and a broader range of choice options, allowing the model to adapt better to varied input characteristics.
- The ablation study confirms that all components of CSIS contribute positively to its performance. While size has the smallest impact in SCd due to its narrow token range, attendance is less impactful in more complex datasets like MSd and DSd, highlighting dataset-specific feature relevance.
- We evaluated CSIS using ROUGE metrics and by conducting a human evaluation. In our experiments, small variations in the average ROUGE score corresponded to large discrepancies in the scores given by the annotators. However, both evaluations showed that applying CSIS increases the quality of the summaries. The human evaluation showed us that choosing similar dialog samples in creating the prompt increases the quality of the summaries, reducing the number of failures by
Limitations
The current method is applicable exclusively to LLMs that support few-shot training. It may be applied in selecting fine-tuning samples as well, but there should be a specific demand—a reference set of dialogs on which the scoring happens. The efficiency of the scoring system may be compromised when input dialogs significantly differ from the samples used to construct the prompt.
