Abstract
Recent advancements in language processing have demonstrated the advanced capabilities of language models. Particularly noteworthy is the heightened prowess of pre-trained large language models in tackling tasks that were considered highly challenging only a few years ago, such as the abstractive summarization of dialogs. One approach to generating summaries involves engineering prompt templates. The simplest option is a static prompt, but it can lead to unreliable outcomes for different classes of dialogs. We implemented a scoring system to enhance the performance of few-shot training. It constructs finely tuned prompts composed of the dialog samples with the highest scores. The scoring process is grounded in a set of heuristics that specifically assess the structure and content of the dialogs. The use of the scoring system resulted in enhanced ROUGE scores and positive evaluations from human assessors. These promising results were consistently validated across all three large-scale datasets used in the testing phase.
Introduction
Language models have evolved significantly in recent years. Although task-specific models have proven to be highly effective in excelling at single tasks (Chai & Li, 2019; Klosowski, 2018), large language models (LLMs) have demonstrated the ability to handle a wide variety of NLP tasks without requiring supervised learning. Colossal models such as GPT-3 (Brown, 2020) and GPT-4 (OpenAI, 2024), built on the Transformer architecture (Radford & Narasimhan, 2018) with self-attention mechanisms (Guo, 2019; Vaswani, 2017), have already been widely adopted by thousands of developers.
One of the key reasons for their success is their ability to perform few-shot learning without weight updates (Brown, 2020), enabling the rapid development of applications across various domains, including classification, semantic search, content generation, and summarization. A challenging problem that can be addressed using state-of-the-art NLP technologies is dialog summarization. In this study, we will tackle the abstractive summarization, a generic and informative single-document summarization (Gupta & Gupta, 2019). For this purpose, we propose a prompt engineering approach based on a set of heuristics.
To take advantage of LLMs, it is crucial to craft effective prompts. This necessity has led to the development of the field of prompt engineering (Chen et al., 2023a). Among the various prompting strategies, the most straightforward are known as vanilla prompts. These prompts involve adding specific instructions, such as appending phrases like
Enhancing prompts can be achieved through a technique known as few-shot prompting, in which a small number of solved examples of the task are included in the prompt.
Improving the performance of few-shot training is a problem of interest because there can be instabilities in LLM performance due to the way the prompt is chosen (Lu, 2021). A solution based on contextual calibration has been proposed for tasks such as text classification, fact retrieval or information extraction (Zhao, 2021). Prompt-based tuning has been shown to increase language model performance (Gao et al., 2021; Hu, 2021). It is known that GPT models are multitask learners (Radford et al., 2019), but different tasks demand different prompt tuning approaches.
The goal of this work is to find the best way of building the prompts for dialog summarization in a few-shot training regime by selecting the most effective training examples. Therefore, our main contribution is the design of a set of heuristic functions that optimize the choice of the samples included in the prompt. This kind of approach, known as in-context learning, has been proven to be successful in enhancing the performance of the model (Liu et al., 2021; Wang et al., 2024).
We propose a simple but efficient content-, size-, and interlocutor-based scoring system (CSIS) based on the following heuristics:
In the next section, we discuss related work. The heuristics presented above are implemented according to the methods presented in Section 3. The experimental set-up is described in Section 4. In Section 5, we present our results. The performance evaluation is based on the ROUGE scores (Lin, 2004). A human evaluation is conducted to cross-check the ROUGE scoring. The conclusions are stated in the last section.
Related Work
Research on dialog summarization began to gain popularity a few years ago (Feng et al., 2021c). There was an impressive increase in datasets for the summarization of chat conversations (Chen et al., 2021; Gliwa et al., 2019; Mehnaz, 2021; Zhu et al., 2021) and several models were developed (Chen & Yang, 2021; Chien-Sheng, 2021; Dong et al., 2019; Fabbri et al., 2021a; Feng, 2021a, 2021b; Jingqing, 2020; Mike, 2020; Shashi, 2021; You & Ko, 2023; Zhang, 2020; Zhao et al., 2020). Some models, such as the one proposed by Chen et al. (2023b), had undergone pre-training to function in a few-shot setting, albeit requiring subsequent training that entails costs and resource investments. However, several models could also run in a zero-shot setting. Zhao et al. (2022) proposed a model designed for zero-shot summarization, utilizing domain-oriented prefixes.
Another research direction in this field focuses on retrieval-augmented generation (RAG) models, introduced in Lewis et al. (2020). These models combine pre-trained parametric memory with non-parametric memory accessed through dense vector retrieval, improving factuality and diversity in language generation for knowledge-intensive tasks. Based on RAG, a dialog modeling solution was proposed by Kumari et al. (2023). The authors used instruction prompts and retrieval-based context augmentation to adapt LLMs for extended conversations, without fine-tuning, enhancing dialog generation by efficiently incorporating relevant context from past interactions.
Prompt-tuning techniques based on negation understanding and name substitution for dialogs with multiple participants were investigated by Khalifa et al. (2021). Li et al. (2022) introduced prompt learning for dialog summarization, and another group later proposed a recursive prompting method involving consecutive calls of the model (Adams et al., 2023). These methods distinguish themselves by sidestepping model fine-tuning, keeping the model's weights frozen.
The study described by Tang et al. (2023) introduced a controlled dialog summarization framework leveraging TF-IDF-based entity selection (Qaiser & Ali, 2018; Sammut & Webb, 2010), length constraints, and personal named entity planning to align outputs with user-provided signals. Their approach incorporated these signals into the prompt to constrain the output of the model. Another work focused on selecting relevant utterances by inserting special tokens (Italiani et al., 2024). For this purpose, they fine-tuned another language model to assess the relevance of each utterance and determine whether or not it should appear in the generated summary. Summarization of long dialogs was studied as well by Zhang et al. (2021). Lengthy dialogs often exceed the input limits of language models. To address this issue, the authors proposed retrieving parts of these dialogs to create shorter input prompts, reducing their length to 10% of the original. They used algorithms based on TF-IDF and BM25 (Robertson et al., 1995) to shorten the dialogs. A similar approach was analyzed in another work (Zhong et al., 2021), where the authors proposed the Locator model, which retrieves relevant sequences from long meeting dialogs.
Methods
To make the best picks of dialogs for few-shot training, we consider the following features: content, token count, and attendance. Each of them is characterized by a corresponding score (
Content
The content evaluation is based on the TF-IDF approach, chosen for its simplicity and adaptability across different datasets. Different variants of this approach can be further developed depending on the context. To define the tokens used in the TF-IDF computation, we employed the BERT tokenizer (Devlin et al., 2019) from the Transformers library (Wolf, 2020). This choice was primarily driven by implementation convenience and seamless integration into our existing pipeline. The choice of tokenizer does not significantly impact the model's performance in this context, as the TF-IDF metric tends to produce similar results regardless of the tokenizer used. Thus, we find the token distribution for each dialog and compute the TF-IDF weights,
The content score,
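As a concrete illustration of the content evaluation just described, the sketch below implements plain TF-IDF weighting with cosine similarity between dialogs. It uses a whitespace split in place of the BERT tokenizer, and the function names are ours, not the paper's:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute sparse TF-IDF weight vectors for a list of documents.

    Tokenization here is a simple lowercase whitespace split; the paper
    uses the BERT tokenizer, but as noted above the TF-IDF metric tends
    to behave similarly across tokenizers."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency of each token
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    idf = {t: math.log(n / df[t]) for t in df}
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: (tf[t] / len(toks)) * idf[t] for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

In this sketch, the content score of a candidate dialog would be its cosine similarity to the input dialog's TF-IDF vector; a dialog sharing more distinctive terms with the input scores higher.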
Token Count
Let us consider a dataset of dialogs,
The last condition increases the chances of choosing shorter conversations if there are multiple conversations with similar content. There is no reason to allow an increase in ATC if the content score is the same. If
One reaches the maximum score by picking a conversation with an equal token count. If such a sample cannot be found, we try to find another sample with a small error. Short conversations are advantageous for creating prompts with a low ATC. To prioritize shorter dialogs while still allowing longer ones to be selected occasionally, we require an asymmetrical curve that adjusts selection probabilities dynamically. For this purpose, we propose the asymmetrical double sigmoid (
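One common way to realize an asymmetrical double sigmoid is as the product of a rising and a falling logistic curve with independent slope parameters; the form and parameter names below are illustrative assumptions, not necessarily the exact function and values used in the paper:

```python
import math

def ads(x, center, width, s_left, s_right):
    """Asymmetrical double sigmoid: product of a rising and a falling sigmoid.

    `center` is the target token count (maximum score region), `width`
    controls the plateau, and the two independent slope parameters
    (s_left, s_right) make the curve asymmetric, so dialogs on one side
    of the target decay faster than those on the other. All parameter
    names and this exact form are assumptions for illustration."""
    rise = 1.0 / (1.0 + math.exp(-(x - center + width / 2.0) / s_left))
    fall = 1.0 - 1.0 / (1.0 + math.exp(-(x - center - width / 2.0) / s_right))
    return rise * fall
```

With unequal slopes, a dialog a fixed distance below the target token count receives a different score than one the same distance above it, which is exactly the asymmetry needed to bias selection toward one side.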
The Free Parameters of the Asymmetrical Double Sigmoid Function Used in Finding the Length Similarity Coefficient.
The number of participants attending the conversation is also a key factor. As described in Bokaei (2016), one obtains better results when the summarization system takes into account the number of persons involved in the discussion. For instance, conversations between two people are typically easier to summarize than those involving more participants. In this case, we should prioritize prompt examples with only two interlocutors, to maintain the same level of coherence as stated in the third heuristic. We have designed a function that sets a maximum score when the two dialogs have the same number of participants. The score decreases based on the relative difference in the number of participants,
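A minimal sketch of such an attendance score follows, assuming an exponential decay in the relative difference of participant counts; the decay form and the `alpha` parameter are our illustration, not the paper's exact formula:

```python
import math

def attendance_score(n_candidate, n_input, alpha=1.0):
    """Score a candidate dialog by how closely its number of participants
    matches the input dialog's.

    Maximum (1.0) for an exact match; decays exponentially with the
    relative difference. `alpha` (an assumed parameter) controls how
    sharply mismatched participant counts are penalized."""
    rel_diff = abs(n_candidate - n_input) / n_input
    return math.exp(-alpha * rel_diff)
```

For a two-person input dialog, this assigns candidates with two interlocutors the top score and progressively down-weights dialogs with more participants, matching the prioritization described above.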
Weights
The final score is given by the weighted sum of the three scores,
We perform tests for several weights while keeping the same GPT-3 configuration. By configuration, we refer only to the engine, the number of shots, and the temperature value, set to 0. The temperature controls how randomly the engine chooses the tokens in the generated summary: a low temperature implies picking the most probable tokens predicted by the model, whereas a higher temperature produces larger fluctuations in the results and in the ROUGE scores we obtain. In our experiments, we use zero temperature to avoid fluctuations.
We used the GPT-3 API provided by OpenAI (Brown, 2020) to access the GPT-3 engines (text-curie-001, curie, curie-instruct-beta, ada, babbage).
Datasets
We briefly describe the datasets on which we performed the experiments. All of them are commonly used for training dialog summarization models:
We perform preliminary processing for the dialogs used as shots in the prompt—the selection pool (SP). We construct the term frequency matrix and compute the inverse document frequency coefficients. Additionally, we retrieve the token count and the number of people participating in the conversation for each dialog in SP. In our experiments, the SP comprises between 12,000 and 14,000 dialogs, depending on the dataset. We perform the tests using input dialogs from the testing part of each dataset. The data reduction of the SP is performed only once before testing. For each SP dialog, we compute and save the above-mentioned features.
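The per-dialog feature extraction described above can be sketched as follows; the "Name: utterance" transcript format and the whitespace token count are simplifying assumptions (the paper counts tokens with the BERT tokenizer):

```python
def dialog_features(dialog_text):
    """Extract the per-dialog features cached for each selection-pool entry:
    token count and number of distinct speakers.

    Assumes one 'Speaker: utterance' line per turn and a whitespace token
    count; both are simplifications of the paper's actual pipeline."""
    speakers = set()
    tokens = 0
    for line in dialog_text.strip().splitlines():
        speaker, sep, utterance = line.partition(":")
        if sep and utterance:
            speakers.add(speaker.strip())
        tokens += len(line.split())
    return {"token_count": tokens, "num_speakers": len(speakers)}
```

Because these features are computed once per SP dialog and stored, scoring a new input dialog against the entire pool requires no re-tokenization at query time.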
Prompt Engineering
For each input dialog, CSIS assigns a score by comparing its features to each entry in the SP. Then, we retrieve the content and the summary for the highest k-scored dialogs, where
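Assuming the three component scores have already been computed for each SP entry against the input dialog, the ranking-and-retrieval step can be sketched as below; the entry keys and weight tuple are illustrative names, not the paper's:

```python
import heapq

def select_shots(pool, weights, k):
    """Rank the selection pool by the weighted CSIS score and return the
    top-k entries (each carrying its dialog text and reference summary).

    `pool` is a list of dicts with precomputed component scores relative
    to the current input dialog; key and parameter names are assumptions."""
    w_content, w_size, w_attendance = weights
    scored = []
    for entry in pool:
        score = (w_content * entry["content_score"]
                 + w_size * entry["size_score"]
                 + w_attendance * entry["attendance_score"])
        scored.append((score, entry))
    # nlargest with an explicit key avoids comparing the dict payloads
    top = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [entry for _, entry in top]
```

The returned entries supply the (dialog, summary) pairs that become the k shots of the prompt.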

A diagram that illustrates the overall scoring process and prompt generation.
The prompt includes the following components: an instruction hint, the selected samples, the input dialog, and the delimiters. We provide an example of a two-shot prompt in Appendix A.
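A minimal sketch of how these components might be assembled is given below; the exact instruction wording and delimiter tokens are assumptions, and the paper's actual two-shot prompt appears in Appendix A:

```python
def build_prompt(instruction, shots, input_dialog, delimiter="\n###\n"):
    """Assemble a few-shot prompt: instruction hint, then each selected
    (dialog, summary) pair, then the input dialog awaiting its summary.

    The 'Dialog:'/'Summary:' labels and the '###' delimiter are
    illustrative choices, not necessarily those used in the paper."""
    parts = [instruction]
    for dialog, summary in shots:
        parts.append(f"Dialog:\n{dialog}\nSummary:\n{summary}")
    # The input dialog ends with an open 'Summary:' for the model to complete
    parts.append(f"Dialog:\n{input_dialog}\nSummary:")
    return delimiter.join(parts)
```

Ending the prompt with an open "Summary:" label cues a completion-style model such as GPT-3 to generate the summary of the final dialog.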
The function that computes the size coefficient significantly contributes to cost minimization. Similar to the standard deviation or the full width at half maximum (FWHM), an interval can be defined within which dialog picks are scored highly. The asymmetrical double sigmoid proposed in the previous section (ads broad) exhibits a larger FWHM of 85.6 tokens. During our empirical study, we performed preliminary experiments on the SAMSum Corpus dataset to analyze different functions for obtaining the length similarity coefficient. In Table 2, we compare our proposed function with the following options:
- Random choice (no cost constraints)
- A normal distribution with
- An asymmetrical double sigmoid with an FWHM of 61.8 tokens (ads narrow)
- A piecewise function for which
The results are derived by averaging the corresponding ATC and ROUGE scores of the outputs of 100 two-shot prompts using CSIS with
Comparison of Different Functions for Computing the Length Similarity Coefficient on the SAMSum Corpus Dataset.
We run several experiments with GPT-3 text-curie-001 for 50 tests each to identify the best set of weights. During these experiments, CSIS uses the ads-broad for the token count function and the exponential function for attendance, as described in the previous section. We cannot decide on a single configuration of the weights because the tests are not numerous enough to draw statistically significant conclusions. We show the experimental results in Appendix B. Further experiments are performed on a smaller set of weights, by running 200 tests on each configuration. The results are presented in Table 3. In the last row, we show the scores obtained by using random prompts (no feature scoring is used). An increase in ROUGE scores is observed when using CSIS for all three datasets. The effect of CSIS is most evident in the datasets containing longer dialogs (DSd and MSd; the token count distribution can be seen in Figure 2). The ATCs of the datasets,
Experimental Results—Weights Configuration 200 Tests.
Note. SCd = SAMSum Corpus; DSd = DialogSum; MSd = MediaSum.

Datasets statistics on token counts.
Performance Evaluation of CSIS
Experimenting on Different Datasets
We experiment with different numbers of prompt-tuning shots. As expected, the performance increases almost every time more tuning samples are provided in the prompt. The left subfigure of Figure 3 shows the ROUGE-1 score as a function of the number of shots. We repeated the experiment for each dataset. CSIS results consistently outperform their “random picks” counterparts for all datasets and shot configurations, on average by 1.1 (SCd), 1.7 (DSd), and 7.3 (MSd) ROUGE-1 score units, demonstrating the effectiveness of the proposed scoring system in selecting better prompts. For MSd, while the absolute ROUGE-1 scores are lower compared to the other datasets, the use of CSIS results in a more significant relative improvement over random picks, particularly at three shots. MSd has a significantly larger mean token count (

The left subfigure shows the content-, size-, and interlocutor-based scoring system (CSIS) performance evaluations with respect to the baseline of randomly generated prompts. The weight configurations are those highlighted in Table 3. The right subfigure presents the results obtained for different GPT-3 engines. In both cases, the evaluation is given by ROUGE-1 scores (scaled between 0 and 1).
We evaluate the performance of the ada, babbage, curie, and curie-instruct-beta engines of GPT-3 with and without CSIS. As expected and shown in the right subfigure of Figure 3, the performance decreases as we move from larger engines (curie-instruct-beta) to smaller ones (ada), the ROUGE-1 score dropping by 12–18 units. Despite the performance drop with smaller engines, the use of CSIS consistently enhances results across all engines, showing its effects in both resource-constrained and larger-scale scenarios. The best improvement, of 2.1 ROUGE-1 score units on average, is seen for the ada model, whereas for the curie-instruct-beta model we notice an improvement of 1.1 ROUGE-1 score units on average.
Comparison to Fine-Tuned Models
We fine-tune a GPT-3 curie model for four epochs to compare its performance to that of the proposed prompt-tuning method. Different SPs between 50 and 2000 SCd samples are used for training the fine-tuned model. The results illustrated in Figure 4 show that fine-tuning outperforms our prompt-tuning method by 0.07 ROUGE-1 score units on average. Additionally, fine-tuning can demand more computational resources. In four epochs, training requires spending between 45 k and 1600 k tokens depending on the SP size (the training set), whereas prompt-tuning spends only 221 tokens per query on average. Thus, approximately 400 queries based on prompt tuning can run using the same number of tokens elapsed in training the model on 100 samples. However, fine-tuning can be a good investment if the fine-tuned model is used over a long period in a stable environment (meaning that the SP samples remain efficient as training examples after a period of time). Choosing between prompt-tuning and fine-tuning is a decision that should be analyzed depending on the targeted performance, the costs, and the expected number of queries the model will process.

Comparison between prompt-tuning and fine-tuning performances with respect to the selection pool size.
In addition, using the same SPs, we fine-tune a T5 model (Raffel et al., 2020) with 60 M parameters (t5-small), a BART model (Lewis et al., 2019) with 406 M parameters (bart-large-cnn), and a GPT-2 model (Radford et al., 2019) with 124 M parameters. We test the fine-tuned models on the same subset of 100 samples. In the case of the T5 (t5-small) and BART (bart-large-cnn) models, one can observe performance comparable to that of our approach.
The SP for each dataset remains unchanged during the experiments. However, to investigate whether CSIS results depend on the SP, we perform a series of different experiments that also include a significant amount of foreign data.
We mix SCd and DSd data to create a heterogeneous SP. The new SP consists of 54% SCd samples (14,732) and 46% DSd samples (12,460). We run the model again for 100 tests with a curie-instruct-beta engine and the weights
The ROUGE Scores, Average Scores Provided by CSIS (
), and ATC for Different Selection Pools and Numbers of Shots used in the Training.
Note. ATC = average token count; CSIS = content-, size-, and interlocutor-based scoring; SP = selection pool; SCd = SAMSum Corpus; DSd = DialogSum.
Software applications can be developed using the CSIS approach we propose. To do so, one must consider how to define the SP. Different categories of users may use distinct SPs. Depending on the users' behavior, the SP will also have to change over time. Consequently, a solution based on a dynamic SP can be implemented. User feedback can also be used to update the SP in two ways: (i) by providing better summaries and (ii) by setting up several preferences. Methods for gathering data to improve or form new SPs could be a continuation of this work.
We remove one component of CSIS at a time and perform a series of tests varying the weights of the remaining features. These experiments are meant to prove the relevance of each feature considered by CSIS. We plot the ROUGE-1 score difference between a reference score and the new score,

Results of the content-, size-, and interlocutor-based scoring system (CSIS) ablation study: the ROUGE score difference (
To examine whether the models can leverage information from all dialogs provided in our prompts, we conducted an experiment analyzing how the removal of parts of the few-shot example dialogs affects summarization quality. Specifically, we randomly removed parts of the input dialogs and evaluated the performance using ROUGE-1 F1 score differences between the incomplete prompt and the full original prompt,
We show the results in Table 5. The tests are grouped by intervals of token amounts removed. Performance generally declines for SCd and DSd when data is omitted, highlighting the importance of providing complete and well-structured few-shot examples for LLMs to achieve optimal summarization quality. On the other hand, we do not notice a substantial difference for MSd. One plausible explanation is that most of the dialogs in this dataset are much longer than in the other datasets, so removing a part of the prompt equivalent to 250 tokens does not have a significant impact.
Impact of Utterances Removal on ROUGE-1 F1 Scores Across Different Datasets (SCd, MSd, DSd). The Table Shows the Number of Tokens Removed, the Corresponding Changes in ROUGE-1 F1 Scores, and the Number of Tests for Each Interval of Removed Tokens.
Note. SCd = SAMSum Corpus; DSd = DialogSum; MSd = MediaSum.
We consider a human evaluation necessary because the main known problem of dialog summarization is that the summary may provide incorrect references (Chen & Yang, 2020; Feng et al., 2021c) that distort the information from the original dialog, a phenomenon also known as hallucination (Yichong, 2021; Zheng, 2020). These hallucinations cannot be identified through ROUGE scores, so it is imperative to have a second evaluation to validate the ROUGE results. The human evaluation is based on the following four criteria: coherence, which relates to the overall quality of the sentences and how well the summary is structured; consistency, which measures the factual information transfer; fluency, which scores the quality of individual sentences (i.e., grammar, formatting, and so forth); and relevance, which shows how important the selected information is, with an excess of information being penalized. These criteria have been used in other studies as well, e.g., Fabbri (2021b). We ask the annotators to rate different summaries for 100 dialogs on a scale from 1 to 5. The evaluation is blind, including the reference summary from the SAMSum dataset and the summaries generated by GPT-3's curie-instruct-beta (with two shots,
Example of Summary Scored low by ROUGE Metrics, but Evaluated as very Good by Human Judgment.
Human Evaluation Results.
The main reason for conducting a human evaluation is that ROUGE scores are not always reliable. We calculate the Pearson coefficient to evaluate the correlation between human judgment and ROUGE scores (Table 8). The scores indicate a weak correlation between the evaluation results. There are many examples, such as the one provided in Table 6, where very good summaries are rated with low ROUGE scores, namely at least 10 ROUGE-1 score units below the average. This is due to ROUGE's reliance on exact word matching, which can penalize synonyms or paraphrases that are semantically similar but not an exact match. This limitation is well documented in the literature (Nguyen et al., 2024; Schluter, 2017).
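To illustrate the exact-match limitation, a bare-bones ROUGE-1 F1 sketch (plain unigram overlap, without the stemming and preprocessing of full ROUGE implementations) shows how a faithful paraphrase can score poorly:

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap ROUGE-1 F1 between two strings.

    A minimal re-implementation for illustration: lowercase whitespace
    tokenization, no stemming or stopword handling."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, against the reference "tom will pick up anna at eight", the paraphrase "tom is giving anna a ride this evening" shares only two unigrams and scores well below 0.3 despite conveying the same fact, which is exactly the failure mode the human evaluation is meant to catch.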
Pearson Correlation Coefficients Between Human Evaluation Results and ROUGE Metrics for Two set-ups: GPT-3 with content-, size-, and interlocutor-based scoring system (CSIS) and Without it (Random Prompts).
In this study, we investigated a prompt-based approach to improve the abstractive summarization of dialogs. We showed that:
- CSIS consistently outperforms random prompts across different scenarios, achieving an average increase in ROUGE-1 scores of
- Fine-tuning the GPT-3 curie model on SCd yields better results than prompt-tuning with GPT-3 curie using SCd as SP (by
- The SP choice plays a role in the model's performance. Mixing datasets from different sources (e.g., SCd and DSd) results in slightly better performance in one-shot settings. This improvement can be attributed to the increased diversity of samples and a broader range of choice options, allowing the model to adapt better to varied input characteristics.
- The ablation study confirms that all components of CSIS contribute positively to its performance. While size has the smallest impact in SCd due to its narrow token range, attendance is less impactful in more complex datasets like MSd and DSd, highlighting dataset-specific feature relevance.
- We evaluated CSIS using ROUGE metrics and by conducting a human evaluation. In our experiments, small variations in the average ROUGE score corresponded to large discrepancies in the scores given by the annotators. However, both evaluations showed that applying CSIS increases the quality of the summaries. The human evaluation showed us that choosing similar dialog samples in creating the prompt increases the quality of the summaries, reducing the number of failures by
Limitations
The current method is applicable exclusively to LLMs that support few-shot training. It may be applied in selecting fine-tuning samples as well, but there should be a specific demand—a reference set of dialogs on which the scoring happens. The efficiency of the scoring system may be compromised when input dialogs significantly differ from the samples used to construct the prompt.
