Introduction
Settling a dispute by arguing from different value-based viewpoints can sometimes be effective and sometimes sterile. For example, in the field of justice, it is common to see a case that is well founded in evidence lost due to arguments that point to procedural errors. The appeal is usually effective because, for the audience of the courts of justice, the rules of judicial procedure have a higher value than the evidence. On the side of sterile argumentation, there are often dead-end debates on the legalization of abortion, with pro arguments from, for example, the value frameworks of women’s rights and collective health, and con arguments from the frameworks of religion and the rights of the unborn. In this case, the argumentation tends to be sterile due to disagreements on the superiority of one value framework over the other.
Value-based argumentation frameworks (VAFs) [11,13,14,17,19,20,33] are computational models of persuasion intended to serve particular areas of practical reasoning and decision-making such as ethics and law. Drawing inspiration from Chaim Perelman’s rhetorical theory, Bench-Capon [13,14,17,19] considered the strength of arguments as varying according to the values they promote and the assessment of those values by the particular audience they address. An argument is defeated if it is attacked by another argument promoting a value which is at least as preferred (by the audience) as the value it promotes. However, the argument is strong enough to withstand the attack if the value it promotes is preferred. Then, argument interactions through an attack relationship can be analyzed to find the best justified arguments for the particular audience. This intuition facilitates the prescription of normative guidelines for the computational treatment of persuasion through argumentation.
Nonetheless, we could also think of VAFs as a theory about how persuasion occurs in real people, and psychological experiments could provide empirical support for the underlying intuitions. The broad approach we are proposing is not new. Bench-Capon’s model is based on Dung’s more general abstract argumentation frameworks [31], about which some empirical studies have been conducted to test certain “principles” and semantics for argument acceptance as common intuitions of ordinary people (see Section 2). However, as far as we know, the approach for the special case of VAFs is novel.
In this paper we report a series of experiments carried out to compare the underlying intuitions of VAFs with those of ordinary people. We emphasize that the objective is
The experiments we describe here enabled us to observe that people’s argument acceptance is more correlated with preferences among the values that the arguments promote than with specific forms of interaction among the arguments through an attack relationship. Moreover, the results showed that the relative importance of the values is assessed with different degrees, and the same argument can promote various values with diverse strengths. This is in line with Bench-Capon’s recent claim [16] that a single ordering on values does not adequately capture argument strength if there are several potential audiences. We also found that the evaluation of the interaction among arguments can be modulated by framing effects, and people usually change their perception of the relative importance of values depending on biases that incline their feelings towards conclusions. This finding is consistent with well-known bias effects reported by Kahneman, Tversky and others in the Heuristics and Biases program (e.g., [35,37]).
The article is structured as follows. In Section 2 we review related works on empirical approaches to argumentation frameworks. Section 3 summarizes the formal definitions of VAFs and their semantics. In Section 4, we describe four experiments, analyzing and discussing their results individually. Finally, a general discussion and concluding remarks are offered in Section 5.
Related work
The approach of testing computational argumentation systems as models of human argument assessment has not been much explored, but is gaining increasing interest. VAFs are built on Dung’s argumentation frameworks [31], extending the model by adding a set of values and a function that assigns one value to each argument. The first study on Dung’s argumentation frameworks model as a descriptive theory was reported by Rahwan and colleagues [45]. The aim was to test the
Cerutti et al. [25] found that the acceptability of arguments in human subjects corresponds mostly to the skeptical semantics of Prakken and Sartor’s model [44]. However, they also observed important deviations that seem to arise from the implicit knowledge of subjects about the domain in which the evaluations are carried out.
Rosenfeld and Kraus [46] identified problems with more basic intuitions, such as conflict-freeness. All of Dung’s extension semantics satisfy this property, which indicates that the set of arguments accepted by a rational agent has no internal conflicts, that is, that the accepted arguments do not attack each other. Nevertheless, the authors found a significant percentage of surveyed individuals who accepted conflicting arguments, which they explained in part by the fact that some people try to appear as impartial and good referees, capable of considering the different positions. They also assumed that some subjects may not accept defeat by arguments that promote a value regarded as weaker than that of the attacked argument (in line with the Bench-Capon value-based model, which is the specific object of study in this paper). In a later work [47], the authors analyzed the argumentative behavior of more than 1000 individuals (on different discussion subjects such as the death penalty, flu vaccination or jury trials) and, using machine learning algorithms, found that it is possible to predict the choice of arguments in a high percentage of cases, especially by combining a relevance heuristic based on distance measures among arguments. They then contrasted those results with Cayrol and Lagasquie-Schiex’s bipolar argumentation theory [24] (an extension of Dung’s model which includes a support relationship in addition to the attack relationship), finding that its predictions are not adequate.
Polberg and Hunter [41] also noted defects in Dung’s model in explaining human argumentative behavior, and considered the uncertainty arising from subjects’ opinions about both the arguments and the structure of the framework’s graph to be of crucial importance. In relation to more optimistic results, Toniolo, Norman and Oren [50] found good predictions in the probabilistic semantics model of Thimm [49] and Hunter and Thimm [36]. These semantics estimate the probability of accepting an argument. First, the probability that a set of arguments is an extension for a given semantics is computed, and then the probability of accepting an argument is calculated as the sum of the probabilities of all those possible extensions to which it belongs. The experimental result was that people tend to agree with the semantics in terms of the credibility they attach to the conclusions of the higher probability arguments.
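The second step of the probabilistic semantics described above (summing the probabilities of the extensions that contain an argument) can be illustrated with a minimal sketch. The function name and the probability values are hypothetical and chosen only for illustration. They are not taken from [36,49,50], and the first step (computing the probability of each extension) is assumed to have been carried out already.

```python
from typing import Dict, FrozenSet


def acceptance_probability(argument: str,
                           extension_probs: Dict[FrozenSet[str], float]) -> float:
    """Sum the probabilities of all candidate extensions containing `argument`."""
    return sum(p for ext, p in extension_probs.items() if argument in ext)


# Hypothetical distribution over candidate extensions (illustrative numbers only).
extension_probs = {
    frozenset({"A", "C"}): 0.5,
    frozenset({"B"}): 0.25,
    frozenset(): 0.25,
}

prob_a = acceptance_probability("A", extension_probs)
prob_b = acceptance_probability("B", extension_probs)
```

Under these assumed numbers, A would be the more credible argument, since it belongs to the extension carrying most of the probability mass.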
Cramer and Guillaume [30] argued that studies detecting a correspondence between human behavior and the semantics of argumentation frameworks are limited to contexts of a few arguments, hence such results cannot be generalized. Using more complex scenarios, they showed that part of the participants chose cognitively simple strategies that coincide with Dung’s grounded semantics when analyzing strong acceptance (that is, arguments belonging to all the extensions), while others adopted more cognitively demanding strategies that are well predicted by CF2 semantics [7].
Recently, Bezou-Vrakatseli [23] replicated the findings of [45] on argument reinstatement, but obtained results contrary to some explanatory hypotheses about people’s behavior.
To the best of our knowledge, no research similar to those described above has been conducted with respect to VAFs. There are, instead, several works in favor of
Value-based argumentation frameworks (VAFs)
We begin by summarizing the basic definitions of the model we are going to test. The primitive concepts are
([31]).
An
Arguments interact through the attack relation. To determine which arguments survive such an interaction, different “semantics” can characterize the justification or warrant. As a result, each semantics sanctions a class of sets of argument, the
([31]).
Given an argumentation framework
Accepting arguments belonging to
Bench-Capon’s VAFs are extensions of argumentation frameworks that take into account values associated with arguments. To that end, a set of values and a function mapping arguments to values of that set are added.
A Here we follow the lines of [19]. From here on, we refer to an arbitrary but fixed audience
Different audiences for the same VAF may make dissimilar assessments of the arguments, according to their particular preference orders of values. Particularly, an attack by an argument
For arguments An argument A subset of arguments A subset of arguments A subset of arguments An argument An argument
VAFs retain the abstract character of Dung’s argumentation frameworks, which facilitates the instantiation of several structured models of argument. For instance, the ASPIC+ framework [43] and the DeLP system [34] are expressive enough to specify the internal structure and content of arguments, and allow for the use of explicit preferences to resolve attacks. Moreover, VAFs can instantiate argument schemes for practical reasoning, which is useful in representing legal case-based reasoning and public policy-making [4].
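The audience-relative notion of defeat (an attack succeeds unless the audience strictly prefers the value promoted by the attacked argument) can be illustrated with a minimal sketch. The function names are hypothetical. The attack chain follows the linear representation of the insulin scenario attributed to [14], and the value assigned to D is an assumption made only for illustration. The accepted set is computed as the grounded extension of the resulting defeat graph, which, for an acyclic defeat graph such as this one, coincides with the unique preferred extension.

```python
from typing import Dict, List, Set, Tuple

Attack = Tuple[str, str]       # (attacker, target)
Preference = Tuple[str, str]   # (preferred value, less preferred value)


def vaf_defeats(attacker: str, target: str, attacks: Set[Attack],
                value_of: Dict[str, str], prefers: Set[Preference]) -> bool:
    """The attack succeeds unless the target's value is strictly preferred."""
    if (attacker, target) not in attacks:
        return False
    return (value_of[target], value_of[attacker]) not in prefers


def grounded_extension(arguments: List[str],
                       defeat_pairs: Set[Attack]) -> Set[str]:
    """Accept an argument once all its defeaters are rejected;
    reject it once some defeater is accepted; repeat until stable."""
    accepted: Set[str] = set()
    rejected: Set[str] = set()
    changed = True
    while changed:
        changed = False
        for a in arguments:
            if a in accepted or a in rejected:
                continue
            defeaters = {x for (x, y) in defeat_pairs if y == a}
            if defeaters <= rejected:
                accepted.add(a)
                changed = True
            elif defeaters & accepted:
                rejected.add(a)
                changed = True
    return accepted


arguments = ["A", "B", "C", "D"]
attacks = {("D", "C"), ("C", "B"), ("B", "A")}
# Values of A, B, C follow the reading given in the text; D's value is assumed.
value_of = {"A": "life", "B": "property", "C": "property", "D": "life"}


def accepted_for(prefers: Set[Preference]) -> Set[str]:
    """Accepted arguments for an audience with the given strict preferences."""
    defeat_pairs = {(x, y) for (x, y) in attacks
                    if vaf_defeats(x, y, attacks, value_of, prefers)}
    return grounded_extension(arguments, defeat_pairs)


life_first = accepted_for({("life", "property")})
property_first = accepted_for({("property", "life")})
```

Under these assumptions, the two audiences accept different sets of arguments for the same framework, which is the behavior the model is intended to capture.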
As we carried out the experiments reported below, some hypotheses arose about variations on VAFs that could account for the obtained results. For the sake of readability, we introduce those variations in this section and will refer to them in due course.
MVAF
The first variant concerns binding each argument to multiple values. This leads us to define the following model:
A
Unlike VAFs, this model allows representing arguments that promote several values. Accordingly, we modify the notion of defeat as follows:
Let
In other words, A defeats B if A attacks B and no value promoted by B is preferred to all the values promoted by A. This means that if B promotes some value that is preferred to all the values promoted by A, then the attack does not succeed.
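The multi-value defeat condition stated above can be written as a short check. This is only a sketch of the informal reading given in the text. The function name, the example arguments, and the handling of arguments that promote no value at all are assumptions.

```python
from typing import Dict, Set, Tuple

Attack = Tuple[str, str]       # (attacker, target)
Preference = Tuple[str, str]   # (preferred value, less preferred value)


def mvaf_defeats(attacker: str, target: str, attacks: Set[Attack],
                 values_of: Dict[str, Set[str]],
                 prefers: Set[Preference]) -> bool:
    """attacker defeats target iff it attacks target and no value promoted by
    target is preferred to all the values promoted by attacker."""
    if (attacker, target) not in attacks:
        return False
    for v_target in values_of[target]:
        # Assumption: if the attacker promotes no value, this test holds
        # vacuously, so any value promoted by the target blocks the attack.
        if all((v_target, v_att) in prefers for v_att in values_of[attacker]):
            return False
    return True


# Hypothetical audience and value assignments, for illustration only.
prefers = {("life", "property")}
values_of = {
    "A": {"life"},
    "B": {"property"},
    "C": {"life", "property"},
}
attacks = {("B", "A"), ("C", "B")}

b_defeats_a = mvaf_defeats("B", "A", attacks, values_of, prefers)
c_defeats_b = mvaf_defeats("C", "B", attacks, values_of, prefers)
```

In this example the attack from B on A fails because A promotes a value preferred to every value promoted by B, whereas the attack from C on B succeeds.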
The second variant refers to the possibility of assigning diverse importance degrees to the values, so that each argument is assigned a strength that results from the sum of the importance degrees of all the values that the argument promotes.
A for each
In words, (3) means that
The notion of defeat based on the strength of arguments is defined as follows:
Let
Simply put, A defeats B if A attacks B and B is not “stronger” than A.
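A minimal sketch of the strength-based variant follows, assuming that the strength of an argument is the plain sum of the importance degrees of the values it promotes, as described in prose above. The function names and the importance degrees are hypothetical.

```python
from typing import Dict, Set, Tuple

Attack = Tuple[str, str]  # (attacker, target)


def strength(argument: str, values_of: Dict[str, Set[str]],
             importance: Dict[str, float]) -> float:
    """Sum of the importance degrees of all values the argument promotes."""
    return sum(importance[v] for v in values_of[argument])


def strength_defeats(attacker: str, target: str, attacks: Set[Attack],
                     values_of: Dict[str, Set[str]],
                     importance: Dict[str, float]) -> bool:
    """attacker defeats target iff it attacks target and target is not stronger."""
    if (attacker, target) not in attacks:
        return False
    return (strength(target, values_of, importance)
            <= strength(attacker, values_of, importance))


# Hypothetical importance degrees and value assignments, for illustration only.
importance = {"life": 9, "property": 6}
values_of = {
    "A": {"life"},
    "B": {"property"},
    "C": {"life", "property"},
}
attacks = {("B", "A"), ("C", "B")}

strength_c = strength("C", values_of, importance)
b_defeats_a = strength_defeats("B", "A", attacks, values_of, importance)
c_defeats_b = strength_defeats("C", "B", attacks, values_of, importance)
```

Note that, under this reading, an argument promoting several moderately important values can be stronger than one promoting a single highly important value.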
All the experiments reported here were conducted under a general protocol approved by the Ethics Committee of the City Hospital of Bahía Blanca, Argentina. We recruited students from several groups for each experiment and no participant had a background in logic or argumentation theory.
We were mainly interested in finding out if people’s assessments of arguments in a value-based argumentation scenario are related to their perceptions about attack among arguments, defeat and value preferences, consistently with Bench-Capon’s intuitions. The first experiment was exploratory, and its results enabled us to propose a set of hypotheses that we tested in subsequent experiments.
Experiment #1
Participants were 64 volunteer undergraduates from different fields of study at the Universidad Nacional del Sur, Bahía Blanca, Argentina. Since we had no hypotheses regarding either gender or age, we did not collect data on those variables in these experiments. However, we knew that the sample was balanced by gender and that the vast majority of participants were between eighteen and twenty years old.
The experiment was conducted by means of an online single-variant Google form. We used a scenario similar to the one examined in [14], [Footnote 3: This scenario, introduced by Coleman in [28], was also discussed in [13] but with a different representation.] The text was in Spanish. The characters in the original story are Carla and Hal, but we changed them to two female characters to avoid the possible influence of a gender bias.
First, we asked participants to identify the values promoted by each argument by choosing an option among “Right to life,” “Right to property,” “Right to life and right to property,” and “Neither.” Second, we asked them to express preferences between the values regarding the situation in question, by comparing them in terms of “more important than,” “equally important,” and “I don’t know.”
Then, participants were asked to identify compatibilities and incompatibilities between the arguments, instead of detecting attacks. This is because we assumed that, for an audience not specialized in argument systems, the term ‘attack’5 In Spanish,
Finally, participants were requested to evaluate arguments and conclusions. For arguments, we asked the following question:

Fig. 1. VAF of the insulin scenario according to Bench-Capon (taken from [14]).

Fig. 2. Percentage of individuals who perceived each defeat. In parentheses, percentage of participants who also observed both incompatibility between the arguments and non-preference for the value promoted by the defeated argument. For simplicity, we recorded only those defeats (arrows) observed by at least 20% of individuals.
We start by checking the perception of defeat. Figure 2 shows the percentages of participants that perceived each defeat and, among them, those who considered that both conditions were satisfied. The participants’ intuitions about defeat were fairly consistent with Bench-Capon’s notion in some cases, but not in others. By definition, defeat implies attack (at least, incompatibility, following our hypothesis) and non-preference of the value promoted by the defeated argument, but this implication was only verified to a good extent for the pairs of arguments (A, B), (B, A), (C, B), and (C, D). Overall, 238 defeats were reported, and participants identified an incompatibility between the arguments in tension in 159 of those 238 cases (67%). Therefore, most participants’ opinions are consistent with the notion that people should perceive incompatibility between a defeated argument and its attacker. That said, a significant minority of judgments (33%) did not conform to that premise. In addition, it is clear that the representations of the scenario as a VAF that arise from the opinions of participants are mostly at odds with Bench-Capon’s representation given in Fig. 1.
As reported in Table 1, the experiment allowed us to observe that people quite clearly identify the values in agreement with Bench-Capon’s opinion when they are explicit in the premises (arguments A and B), but not when they are implicit (arguments C and D). Of the total number of participants, 72% linked argument A exclusively to the right to life, and 94% related argument B exclusively to the right to property. The case of argument C (“Martina can do that if she replaces Carla’s insulin after the emergency”) is surprising. In our opinion (and in line with Bench-Capon), it promotes the value of private property, but only one participant associated the argument exclusively with that value, while 31% identified it with both life and property rights, 30% did not link it to any of those values and, strangely, 38% related it
Table 1. Experiment #1. Perception of values promoted and argument acceptance
Opinions on winning arguments are summarized in Table 2. The “extension”
Table 2. Experiment #1. Arguments chosen as winners and values promoted according to majority judgment, and claimed value preference
Regarding the conclusions, the responses were consistent with the selection of arguments. The acceptance (resp., non-acceptance) of an argument given the answers “agree” or “totally agree” (resp., “totally disagree,” “disagree,” or “neither agree nor disagree”) with respect to its conclusion was observed in 86% of the cases (221 out of 256). Conversely, the agreement (resp., disagreement) with a conclusion given the acceptance (resp., non-acceptance) of its supporting argument was observed in 98% of the cases (251 out of 256). The difference seems to make sense given that, in some situations, people are inclined to accept a conclusion even if they do not accept some supporting arguments.
Next, we assessed the extent to which VAF can be used to make correct predictions about argument acceptance and compared that with other variables. In each case, we took into account perceived defeats, perception of what values the arguments promote, and value preferences. We checked the following variants:
VAF (Definitions 3 and 4). We verified only those cases in which each argument was associated with
VAF (I). It could be the case that the participants’ intuitions about ‘defeat’ did not coincide with the expected interpretation. If this is the case, the appreciation of incompatibility (I) between the arguments in that relationship would suggest a more precise correspondence. Consequently, we checked if considering defeat together with incompatibility perceptions made a significant prediction difference with respect to considering defeat perceptions alone.
MVAF (Definitions 5 and 6) In this model, arguments can be perceived as promoting
MVAF (I). The same as MVAF, but taking incompatibility as a reinforcement of defeat.
AF (Definitions 1 and 2). Dung’s simple argumentation frameworks model. We intended to know to what extent that model predicts differently than VAF and MVAF. Here, we assimilated defeats to attacks, regardless of value assessments.
AF (I). The same as AF, but taking incompatibility as a reinforcement of defeat perceptions.
V. Finally, we verified to what extent the promoted value and the preference for that value, regardless of any other variables, is enough to make correct predictions.
Results are displayed in Table 3, which shows that differences are in a small range of around 5.5 percentage points. As can be seen, VAF worked relatively well, but the best performance was obtained by taking into account only the promoted values and value preferences. Beyond mean differences, the overlap of the confidence intervals evidences the closeness of the predictive efficacy among all models.
Table 3. Experiment #1. Percentages (mean) of successful predictions with confidence intervals
A reviewer also suggested investigating logical fallacies that could affect the perception of values.
A different source of doubt regarding the association of arguments with values lies in a possible inadequate design of our experiment. In view of the results, the option “right to life
We compared VAF-based predictions with several variants, obtaining similar results. However, the best performance was that of the model that considers value preferences as the only relevant variable. This suggested that improved measurement of appreciations relative to values could give us better differentiated results. Likewise, we suppose that intuitions about defeat might be better tested in simpler scenarios where attacks are more clearly one-way directed and value promotion is more easily identifiable.
We can also analyze other sources of representational problems. The directionality of the attack relation is certainly a very important issue. Bench-Capon himself was aware of the problem and did not have a definite solution. In fact, the insulin scenario, which he represented in [14] by a linear relation where D attacks C attacks B attacks A (the representation we used in our study), was instead represented in [13] by a 3-cycle where A attacks C attacks B attacks A. Moreover, in [12], the author said the following: “Consider the trite example of the arguments ‘Kerry can fly because she is a bird’ and ‘Kerry cannot fly because she is a kiwi.’ Here, although we have contradictory conclusions, we would naturally say that the second argument attacks the first, but not vice versa. Sometimes, therefore I shall represent what might in logic be considered a mutual attack as an attack in one direction only, when it seems clear that this is how the arguments are intended. By giving ourselves this amount of freedom, we should be able to construct the most natural representation possible.”
Another sign of a possible representation issue is the low acceptance of argument D (and of its conclusion). This might suggest the intervention of an unspoken argument in the minds of the participants, one that defeats D. Given that D holds that a person with limited economic resources would have the right to take the insulin without the obligation to compensate the owner, the argument could be interpreted as referring to exceptional cases, and the information in the context does not indicate that Martina’s case fits such an exception. This suggests at least three representational rectifications of the models. One is to take into account the incidence of some argument E attacking D, based on considerations of irrelevance for the case in question. This fifth argument would promote, in turn, the value of information, or the lack of it, prevailing over the others. The argument D can raise some critical questions that can be expressed in the form of arguments, and that can be represented through structured argumentation versions of VAF (see, for example, [2,3,18], and [8] for representing exceptions). A second rectification consists of simply deleting D from the representation, while a third one consists of considering an attack from C to D. All these rectifications would explain the high esteem for argument C, whose only attacker, D, would be ignored or rejected. [Footnote 7: Note that those who considered that A attacks C tended not to see any incompatibility between these arguments (Fig. 2).]
These possibilities are in line with [26], where the authors claimed that, in order to create argumentation systems, designers must take into account implicit domain-specific knowledge or beliefs. Maybe these representational problems are more important than the extent to which the empirical data match some semantics, [Footnote 8: This opinion was subscribed by an anonymous reviewer.]
In view of the above results, we planned to test the following hypotheses:
H1: The strength of arguments is assessed as a function of the various degrees of importance conferred to the values the arguments are perceived to promote (not just as a function of a preference order), and different values can converge in the same argument with dissimilar strength degrees.
H2: When deciding on a given debate, people tend to consider unspoken arguments.
H3: The preference of the value promoted by an attacked argument may have the effect of avoiding defeat, but the attacking argument is not acceptable together with the attacked one since the conflict persists.
H4: Acceptance of value-based arguments can be modulated by biases and framing effects.
Hypotheses H1 and H2 were tested in Experiment #2, while H3 and H4 were tested in Experiment #3 and Experiment #4, respectively.
Sixty-two undergraduate students from different fields of study were voluntarily recruited at the Universidad Nacional del Sur, Bahía Blanca, Argentina. In order to test hypotheses H1 and H2, we maintained the questionnaire from the first experiment, except for the questions that involved the variables relevant to the hypotheses, namely, association of arguments with values, varying degrees of importance assigned to values, and possible influences of external arguments. Regarding hypothesis H1, participants were asked to express their beliefs about the value(s) each argument promotes. To answer, they had to mark cells in a table with four rows headed by the names of the arguments (A, B, C, D) and two columns titled “right to life” and “right to property”. Then, participants were requested to indicate the importance of the values with respect to the referred situation (the insulin case), each one on a 10-point scale where 1=“Not important at all,” and 10=“Absolutely important.” Regarding hypothesis H2, participants were asked to answer the following question:
Unlike bipolar argumentation models [24], supporting arguments do not play a role in VAFs. However, if the participants expressed supporting arguments, then we would have, on the one hand, a better understanding of their behavior and, on the other hand, some evidence that would show the limits of VAFs to correctly represent the situation.
Table 4. Experiment #2. Perception of values promoted and argument acceptance
To test H1, we formulated the hypothesis in a more precise way by taking into account the following operationalization of functions, where X
Then, we reformulated the hypothesis as follows:
∙ H1: The set of arguments accepted as winners by the individual
Then we analyzed a quantitative version QV of model V, that predicts the choice set
Additionally, we aimed to analyze the quantitative version QMVAF (Definitions 7 and 8), with the variants of considering defeat perceptions, on the one hand, and defeat reinforced with incompatibility perceptions (QMVAF (I)), on the other.
The comparison of correct prediction rates is displayed in Table 5. As in Experiment #1, we tested VAF-based predictions only in those cases in which the arguments were perceived as promoting at most one value (
Table 5. Experiment #2. Percentages (mean) of successful predictions with confidence intervals
Regarding H2, i.e., the hypothesis that participants considered the incidence of external arguments, 19 participants (31%) introduced new arguments or, at least, some comments, either for explaining their decisions or for criticizing some point of view. Among them, we identified 11 out of 19 arguments either presenting objections to argument D or supporting the contrary conclusion, i.e., Martina must replace the insulin. Here are some examples:
In either case, we conclude that the participants who introduced those arguments considered some explicit or implicit defeat to D.
(1) When defeat and incompatibility were considered together to predict accepted arguments (I), predictions were less accurate than when defeat alone was taken into account (in contrast to the results of Experiment #1). A possible reason is that considering defeat and incompatibility together establishes a stronger criterion for making a prediction (to wit, a perceived attack was dismissed if it was not accompanied by a perceived incompatibility between the attacker and the defeated argument). This criterion seemingly leads the models to disregard weak signals that have predictive power.
(2) Model V achieved a lower performance than in Experiment #1 (though it improved somewhat in the quantitative version QV). Unfortunately, our possible explanation for (1) does not explain (2). Alternatively, we could relate both issues to how we inferred the participants’ preference among the values, which is the only methodological variation between the experiments. Whereas in Experiment #1, we directly asked them to choose whether they prefer one value over the other or were indifferent (i.e., a comparative framework), in Experiment #2 we requested the participants to numerically express the level of importance of each value separately (non-comparative framework). This might have weakened the predictive power of the values in Experiment #2, since that information affects validations of both defeat and value preference. Although we do not have an explanation of how the information difference could generate the effects of (1), it provides a good reason to account for (2). Indeed, some classic findings in experimental psychology (e.g. [38]) show that when individuals evaluate alternatives numerically, they do not necessarily make comparative judgments, and preferences inferred from those judgments may be reversed compared to more direct choices.
Arguments were more clearly associated with values than in the first experiment. Although in some cases a tendency toward single-value assignment was observed, in accordance with Bench-Capon’s model, some arguments were still associated with both values. The perception of the values promoted by arguments and the measure of their importance were still correlated with argument acceptance regardless of other variables intrinsic to the model (such as attacks, defeats, conflict-freeness, or any usual extension semantics), which makes V and QV simple estimation methods for argument acceptance. Predictions include those cases that cannot be represented as VAFs because arguments are perceived as promoting more than one value. In cases that can be represented as VAFs and in which values can be ordered asymmetrically, Bench-Capon’s model performs somewhat better (67% vs. 58%). Still, V and QV are more parsimonious, in the sense that they suggest less complex computations with fewer variables.
In the previous experiments, we were not able to determine on a good basis how participants perceived attack directions, independently of incompatibilities. That was in part due to the fact that the arguments’ conclusions in that scenario were contradictory, which could suggest an attack in either direction. As Bench-Capon [15] noted, knowing the type of attack can be crucial.10 An attack against a conclusion is known as a
Fifty-six volunteer students from different disciplines at the Universidad Nacional del Sur, Bahía Blanca, Argentina, participated in the experiment. Paper questionnaires were delivered by hand. Unlike the previous experiments, in the scenario of the present study we used different arguments A and B in such a way that B denies that the evidence can be used to condemn X. Therefore, B is an undercutting defeater of A. Moreover, in the wording we used, B begins with ‘However’, which suggests that B is a counterargument of A, hence, favoring a one-way interpretation of the attack:
Note that we could not have gotten the desired interpretation if we had asked about an incompatibility instead of an attack. If participants were able to understand that B attacks A and not the other way around, then we would have a good basis for analyzing defeat as dependent on a correct perception of the values promoted and preference over those values, in accordance with VAF’s underlying intuition. In addition, we asked about the comparative preference between the promoted values (as in Experiment #1, to avoid the possible effect that occurred in Experiment #2) and about any other possible values that the participants would consider to bias their choice. The questionnaire was as follows:
Table 6. Experiment #3. Tendencies in participants’ opinions
Now, according to VAF, we have the following predictions. If participants only perceived the attack from B to A, then:
(i) if the value promoted by A (evidence) is not more important than the value promoted by B (legality), then B defeats A; hence, participants should accept B and reject A;
(ii) if the value promoted by A (evidence) is more important than the value promoted by B (legality), then B does not defeat A; hence, participants should accept
In either case, B should be accepted, i.e., B should be
In Table 7, we compared the VAF-based predictions and H3 with respect to (i) under two different conditions: (a) assuming the asymmetric attack from B to A as a fact, no matter what participants said about that, and (b) taking into account only cases in which participants recognized that asymmetric attack. In our data, the group that is crucially relevant to compare the hypotheses was formed by 27 participants who accepted that (1) A promotes evidence, (2) B promotes legality, and (3) evidence is preferred or indifferent to legality. According to VAF, the participants accepting these three conditions would still accept B, while H3 predicts the opposite (note that, if condition (3) is not fulfilled, then the acceptance of B would not contradict any hypothesis).
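The contrast between the two sets of predictions can be made concrete with a small sketch. It assumes that A promotes evidence, that B promotes legality, and that only the attack from B to A is perceived. The function names and the encoding of preferences are hypothetical, and the H3 check is only one possible reading of the hypothesis: when legality is not preferred, exactly one of the two arguments is accepted, and when legality is preferred, H3 coincides with the VAF-based prediction.

```python
from typing import Set


def vaf_prediction(preference: str) -> Set[str]:
    """VAF-based prediction for the two-argument scenario.

    `preference` is 'evidence', 'legality' or 'indifferent'
    (a hypothetical encoding of the comparative answer).
    """
    if preference == "evidence":
        # A's value is strictly preferred: B's attack fails and both stand.
        return {"A", "B"}
    # Otherwise B defeats A, so only B is accepted.
    return {"B"}


def vaf_consistent(choice: Set[str], preference: str) -> bool:
    """True if the participant's accepted set matches the VAF prediction."""
    return choice == vaf_prediction(preference)


def h3_consistent(choice: Set[str], preference: str) -> bool:
    """True if the accepted set agrees with H3 under the reading stated above."""
    if preference == "legality":
        return choice == {"B"}
    # The conflict persists, so A and B are never accepted together.
    return len(choice & {"A", "B"}) == 1


# A hypothetical participant who prefers evidence and accepts only A.
example_choice = {"A"}
agrees_with_vaf = vaf_consistent(example_choice, "evidence")
agrees_with_h3 = h3_consistent(example_choice, "evidence")
```

A participant of this kind counts as a success for H3 and a failure for the VAF-based prediction, which is the type of case that separates the two hypotheses.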
For (a), V predicts that individuals will not accept both A and B, which could be interpreted to mean that the arguments are anyway considered in conflict, in concordance with H3. Consequently, they should choose
For (b), H3 had a prediction efficacy of 81%: 13 out of 16 participants chose only one argument. Moreover, 5 out of 6 (83%) chose A when they preferred evidence, and 57% (4 out of 7) and 43% (3 out of 7) chose A and B, respectively, when they declared indifference between the values. In contrast, the VAF-based prediction efficacy was 25% (4 out of 16): 1 out of 7 (14%) participants opted for both A and B when they preferred evidence over legality, and 3 out of 9 (33%) participants chose B when they declared indifference between the values. Finally, when legality was perceived as more important than evidence, and provided that participants identified A with evidence and B with legality, both VAF and H3 had identical (low) efficacy: both succeeded in the same 4 samples out of 8 (50%) under condition (a), and in the same 2 samples out of 6 (33%) under condition (b). However, if we ignored how participants identified the arguments and the values and just observed the value-argument correlation assuming that A promotes evidence and B promotes legality, both VAF and H3 were 73% successful in predicting the choice of B when legality was preferred.
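The efficacy figures reported for condition (b) are simple proportions of successful predictions, and the arithmetic can be reproduced with a short sketch. The helper names below are hypothetical, and the Wilson score interval is only an assumption used for illustration: the text does not state which interval method was applied in the study.

```python
from math import sqrt

def efficacy(successes, total):
    """Share of participants whose choice matched the prediction."""
    return successes / total

def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion.

    Assumption: the interval method is illustrative only and is not
    taken from the study; z=1.96 corresponds to roughly 95% coverage.
    """
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half

# Counts taken from the text for condition (b).
h3 = efficacy(13, 16)   # H3: 13 of 16 participants chose only one argument
vaf = efficacy(4, 16)   # VAF: 4 of 16 successful predictions
print(f"H3: {h3:.0%}, VAF: {vaf:.0%}")
print("Wilson interval for H3:", wilson_interval(13, 16))
```

With groups this small the interval is wide, which is one reason to read the percentage differences cautiously.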
Experiment #3. Successful argument acceptance prediction given B's attack on A
Next, we considered the acceptance predictions in the frameworks elicited by the participants’ opinions, comparing VAF, AF, and V (Table 8). As can be seen, V is the model with the highest predictive accuracy and is the only one of the three that performs above chance level.
Experiment #3. Percentages (mean) of successful predictions with confidence intervals
Regarding questions (7)–(9), we only obtained three answers, which is insufficient to shed any light on the behaviors.
It can be reasonably argued [Footnote 11] A reviewer raised this criticism.
Regarding the low prediction efficacy of both VAF and H3 when participants preferred legality to evidence, we can only provide speculative explanations. A possible psychological reason is that participants could declare preference for the “correct” value of legality, according to the rule of law, while their sincere feelings are either on the side of evidence or balanced between both values. More generally, the content of the arguments could induce the occurrence of biases and framing effects that modulate the expected incidence of value preferences. In particular, we used a scenario where conviction was under discussion, which could generate a leniency bias that affected the evaluation of the arguments. Experiment #4 is proposed to test this bias in different framings, as a classic challenge to normative effects on judgment and decision making [51].
Finally, the same outcome as that of model V could be attained by a VAF in which there is a bidirectional attack between A and B. So, a possible explanation for this outcome could be that the participants really perceive the conflict as a bidirectional attack, but do not indicate this in the questions about attacks, because they do not correctly understand the meaning of the specialist term ‘attack’ (or are misled by the words “is used to,” as discussed before). As can be seen, the problem of directionality is recurrent.
We recruited 74 undergraduate students from different disciplines at the Universidad Nacional del Sur and the Universidad Salesiana, Bahía Blanca, Argentina. The questionnaires were completed online in Google Forms. To test the hypothesis that the acceptance of value-based arguments can be modulated by biases and framing effects (H4), we designed two frames (scenarios) with identical representations as value-based argumentation frameworks and the same values, but with opposite conclusions (acquit/convict):
(F1).
(F2).
The same questionnaire followed each frame:
According to our intuition, A, B, A′, and B′ all promote the value of evidence, while both C and C′ promote the value of procedural legality. On the other hand, B and B′ pose asymmetrical (specificity-based) attacks against A and A′, respectively, while C and C′ generate asymmetrical (undermining) attacks against B and B′, respectively (Fig. 3).

VAFs with identical structure representing the frames used in Experiment #4.
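The structure just described (B attacks A, C attacks B; A and B promote evidence, C promotes procedural legality) can be explored with a minimal sketch of how a VAF selects arguments for a given audience. The function names and the numeric ranking of values are illustrative assumptions, not part of the original model. The sketch follows the definition given in the Introduction: an attack succeeds as a defeat unless the attacked argument promotes a strictly preferred value. Accepted arguments are then computed as the grounded extension of the defeat graph, which is the unique extension for an acyclic chain like this one.

```python
ARGUMENTS = ["A", "B", "C"]
ATTACKS = {("B", "A"), ("C", "B")}
VALUE_OF = {"A": "evidence", "B": "evidence", "C": "legality"}

def defeats(attacks, value_of, rank):
    """Keep attack (x, y) unless y's value is strictly preferred to x's."""
    return {(x, y) for (x, y) in attacks
            if not rank[value_of[y]] > rank[value_of[x]]}

def grounded_extension(arguments, defeat):
    """Iteratively accept arguments whose defeaters are all rejected."""
    accepted, rejected = set(), set()
    changed = True
    while changed:
        changed = False
        for a in arguments:
            if a in accepted or a in rejected:
                continue
            attackers = {x for (x, y) in defeat if y == a}
            if attackers <= rejected:
                accepted.add(a)
                changed = True
            elif attackers & accepted:
                rejected.add(a)
                changed = True
    return accepted

def predict(rank):
    """Accepted arguments for an audience; a higher rank means more preferred."""
    return grounded_extension(ARGUMENTS, defeats(ATTACKS, VALUE_OF, rank))

# Hypothetical audiences, encoded as numeric ranks over the two values.
print(predict({"evidence": 1, "legality": 1}))  # values equally important
print(predict({"evidence": 2, "legality": 1}))  # evidence preferred
print(predict({"evidence": 1, "legality": 2}))  # legality preferred
```

The same code applies to the second frame by relabelling A, B, C as A′, B′, C′, since both frames share one graph and one value assignment.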
We tested two conditions, C1 and C2, only varying the order of presentation of the frames: in C1 (
One point to observe was to what extent responses about the second frame showed different intrasubject perceptions of arguments and/or values with respect to the first frame. One variable of interest was the “positional concordance” among the arguments chosen in one frame and the other, and we classified it into three classes: a) strict concordance: argument X was chosen in one frame iff X′ was chosen in the other frame (i.e., the arguments chosen occupy the same position in their respective graphs); b) weak concordance: the selection of arguments changed regarding their positions but there is no conflict (attack) between their positions (i.e., from A to C′, or from C to A′); and c) non-concordance: the choice of arguments changed regarding their positions and there is conflict (attack) between their positions (i.e., from A to B′, from B to A′, from B to C′, or from C to B′).
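The three concordance classes can be expressed as a small classifier. This is a minimal sketch, not the procedure used in the study: the function name is hypothetical, choices are encoded by graph position (so A′ is written as "A"), and the handling of multi-argument choices is an assumption, since the definition above only lists single-argument moves.

```python
# Attack relation shared by both frames, written over graph positions.
ATTACKS = {("B", "A"), ("C", "B")}

def in_conflict(x, y):
    """True if positions x and y are linked by an attack in either direction."""
    return (x, y) in ATTACKS or (y, x) in ATTACKS

def positional_concordance(first, second):
    """Classify the choices made in the first and second frame.

    Assumption: for multi-argument choices, only the positions that
    differ between the two frames are checked for conflict.
    """
    first, second = set(first), set(second)
    if first == second:
        return "strict"
    dropped = first - second   # positions chosen only in the first frame
    added = second - first     # positions chosen only in the second frame
    if any(in_conflict(x, y) for x in dropped for y in added):
        return "non-concordance"
    return "weak"

print(positional_concordance({"A"}, {"A"}))  # same position in both frames
print(positional_concordance({"A"}, {"C"}))  # from A to C′
print(positional_concordance({"B"}, {"C"}))  # from B to C′
```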
VAF predicts the following behaviors regarding acceptability. If the values are assessed as having the same importance, then the prediction matches that of Dung’s model, that is, to choose A and C, and A′ and C′, for F1 and F2, respectively. If evidence is evaluated as being more important than procedural legality, then the prediction is choosing B and C, and B′ and C′, in F1 and F2, respectively. And if procedural legality is assessed as more important than evidence, then the prediction is to opt for A and C, and A′ and C′, in F1 and F2, respectively. The results show that the effectiveness of these predictions in the intrasubject analysis is zero, if we regard the exact extension as the only successful prediction. However, it is around 35% (37% in C1 and 33% in C2) if we consider choosing
With respect to biases, some of the participants changed the degree of importance given to at least one value from one frame to the other, which hovers at about 14% in C1 (
Regarding the question about any other value involved that influenced the choice of arguments, and which arguments supported that value, B and B′ were the most mentioned arguments (both conditions considered, 5 mentions in F1 and 7 in F2, respectively) and were associated with values such as responsibility, truth, justice and ethics. Next we have A and A′ (3 mentions in F1 and 4 in F2, resp.), identified with justice, and C and C′ (2 mentions in each frame), linked to law. In the vast majority of cases, influences were recognized as favoring the choice of those arguments, but not the rejection of others.
The results on intersubject argument acceptance are summarized in Table 9. The first and second columns show the percentages of acceptance in each frame and condition. Although the percentage variations could suggest some order effect, whereby arguments in the second frame are accepted with a different strength than in the first one, they are not statistically significant. This could be due to the small sample size. In any case, combining the results obtained in both conditions (third column), we obtained similar values for F1 and F2. Moreover, the differences remained non-significant when we combined the results for the first frame of each condition, that is, when participants answered without any previous exercise (last subcolumn). [Footnote 12] The comparisons between the frames on the acceptance of arguments by Fisher exact tests gave us the following values (
Variations of argument acceptance among conditions and frames in Experiment #4. Arrows represent the order in which frames were presented in each condition
Experiment #4. Accessions to give more importance to this value than to the other one
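The between-frame comparisons of argument acceptance rely on Fisher exact tests on 2×2 tables (accepted versus not accepted, by frame). As an illustration of the technique, here is a minimal pure-Python sketch of the two-sided test. The function name is hypothetical, and the counts in the example are placeholders, not the study's data.

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher exact p-value for a 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c

    def weight(k):
        # Unnormalised probability of a table with k in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    observed = weight(a)
    extreme = sum(weight(k) for k in range(lowest, highest + 1)
                  if weight(k) <= observed)
    return extreme / comb(row1 + row2, col1)

# Hypothetical counts (NOT taken from the study):
# rows = frame F1, frame F2; columns = accepted, not accepted.
print(fisher_exact_two_sided([[3, 1], [1, 3]]))
```

Working with integer weights keeps the comparison between tables exact, which avoids floating-point ties when deciding which tables count as at least as extreme as the observed one.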
Now, a certain interpretation of Table 9 gives us a different picture regarding VAF-based predictions. In the last column, we can see that the arguments B (B′) and C (C′) have similar percentages of acceptance (around 40% on average), above those of A (A′) (around 26%). Taking into account that, on average, evidence is preferred over legality in 49% of cases, against a 13% preference for legality over evidence (Table 10), we have a coincidence between the
Another possible explanation [Footnote 13] This was suggested by a reviewer.
In terms of framing, on the one hand, we observed in both conditions that, when moving from one frame to another with an isomorphic structure but contrary conclusions, individuals tend to modulate the strength of argument acceptance and rejection towards a more balanced assessment. However, the statistical differences are not significant, maybe due to the small sample size; hence, we plan to conduct more experiments in the future to address these issues. On the other hand, a tendency to change perceptions about the relative importance of the values involved in the argumentation frames was correlated to some extent with the perception of greater strength in lenient arguments. The data indicate that the value promoted by lenient arguments could be perceived as more important than the value promoted by harsh arguments. This is in line with findings about the influence of leniency bias on mock jury deliberations [39] and the outcome favorability as a strong determinant of individuals’ willingness to accept authoritative decisions [32].
Since [45], the use of empirical methods borrowed from experimental psychology to test the formal semantics of argumentation frameworks has gained some popularity [23,25,30,36,41,46,47,49,50]. In this work we have taken a step further, and applied that methodology to test value-based argumentation frameworks.
The approach had several motivations. One of them arises from the very methodological limitations (basically, inherited from research in non-monotonic reasoning) of building semantics of argumentation frameworks on common intuitions about the solution of a handful of benchmark problems. Extension semantics are not formal semantics in a strict logical sense. In the latter, a semantics defines truth conditions of sentences and a notion of entailment, while extension semantics just “provides a way to select “reasonable” sets of arguments among all the possible ones, according to some criterion embedded in its definition” [5]. In consequence, researchers have adopted ideas of “soundness” just on the basis of the mentioned intuitions. Baroni and Giacomin [6] have also proposed a series of principles or properties with which to evaluate extension semantics, but these are only formalized expressions of the same intuitions. Hence, it could seem legitimate to “advocate the use of psychological experiments as a methodological tool for informing and validating intuitions about argumentation-based reasoning” [45]. In this regard, we agree more with “informing” than with “validating”. For many years now, since the times of [51], cognitive psychology has been accumulating evidence that people deviate from formal, normative models of reasoning. However, although the experiments we reported here are along that line with respect to VAF, we cannot conclude that they undermine or invalidate its value as a normative model. VAFs can still be seen as idealizations of rational audiences that are able to ignore any unspoken, additional elements (values, biases, desires, etc.) in order to decide which arguments are better justified. Moreover, evidence can be used to generate new insights or modify old ones so that the model improves. Another motivation is to adopt a descriptive view and test VAF as a scientific theory to explain and predict human value-based argumentation. 
In this sense, contrasting with empirical evidence seems more legitimate as a means of validation. Our motivation, in sum, relies on using empirical data to, on the one hand, test VAF as a descriptive theory and, on the other, gain insights to improve it as a normative model. A clear limitation of our study is that we relied on very few scenarios, so we plan to explore more varied argumentative situations in the future to reach more conclusive results.
Experiments #1 and #2 allowed us to determine that people’s argument acceptance deviates from the predictions based on VAF’s semantics and is rather correlated with the importance given to the promoted values, regardless of the perceptions of argument interactions through attacks and defeats. On the other hand, most participants identified incompatibility between a defeated argument and its attacker, which seems to confirm the intuition that defeat presupposes conflict as a necessary condition. This has an interesting consequence regarding VAF. According to the model, if an attack is unsuccessful because the attacked argument promotes some preferred value, then the conflict between those arguments disappears, possibly leading to the acceptance of both arguments (see the notion of
Our results also showed that each argument can be perceived as promoting more than one value with different degrees of relative importance. In [27], the authors investigated Bench-Capon and Sartor’s case-based reasoning system [10] in that line, but the negative results on predicting good explanations for legal reasoning suggest the need for more research to fit the model. In the context of VAF, we thought of extending the model to allow representing arguments that promote more than one value by introducing a function
Experiments #3 and #4 evidenced that
Experiment #4 also presented some framing effects. Under similar structural factors, a percentage of the participants tended to vary the degrees of importance given to the values from one frame to the other (F1 and F2), while the arguments with the same position in the respective graphs were accepted with different strength. This may be due to the occurrence of a leniency bias towards the accused, according to the information from the framework. Some psychological findings could explain our results. For instance, MacCoun and Kerr [39] showed that in mock juries, given two different decision procedures with various outcomes, such as convict or acquit, people tend to choose the procedure that leads to the benevolent outcome. Then, there could be a similar effect in the face of two structurally identical frameworks but with distinct framings and outcomes, such that people tend to choose different decision procedures (say, extension semantics) according to the bias. In the same way, Esaiasson et al. [32] argued that the tendency to prefer decisions leading to favorable outcomes is usually stronger than the preference for fair procedures. Mercier and Sperber [40], moreover, claimed that skilled arguers are not after the truth but after arguments supporting their views. The authors also argued that participants tend to show biased evaluations: they scrutinize the arguments contrary to their opinions, looking for flaws such as fallacies, and end up finding some. In addition to biases, people rarely evaluate arguments using only the information offered; they also take into account their own information and arguments.
In sum, persuasion depends to a large extent on psychological and informational factors, so the design of a normative model entails the arduous task of discerning which of these factors are in fact necessary for persuasion. In this vein, for example, Bench-Capon, Atkinson and McBurney [18] combined an action-based alternating transition system [1] with VAF to model some game-theory problems (particularly, the dictator and the ultimatum games), and their approach can account for framing effects described in the literature. Depending on the way a problem is described, different arguments are available, leading people to make distinct decisions even though the utility is the same in all frames. Hence, the behavior can be rationalized by analyzing the interaction of the arguments in the model. The work is a good example of how experimental research can provide information and insights to develop practical argumentation models.
