Abstract
Keywords
Introduction
The vast amount of digital data available online provides unprecedented opportunities for automated analyses. For example, text data of all kinds make it possible for researchers in the field of linguistics to employ a bottom-up approach to understand various aspects of language: while the traditional way of manual text investigation involved static corpora, linguists nowadays can analyze text data that reflect global events and ongoing language evolution. The research on specific language phenomena benefits from text data collected from web sources such as online social media (Twitter, Facebook, blogs, forums, etc.). Those texts are typically created by multiple authors who are engaged in discussions or refer to each other’s messages in which they express their thoughts and opinions.
This presents an opportunity for researchers who are interested in stance analysis.
Research on stance includes both theoretical efforts (related to the definition and the knowledge about the nature of this phenomenon) and practical efforts (related to collecting evidence and explaining the means of taking stance), and it can lead to various text analytics applications. The practical tasks require processing large quantities of textual data that are infeasible for manual investigation, for example, providing a temporal overview of stance usage in social media, retrieving the corresponding text data relevant to stance phenomena, or analyzing the occurrences of stance expressions. Therefore, stance researchers are interested in automated ways of text processing that can be offered by researchers from the field of computational linguistics or natural language processing (NLP).
However, many linguists face difficulties when trying to interpret the output of NLP algorithms. For NLP experts, it is equally challenging to gain insight into the underlying text data and to provide useful feedback in order to refine their automatic analyses. In fact, NLP researchers would also benefit from a technique that could improve their understanding of the computational processes associated with the state-of-the-art NLP algorithms (e.g. it is difficult to interpret the state of a large artificial neural network just by weight matrices). This predicament can be resolved by introducing a visual analytics (VA) approach to provide linguistics researchers with interactive visualizations for analyzing large text data and for presenting the NLP experts with feedback at the same time. Our research project StaViCTA (Advances in the description and explanation of Stance in discourse using Visual and Computational Text Analytics (project web page: http://cs.lnu.se/stavicta/)) addresses this challenge and aims to produce a refined theory of stance, efficient interactive visualization, and computational techniques for its analysis, as well as solutions for specific applications. Due to the early stage of research in stance analysis, the project itself follows an iterative progress plan. Therefore, we consider sentiment analysis, including certainty or uncertainty, as underlying aspects of linguistic stance in order to support the construction of the model in general.
In this work, we focus on the exploration of social media documents (in English) and the collection of a training dataset which later will be used to develop appropriate machine learning (ML) approaches. The composed training data consist of text chunks, called

The diagram gives an overview of the underlying research problems from the user perspective. To succeed with the analysis of stance, linguists require means to analyze and interact with the output of NLP algorithms as well as means of further manual investigation. These means are still missing in the analysis loop and are indicated by the red question mark. The dashed edges denote the user operations that depend on the results of interactive visual analysis.
Here, we present our tool called uVSAT that can help stance researchers to identify candidate documents that may contain stance expressions, analyze the document texts, and export the new stance markers (as introduced in our previous poster abstract 1 ). uVSAT supports the research task of how we can study the use and patterns of stance meanings and stance expressions in human communication over time in order to investigate what stance markers and stance markings are used when, why, how, where, and in what type of dialogic sequences related to the contexts where they occur. Our effort described in this article is meant to complement the existing techniques for stance analysis based on manual close reading and traditional linguistic tools by introducing a VA approach to this problem, while not providing a completely automatic stance analysis yet. The main contributions of the VA approach presented in this article include the following:
A web-based
An
Interactive
The remainder of this article is organized as follows: the next section provides the background of stance analysis from the perspective of linguistics and NLP. The subsequent section covers the related work in text visualization, including work dedicated to sentiment analysis visualization. After this, we explain the system architecture and data model as well as user tasks supported by uVSAT. Then, we describe in detail our visualization and interaction approaches for this tool. The subsequent section discusses a use case from the linguistics domain based on exploration of data with regard to anger sentiment as a subcategory of stance. The penultimate section provides the results of a domain expert review and our reflections about the tool. Finally, we summarize the contributions and future work in the last section.
Background
Our research on visual stance analytics is by nature tightly connected to the domains of linguistics and NLP. Since the problem of stance analysis is not widely discussed in the VA community (as opposed to sentiment analysis), we present the theoretical background of stance and its relation to sentiment in this section.
Stance and sentiment model
Stance is a topical area of interest in linguistics because the interactive nature of communication between individuals is considered vital. The function of taking stance in the communicative situation is to convey the speaker’s viewpoint of what is talked about and to regulate the exchange between the dialog partners. Communication here works on more than the pure understanding of words. Words are always understood in the light of the contexts and the situations where they are used.2,3 In doing so, language is used to recontextualize human experiences into written and spoken forms. Its social role is to affect the state of mind of other people and to negotiate meanings in order to bring about cognitive changes.4,5 Language users construe their expressions to communicate their particular perspective and viewpoint of what is talked about. As the following scheme 6 demonstrates, this process of taking stance is evaluative and fundamentally interactional, a type of ongoing negotiation:
An utterance proposed by X;
Y’s engagement (mental processing or interpretation or positioning) as to the utterance in context;
Y’s response to X’s utterance;
X’s engagement (mental processing or interpretation or positioning) as to the utterance in context;
X’s response to Y’s utterance;
Repeat 2–6.
Ours is a broad understanding of the process of taking stance, as it is critical to address the subtle but important differences in how people create discourse—imbuing it with their personal word choices as distinct acts of taking stance. This encompasses expressions of subjectivity, ranging from individual words to larger chunks of text. These items express speaker’s (1) sentiments, (2) attitudes, and (3) beliefs, covering meanings of certainty, volition, evidence, emotion, valence, degree, and so on. Following Du Bois, 7 we divide the process of taking stance into three parts: (1) speaker evaluation of what is talked about, (2) speaker positioning (epistemicity), and (3) alignment in communication, that is, establishment of agreement or disagreement. Stance has been studied under different headings and scope, such as evaluation,8,9 sentiment, 10 and appraisal, 11 and, of course, under the title stance itself.5,12–14 Yet, at the present time, there is no conclusive and universally accepted definition of linguistic stance.
As stated above, subcategories of stance include sentiment, certainty or uncertainty, as well as other subcategories that are not well-defined yet. For this article, we have limited the scope of our understanding of stance to sentiment and certainty or uncertainty. These subcategories are generally considered to describe the feelings and assessments of an utterance; as such, they can encapsulate an evaluative statement that is deemed to be a stance act. Our approach is based on the expectation that the occurrences of such expressions lead to occurrences of other stance expressions—we denote the particular analyzed subcategories by
Sentiment analysis
From an operational point of view, stance includes phenomena such as subjectivity, sentiment, belief, trust, and uncertainty. Some of these phenomena, such as sentiment and subjectivity, have enjoyed considerable attention in the NLP community (for instance, see the works of Pang and Lee, 15 Liu, 16 and Lin et al. 17 ), while others, such as belief, trust, or uncertainty, have remained comparatively peripheral (although there are a number of efforts18,19 to analyze uncertainty and speculation, respectively). Sentiment analysis in particular has become a staple in NLP, both in research and in commercial applications, with a large number of vendors offering solutions for social media monitoring where sentiment analysis is an important part of the analytics suite.
As with any research area that gains popularity in a research community, there has been a wide variety of approaches suggested in the literature. Examples range from simple keyword matching 20 over standard machine learning techniques15,21 to the use of topic modeling algorithms and latent variable models22–24 to deep learning architectures.25,26 State-of-the-art approaches to sentiment analysis now approach, and in some cases even exceed, 90% accuracy on standardized benchmark test suites.21,27,28
Sentiment analysis is normally considered as a classification problem over two or three classes, where
As opposed to some of the more complex approaches based on ML, we opt for a simplistic approach to sentiment classification for the purposes of the visualization tool in order to preserve transparency and simplicity. As previously noted, we have chosen to address stance through subcategories. More specifically in uVSAT, these are based on Ekman’s Big Six emotions, employing the NLP solution of simple lexical matching over lists of attitude terms (which we call stance markers as already mentioned in the “Introduction” section). The main goal at this stage of the project is to facilitate experiments to further improve our understanding of stance in general and our analysis techniques in particular. While our method of sentiment analysis is simple, such a lexical-based approach is still widely used by visualization and VA solutions,33,34 especially the ones aiming for high performance when processing large amounts of input data. 35 There are also several examples of combining both lexical-based and machine learning–based approaches for sentiment analysis that report similar 36 or even surprisingly good 37 results when using the lexical approach.
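The lexical matching described above can be illustrated with a minimal sketch. Python is used here only for illustration; the marker lists, the grouping of markers under the labels "anger" and "joy", and the function name `count_markers` are hypothetical and do not reproduce the lists or the implementation used in uVSAT. The example sentence is the one discussed in the "Data model" subsection.

```python
import re
from collections import Counter

# Hypothetical marker lists keyed by stance type (assumption for illustration;
# the real lists are far larger and are refined with the tool).
MARKERS = {
    "anger": ["sick of", "distasteful"],
    "joy": ["delighted"],
}

def count_markers(text, markers=MARKERS):
    """Count whole-word, case-insensitive occurrences of each marker per type."""
    counts = {}
    for stance_type, terms in markers.items():
        found = Counter()
        for term in terms:
            # \b keeps "sick of" from matching inside longer words.
            pattern = r"\b" + re.escape(term) + r"\b"
            n = len(re.findall(pattern, text, flags=re.IGNORECASE))
            if n:
                found[term] = n
        counts[stance_type] = found
    return counts

sentence = ("I am so sick of people who sell such rifles and so sick of "
            "people who buy this distasteful weapon.")
result = count_markers(sentence)
```

Under these assumptions, `result["anger"]` records two occurrences of "sick of" and one of "distasteful", while `result["joy"]` stays empty. Plain matching of this kind ignores negation and context, which is part of the transparency trade-off discussed above.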
Related work
Our tool uVSAT was designed to visualize and interact with large text data sources as well as the results of automatic text processing which include time-series. There have recently been multiple works dedicated to text visualization and analytics of social media. Survey articles by Alencar et al., 38 Gan et al., 39 Kerren et al., 40 and Kucher and Kerren 41 demonstrate a variety of techniques used for the visualization of single documents, document collections (corpora), and text-related data streams. In this section, we will discuss several groups of works relevant to our research from various aspects.
Time-dependent text visualization
A good number of such works address temporal aspects to visualize events, topic competition or evolution, or other time-dependent data. While some of them introduce novel metaphors for visual encoding, multiple techniques combine well-known representations such as line plots, river metaphors, or animated force-directed graphs. Havre et al. 42 introduce ThemeRiver, the original technique for temporal data visualization based on a river metaphor that is designed to depict topic evolution in document collections. Dou et al. 43 combine trees, text tags, and rivers in their HierarchicalTopics system to visualize the temporal evolution of topics in corpora. Xu et al. 44 combine line plots, stacked charts, and word clouds to depict topic competition in social media document collections. To support the real-time monitoring of streaming Twitter data backed up with automatic text classification, Bosch et al. 45 use timelines, word clouds, glyphs, and maps in the ScatterBlogs2 system. For the work in this article, we decided to choose simple visual representations (line plots, text tags, and bubble charts) for the data currently available to us, although we plan to design more specialized visual encodings for other tasks in the future.
Sentiment visualization
While specific problems (and the corresponding analysis techniques) such as topic modeling and event detection have been very popular in text visualization, interest in sentiment analysis and visualization is also growing in the VA community. Liu et al. 46 and Oelke et al. 47 describe visualizations for opinion mining of reviews. Wanner et al., 48 Cui et al., 49 and Rohrdantz et al. 50 present approaches for visual sentiment analysis that support temporal data. Görg et al. 37 describe the fluid integration of sentiment analysis as well as other computational text analyses with interactive visualizations in their system Jigsaw. Online social media data are used for visual sentiment analysis by Wanner et al., 51 Zhang et al., 52 and Hao et al. 53 SentiView, introduced by Wang et al., 54 not only facilitates temporal sentiment analysis but also augments it with relation analysis based on graph representation; this is relevant to our long-term research goals involving intersubjectivity and stance analysis. The recent work of Zhao et al. 33 describes PEARL, a VA system for multidimensional personal emotion or sentiment visualization of Twitter posts over time, and uses an approach similar to ours (based on lexical matching of
Visualization for linguistic research
InfoVis and VA techniques have been used to facilitate tasks such as the analysis of corpora (e.g. Compus by Fekete and Dufournaud, 55 CorpusSeparator by Correll et al., 56 Text Variation Explorer by Siirtola et al., 57 and those techniques proposed by Regan and Becker 58 ), the analysis of relations or reuse (e.g. ShakerVis by Geng et al. 59 and techniques proposed by Jänicke et al. 60 ), and lexical analysis (e.g. the study by Rohrdantz et al. 61 ). An additional category of tasks that is worthy of mention is related to semantics: while numerous text visualization techniques use topic modeling, experts in computational linguistics use visualization to facilitate their research on this subject. For instance, Kabán and Girolami 62 visualize their own model of dynamically evolving text collections. Another task related to stance analysis is discourse analysis. Existing work on visualization of discourse includes the graph-based approach by Brandes and Corman, 63 Conceptual Recurrence Plots by Angus et al., 64 and several recent works that focus on discourse in online social media: Lingoscope by Diakopoulos et al. 65 or ConVis by Hoque and Carenini. 66
VA for sentiment research
Finally, the work that is most relevant to our approach in this article is dedicated to sentiment visualization which facilitates the research on sentiment for linguists. Gregory et al. 67 conduct visual sentiment analysis of document collection with regard to
To the best of our knowledge, the problem of stance analysis and visualization has not been addressed by work in VA or information visualization. Therefore, we would like to raise the awareness of the InfoVis and VA communities in this article by building on the discussed work in text visualization for sentiment analysis and existing work on visual text analytics for linguists.
Overall architecture and data
Before we can discuss the overall architecture of our VA approach, we have to briefly present the different members of the StaViCTA project in order to motivate our designs. The visualization group at the Department of Computer Science, Linnaeus University, is responsible for VA research and the development of the VA approaches needed in the project and presented in this work. A domain expert group in linguistics at the Centre for Languages and Literature, Lund University, is in charge of task identification, stance theory construction, evaluation, and so on. Finally, a group at the company Gavagai has broad knowledge in NLP and develops automatic analysis techniques and tools for the project. Gavagai monitors and processes online media (e.g. newswire, weblogs, forums, and social media such as Twitter and Facebook) for media monitoring and text analytics purposes.
System architecture and workflow
Figure 2 displays the overall architecture of uVSAT that is implemented as a web application. The back-end consists of a (visualization) server application implemented in Java that communicates with the Gavagai computing server, fetches the HTML content from URI links, processes the text data, and communicates the results in JSON format to the client(s). The front-end is implemented in JavaScript with D3 69 and Rickshaw 70 libraries, and it only requires a modern web browser. While the major and cost-intensive computational analyses are processed by the Gavagai and visualization servers, several minor analyses (which do not require intense computations for large amounts of data) are implemented on the client side.

The architecture of uVSAT comprises front-end and back-end tiers that communicate with external servers.
Data model
uVSAT has been designed to use time-series data from external providers through a RESTful API, 71 as well as to fetch and process corresponding HTML data from respective web servers. Currently, we use time-series data only from our collaboration partners at Gavagai (although we plan to support other data sources in the future). Gavagai analyzes text data from multiple sources, but for the purposes of the system presented in this article, they use the data fetched from various blogs and forums.
As mentioned in the “Background” section, we focus on the simplest possible type of stance analysis, that is, counting the occurrences of sentiment terms in documents that mention specific target terms. This simple approach allows our partners to support the analysis of large amounts of text data, up to 15 million documents per day. Here, a
To detect documents associated with stance, we consider specific markers relevant to sentiment and (un)certainty from several available sources (WordNet-Affect, 72 GeneralInquirer, 73 and Compass DeRose 74 ), while refining those marker lists is one of the purposes of uVSAT (since the sources above do not differentiate stance from sentiment, etc.). Our choice of analyzed stance types (also denoted by
As an example, the sentence “I am so sick of people who sell such rifles and so sick of people who buy this distasteful weapon.” contains two occurrences of the stance marker “sick of” and one occurrence of “distasteful,” generating a
To summarize the description of
The occurrence counts are aggregated for each target–observer combination (
The Gavagai API also provides URIs to the documents used to calculate the polarization values (taking (
Requirement analysis
Having introduced the fundamentals and research gaps of visual stance analytics, including a short discussion of the origin and structure of the available datasets, we can now take a closer look at the actual analysis challenges and the most important tasks that uVSAT should address. They are based on extensive discussions with our collaboration partners in linguistics and computational linguistics.
Analysis challenges
We have designed uVSAT to help users answer the following questions:
Analytical tasks
These questions and problems can be mapped to the following categories of high-level (analytical) tasks:
In the following section, we discuss our visualization approach in detail, justify the design decisions, and refer back to the above-listed research questions and tasks.
Visualization approach
The graphical user interface (GUI) of our tool offers a tab-oriented design with two types of tabs (cf. Figures 3 and 4): a single timeline view tab that is used to work with an arbitrary number of timeline plots, and multiple document view tabs that are opened by the user when fetching the document URIs for selected time intervals. As the timeline view is the entry point of all visual analyses supported by our approach, we start our discussion with this view.

The screenshot of our tool shows the

The screenshot displays our
Timeline view
The timeline view tab (cf. Figure 3) provides the users with the interfaces for exploring time-series data for selected targets or observers and specified time intervals. Note that fetching the input data to be analyzed—that is, the initial selection of specific targets, observers, and time ranges—from the Gavagai server is done via a simple dialog box as explained in our use case (cf. the corresponding section). In this section, we concentrate on overall design aspects including visual representation and interaction possibilities.
Color coding considerations
Before we address the particular representations, we have to explain the color coding scheme used for the timeline view as well as document views. As mentioned in subsection “Data model,” the analyses supported by our tool involve the combinations of targets
The analyses employed by document views (see the corresponding subsection below) concentrate on the observers, that is, stance types, and do not differentiate between observers related to various targets. Consequently, the color coding for document views was initially based on ColorBrewer, 76 with separate colors for observers and targets.
Afterward, we changed the color coding used for the timeline view in accordance with the TreeColors approach. 77 To generate the colors, we have
Data hierarchy view
After the input data have been loaded, the users are provided with the data hierarchy view displayed in Figure 3(a) that shows the hierarchical structure of the available target–observer combinations. Users can also open a tab with iconic “overview plots” (cf. Figure 10) for
Timeline plots
uVSAT uses a standard line plot representation for time-series data (cf. Figure 3(b)) and supports the usual interaction techniques for such plots (research question Q1). We have chosen this visual representation as our domain experts are already familiar with it. In addition, line plots can be easily extended with additional graphical features. Details on hover, plot overview, and scroll and zoom are provided by default by the Rickshaw component. Users are also able to filter the plots with regard to visible target–observer combinations by switching the corresponding labels on and off. Our tool supports multiple plots displayed on the same canvas (users can drag-and-drop additional items from the data hierarchy view) or separately (users can drag the plot containers to change the timeline view layout). For the comparison of several plots displayed side by side, users can control the automatic vertical scaling; by default, plots are scaled to fit the containers. This functionality was explicitly requested by our domain experts.
ROI highlighting
To facilitate the search for ROIs, our tool also supports automatic ROI highlighting (research question Q2). Currently, we use a basic ad hoc algorithm for marking the ROIs based on outlier or differential analysis. As a first step of the algorithm, time-series points
Since the source time-series data are in general noisy,
ROIs are highlighted by thick line segments (cf. Figure 3(b)). The algorithm parameters
Trend analysis
Users have several options for conducting trend analyses over selected time intervals for specified observers (cf. Figure 3(c)). uVSAT supports linear and quadratic time-series trend analysis based on polynomial regression (calculated with the ordinary least squares (OLS) method). We implemented two variations: one can choose either to render trends as overlay plots (cf. Figure 5(a)) or to substitute selected timeline plot segments with trend lines (cf. Figure 5(b)) to reduce the visual complexity of the displayed data (research questions Q1 and Q2). Trend lines are easily distinguishable by the use of dashed lines. Information about the predicted value change at the current trend rate and a button for removing trend lines are also available on hover.

Trends can be displayed as either (a) overlay plots or (b) instead of original plot segments.
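The linear and quadratic trend fitting mentioned in this subsection can be sketched as follows. This is a minimal pure-Python illustration of polynomial regression with ordinary least squares via the normal equations; the function names `fit_polynomial_ols` and `evaluate` and the sample values are assumptions made for this example and are not taken from the uVSAT code base.

```python
def fit_polynomial_ols(xs, ys, degree):
    """Return [c0, c1, ...] minimizing the squared error of
    c0 + c1*x + ... + c_degree*x**degree (ordinary least squares)."""
    n = degree + 1
    # Normal equations (A^T A) c = A^T y for the Vandermonde matrix A.
    ata = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    aty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Augmented matrix, solved by Gaussian elimination with partial pivoting.
    m = [row + [b] for row, b in zip(ata, aty)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (m[i][n] - tail) / m[i][i]
    return coeffs

def evaluate(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))

# Hypothetical time steps and occurrence counts (illustrative values only).
days = [0, 1, 2, 3, 4]
counts = [1, 3, 5, 7, 9]                      # follows 1 + 2*x exactly
linear = fit_polynomial_ols(days, counts, 1)
quadratic = fit_polynomial_ols(days, [d * d + 1 for d in days], 2)
```

In this sketch, the difference `evaluate(linear, 5) - evaluate(linear, 4)` gives the change predicted for the next time step at the current linear trend rate. For long or badly scaled time axes, a QR-based solver would be numerically safer than the normal equations used here.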
Document URI links queries
As soon as the user is more interested in the concrete documents whose frequencies are represented by the different time plots, he or she can select time intervals for specific sets of observers and load the corresponding URI links to the documents (research question Q3). In this case, a new document view tab is created and a thumbnail of the line plot used for the query is displayed in this new view in order to preserve the mental map. An example of this thumbnail can be seen in Figure 4 in the left upper corner.
History diagram
Since the workflow of uVSAT involves multiple document view tabs that also may be closed by a user during the analysis process, the need for overview and control of such user actions arises. Our interactive history diagram (cf. Figure 3(d) and Figure 6) provides an overview of the document URI queries sequence, their results, and relations to each other (research questions Q6 and Q7).

The history diagram allows users to keep track of document queries and navigate between interface states.
In this diagram that supports the so-called analysis provenance, 78 nodes represent URI queries and edges represent the detected relations between corresponding query results (this partially resembles the visualization approach described by Cernea et al. 79 ). The size of every node is proportional to the number of URI links retrieved for the corresponding query. Nodes are represented by glyphs similar to pie charts (although only qualitative information about relevant observers is used), following the same color coding of observers as the timeline plots. The currently selected node is highlighted in yellow. Since the diagram is used for history navigation, it also contains a dedicated node (depicted by a triangle) that represents the up-to-date interface state. Edges connect only nodes whose query results contain common subsets of URI links. The size of common subsets (i.e. Jaccard similarity of link sets 80 ) is mapped to edge opacity, thickness, or both of these attributes (selected as a user setting). The layout of the history diagram is based on arc diagrams by Wattenberg: 81 nodes are simply aligned along a horizontal axis in the order of corresponding queries, and edges are rendered as curved arcs. We apply a random-order greedy heuristic described by He et al. 82 to decrease the number of edge crossings when allocating edges to the upper or lower part of the drawing.
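To make the edge construction concrete, the following minimal sketch computes the Jaccard similarity between the URI sets of each pair of queries and keeps only pairs with a non-empty overlap. The function names and the example URIs are hypothetical; how the similarity is then mapped to opacity or thickness is left to the rendering code and is not shown.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two URI collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def history_edges(queries):
    """Return (i, j, similarity) for every pair of queries sharing a URI.

    `queries` is a list of URI sets in the order the queries were issued.
    """
    edges = []
    for i in range(len(queries)):
        for j in range(i + 1, len(queries)):
            sim = jaccard(queries[i], queries[j])
            if sim > 0:
                edges.append((i, j, sim))
    return edges

# Hypothetical query results (illustrative URIs only).
queries = [
    {"http://example.org/a", "http://example.org/b"},
    {"http://example.org/b", "http://example.org/c"},
    {"http://example.org/d"},
]
edges = history_edges(queries)
```

Here only the first two queries are connected, with a similarity of one third, because they share one of three distinct URIs; the third query remains an isolated node.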
The interactive history covers the following functionalities: every time a user issues a URI links query that leads to the creation of a new document view tab, the state of this new tab and the timeline view tab are saved and a corresponding node is added to the history diagram. When the user clicks on a history node, the timeline view tab state is restored, a document view tab with the corresponding state is either created or brought into focus (if currently present), and the user actions temporarily stop affecting the history state (e.g. issuing a new query will not add the resulting state to history); we have chosen such behavior to keep the history sequential. When the user clicks on the triangle, the previously saved up-to-date state is restored. Under some circumstances, this can lead to some document view tabs getting closed.
Document views
A document view tab (cf. Figure 4) basically consists of two areas. The left (smaller) area provides information about all documents fetched based on the selection described at the end of subsection “Timeline view.” Thus, it shows the aforementioned line plot thumbnail used for the query as well as a link list (cf. Figure 4(a)) to HTML documents (blog posts, forum messages, etc.) that were marked as associated with a specific target–observer combination. Users can filter the list by URI domain and sort it by the timestamp value or by polarization value (as reported by the Gavagai server). Polarization values are also used for the color coding of list entries (research question Q3).
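The filtering and sorting of the link list can be sketched as below. The record layout (`uri`, `timestamp`, `polarization`), the sample values, and the function names are assumptions made for illustration; the actual JSON structure exchanged between the servers and the client is not specified here.

```python
from urllib.parse import urlparse

# Hypothetical link records (field names and values are illustrative).
links = [
    {"uri": "http://blog.example.org/post1",
     "timestamp": "2014-05-02T10:00:00", "polarization": 0.4},
    {"uri": "http://forum.example.com/thread/7",
     "timestamp": "2014-05-01T08:30:00", "polarization": 0.9},
    {"uri": "http://blog.example.org/post2",
     "timestamp": "2014-05-03T12:15:00", "polarization": 0.1},
]

def filter_by_domain(items, domain):
    """Keep only links whose URI host equals the given domain."""
    return [item for item in items if urlparse(item["uri"]).netloc == domain]

def sort_links(items, key, descending=True):
    """Sort links by 'timestamp' or 'polarization'.

    ISO 8601 timestamps in a uniform format sort correctly as strings.
    """
    return sorted(items, key=lambda item: item[key], reverse=descending)

blog_links = filter_by_domain(links, "blog.example.org")
by_polarization = sort_links(links, "polarization")
by_time = sort_links(links, "timestamp", descending=False)
```

With the sample records, filtering by `blog.example.org` keeps two entries, and sorting by polarization places the forum thread first.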
By selecting a link from the list, the corresponding document content is fetched, processed at the (visualization) server side, and rendered at the client side. If the content is not available at this time, the corresponding list entry is marked. The document data at this stage are raw HTML which affects the analysis. This is because the source code comments and metadata (such as keywords) often contain text irrelevant to the document content. To direct the user’s focus on textual document data, uVSAT renders the HTML content as plain text by using the Jericho library. 83 All data and analysis results related to the single focus document are shown in the second area on the right-hand side of the document list. This area integrates four subviews: the current document view, the current document details view (not further discussed here), the document marker view, and the current document overview.
It should be noted that uVSAT also provides an opportunity to copy the query link for a given document view tab and to use it in later analysis sessions by opening a tab with identical contents (research question Q6).
Current document view
Figure 4(b) displays the text representation of a document. The stance markers and target terms are highlighted and support brushing in coordination with the other views (research question Q4). The motivation for the color coding for document view tabs was described above: it uses a scheme with eight colors for stance markers and a separate scheme with five colors based on ColorBrewer for target terms since targets share stance markers associated with observers (types of stance), for example, the word “commendable” is a marker of
Document marker view
Information about stance markers (and their occurrence counts) as well as target terms detected in the current document is summarized in the document marker view (cf. Figure 4(c)). The stance markers for each observer are sorted by their counts to facilitate user investigations (note that target term occurrences do not affect the statistics since such terms are not directly related to expressions of stance). The users can navigate the document with regard to marker or term occurrences and filter them (research question Q4).
Current document overview
To give users an overview of marker or term distributions in the current document (and an additional means of navigation), uVSAT provides several visual representations displayed in Figure 4(d). First of all, a two-dimensional (2D) overview is visualized by mapping the current positions of all markers or terms onto a canvas (they are represented by circles and diamonds, respectively). The current viewport is displayed as a rectangle. This overview supports navigation by clicking on a plot item or the canvas. Additionally, a separate one-dimensional (1D) overview for each observer and target is visualized by projecting the positions of corresponding markers or terms onto a vertical axis. Such overviews help the users to immediately perceive the distributions over the document length since the 2D overview can become cluttered in case of numerous markers or terms. 1D overviews support document navigation by clicking on plot items. Seeing such distributions is especially interesting for our domain experts because it is important for a better understanding of stance in discourse (research question Q4), for instance, if a marker for a specific stance type mostly occurs in the context of another marker.
Aggregation charts
While the techniques discussed above allow the users to analyze a selected document in detail and provide an indication of interesting documents (by polarization values), the document sets retrieved for certain queries may contain thousands of documents, and the users will benefit from a method that helps them to select documents that are interesting for further stance marker investigation (research question Q5). uVSAT addresses this problem with a technique that we call aggregation charts.
The visual representation is based on basic bubble charts described by Viégas et al. 84 Every item in the chart represents a single document which corresponds to the target; the color coding is based on the nominal target values. A single item is visually represented by a glyph consisting of two nested circles. The size of the outer circle is proportional to the total number of corresponding stance markers detected in the document, and the size of the inner circle (filled with a more saturated color) is proportional to the number of unique marker types detected in the document. For instance, a document with 100 occurrences of a marker “good” and 100 occurrences of a marker “bad” has only two unique marker types: “good” and “bad.”
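The two quantities behind each glyph can be computed directly from the list of marker occurrences in a document. The sketch below is illustrative only: the text states that circle size is proportional to the counts, and the code assumes that "size" means circle area (so the radius grows with the square root of the count). The function name `glyph_radii` and the parameter `area_per_unit` are hypothetical.

```python
import math
from collections import Counter


def glyph_radii(marker_occurrences, area_per_unit=1.0):
    """Return (total, unique, outer_radius, inner_radius) for one document.

    marker_occurrences: list with one entry per detected marker occurrence.
    area_per_unit: assumed area (in square pixels) per counted item.
    """
    counts = Counter(marker_occurrences)
    total = sum(counts.values())   # all marker occurrences -> outer circle
    unique = len(counts)           # distinct marker types  -> inner circle
    # Assumption: area is proportional to the count, so r = sqrt(area / pi).
    outer = math.sqrt(area_per_unit * total / math.pi)
    inner = math.sqrt(area_per_unit * unique / math.pi)
    return total, unique, outer, inner


# The example from the text: 100 occurrences of "good" and 100 of "bad".
total, unique, outer, inner = glyph_radii(["good"] * 100 + ["bad"] * 100)
```

For this document the outer circle encodes 200 occurrences while the inner circle encodes only two marker types, so a large outer circle with a small inner circle indicates heavy repetition of few markers.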
The aggregated data used for these charts can be organized in two ways: by observer and by stance marker. In the former case, a separate chart is visualized for each observer associated with the document set. In the latter case, one individual chart is visualized for each unique marker type (belonging to present observers) that has been detected in at least one document.
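The two organization modes can be illustrated as two groupings of the same detection records. This is a sketch under assumptions rather than the tool's data model: the `(document_uri, observer, marker)` tuple layout, the function name `build_charts`, and the sample observers and markers are hypothetical.

```python
from collections import defaultdict


def build_charts(detections, organize_by="observer"):
    """Group detections into one chart per observer or per marker type.

    detections: iterable of (document_uri, observer, marker) tuples.
    Returns {chart_key: {document_uri: [markers found in that document]}}.
    """
    if organize_by not in ("observer", "marker"):
        raise ValueError("organize_by must be 'observer' or 'marker'")
    key_index = 1 if organize_by == "observer" else 2
    charts = defaultdict(lambda: defaultdict(list))
    for detection in detections:
        uri, marker = detection[0], detection[2]
        # Each chart item is a document; its marker list drives the glyph.
        charts[detection[key_index]][uri].append(marker)
    return {key: dict(docs) for key, docs in charts.items()}


detections = [
    ("doc1", "negativity", "bad"),
    ("doc1", "positivity", "good"),
    ("doc2", "negativity", "bad"),
    ("doc2", "negativity", "awful"),
]
by_observer = build_charts(detections, "observer")
by_marker = build_charts(detections, "marker")
```

In this sample, organizing by observer yields two charts, while organizing by marker yields three, one for each marker type found in at least one document.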
Figures 7 and 8 display examples of aggregation charts visualized for a document set based on 1517 URIs retrieved for the target–observer combinations
Aggregation charts facilitate the quick perception of the distribution of observers or stance markers in all documents, the identification of documents with a large number of stance markers or unique marker types, the navigation to such documents, and the analysis of document properties concerning other observers or stance markers (by brushing the corresponding chart item).
Marker and document export
One aim of our visualization tool is to identify and collect relevant stance markers from a larger number of analyzed documents (research question Q8). uVSAT supports the export of new stance markers from document view tabs: the user selects a portion of text in the current document view (depicted in Figure 4(c)), assigns arbitrary tags to it, and exports it to a JSON file. This approach allows us to collect a dataset of stance markers not restricted by the categories currently used for observers. Moreover, we are able not only to collect stance markers as short phrases (1-grams, 85 2-grams, or similar) but also to collect larger utterances which provide context for stance analysis.
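A marker export of this kind could be serialized roughly as follows. The article does not specify the JSON schema, so every field name here (`uri`, `selection`, `start`, `end`, `tags`), the function name `export_marker`, and the example URL and tags are assumptions for illustration only.

```python
import json


def export_marker(document_uri, text, start, end, tags):
    """Serialize a selected text span and its user-assigned tags as JSON.

    The record layout is hypothetical; uVSAT's actual format may differ.
    """
    if not 0 <= start < end <= len(text):
        raise ValueError("invalid selection range")
    record = {
        "uri": document_uri,
        "selection": text[start:end],
        "start": start,
        "end": end,
        # Tags are free-form, so they are not limited to observer categories.
        "tags": sorted(set(tags)),
    }
    return json.dumps(record, ensure_ascii=False, indent=2)


text = "This is a commendable effort."
start = text.index("commendable")
exported = export_marker(
    "https://example.org/post",      # placeholder URI
    text,
    start,
    start + len("commendable"),
    ["praise", "judgment"],          # hypothetical tags
)
```

Storing the character offsets alongside the selected text keeps the surrounding context recoverable, which matters when longer utterances are exported.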
Our tool also supports the export of currently viewed documents and aggregation charts as static HTML pages. In the former case, the document view with highlighted stance markers and target terms, document details, hierarchical markers view, and document overview (essentially, all the data pertaining to the current document on a document view tab) are exported. In the latter case, all aggregation charts that are currently available are exported together with the corresponding document set query (used observers, selected time interval, etc.). This feature allows users to store static data for further manual investigation or referencing, which can be especially helpful for researchers in linguistics.
Use case: linguistics research
The use case described here is one in which a linguist has chosen to analyze negative sentiments of stance (focusing on
To perform an accurate analysis, the researcher needs data that reveal the communicative forces and the attitudes toward the ideas discussed at different points in time, as well as possible relationships between those attitudes. By using uVSAT, the linguist is able to analyze these aspects of social media data at a scale that would be infeasible with manual stance analysis alone.
Timeline data analysis
First, the researcher uses the

The dialog box used to select the time intervals and target–observer combinations to load time-series data. Note that there are additional observer types (
By viewing the

Part of the timeline overview: the plots for observers are ordered by mean value in descending order,
The researcher immediately notices the spike of activity on multiple plots around early hours of 3 February CET, which corresponds to the late evening of 2 February EST—the time when the advertisement was aired in the United States (aim A1).
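The time zone correspondence can be checked with a short calculation. The sketch below uses fixed offsets (CET as UTC+1, EST as UTC-5), which hold in early February; the year and the exact hour of the spike are placeholders, since the text gives only "early hours of 3 February CET."

```python
from datetime import datetime, timedelta, timezone

# Fixed winter offsets; no daylight saving time applies in early February.
CET = timezone(timedelta(hours=1), "CET")
EST = timezone(timedelta(hours=-5), "EST")


def cet_to_est(moment_cet):
    """Convert a timezone-aware CET datetime to EST."""
    return moment_cet.astimezone(EST)


# Placeholder year and hour (assumptions): 02:00 on 3 February, CET.
spike_cet = datetime(2000, 2, 3, 2, 0, tzinfo=CET)
spike_est = cet_to_est(spike_cet)
```

With a six-hour difference between the two zones, 02:00 on 3 February CET falls at 20:00 on 2 February EST, which matches the evening broadcast described in the text.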
Then, the researcher creates timeline plots by dragging-and-dropping the observer items onto the

Timeline view: four observers for target
Identifying the document of interest
The resulting URI set comprises 3424 document links. While the researcher could explore this dataset manually, it would take a significant amount of time to achieve aim A2. At this point, the researcher decides to build the aggregation charts for the current document set and to investigate the charts organized by observer. For this, the text document data are fetched from respective web servers and processed by uVSAT.
The aggregation chart for

The aggregation chart for
The loaded document of interest (depicted in Figure 13) is a blog post 87 with a heated discussion in commentaries. To concentrate on the analysis of

Document view for a selected document with majority of stance markers filtered out. Besides the
Identifying the markers of anger
The aggregation charts for the current document set can be organized by stance marker instead of observer. The researcher selects this option and explores the resulting set of 605 aggregation charts (one per unique stance marker type). Since the charts are ordered by the number of marker occurrences in descending order, the researcher quickly identifies several most frequent markers of
Stance markers of
The most frequently used stance markers of
Final document analysis
After identifying the most frequent markers of
Stance markers of
The number of occurrences and ranks of the previously identified stance markers of
The researcher reviews the current document overview once more (cf. Figure 14) and concludes that the identified markers are also distributed throughout this document. As the observer

The overview for the previously selected document with only five marker types of
Summary
By using uVSAT, the researcher has been able to achieve his or her analysis aims, that is, exploring the data related to the case, analyzing the stance-related phenomena of
On a final note, the linguist began with one specific study area. After using uVSAT, the researcher concluded that the data have also revealed three other possible areas of interest: (1) directionality and frequency of the
Expert reviews and discussion
In this section, we present the results of two domain expert reviews as well as performance issues. Based on these findings, we discuss some lessons learned during the development and testing phase of uVSAT.
Domain expert reviews
For the time being, our research partners at Lund University have been the primary users of uVSAT. They are familiar with standard tools for corpus analysis (e.g. AntConc, BYU-BNC, WORDSMITH, or Google Ngram Viewer) and manual text analysis. As a kind of project preparation, we introduced basic visualization concepts and techniques to them at the beginning of our collaboration. Their suggestions and feedback during the design and development stage of uVSAT are summarized in the following with regard to general analysis workflow, visualization and interaction techniques, and possible improvements for the tool.
General analysis workflow
The experts have been very enthusiastic about the opportunity to analyze a large number of online social media documents in detail with regard to stance and sentiment in an interactive way. They have noted that their usual tools of choice in most cases require text preprocessing and employ static or rarely updated corpora, as opposed to our approach:
The uVSAT tool can accommodate the time factor and help the analyst sift through large amounts of data where important chunks could easily be overlooked. Using the uVSAT tool, which is visually driven to reveal patterns, the researcher can track these and follow how language is being shaped by current digital communications.
The experts have also appreciated the fact that uVSAT is implemented as a web application which does not require a specific OS or installation or update procedures.
Interactive visualization approach
The feedback on the design of both timeline and document views has been positive. The experts have approved of the features facilitating the time-series analysis; in particular, they have liked that ROI highlighting is turned on by default. The experts have commended the usage of color coding to highlight the ROIs as well as the markers or terms. They have also approved of our decision to convert HTML documents into plain text in order to concentrate on the text content in the document view tabs. The experts have also been very positive about the aggregation charts as a means of overview, pattern detection, and navigation:
Aggregation charts give extremely comprehensive views that are easily understood by this user. These images result in giving the researcher a direct visual confirmation of the number of markers, which then can be scrolled through, chosen and loaded.
The ability to export stance markers as well as the content for further manual investigation was also commented on:
This gives the user a pro-active involvement in the ongoing improvement of the tool that is neither confusing nor time-consuming.
Possible improvements
One of the experts’ suggestions during the development was related to the comparison of several timeline plots. We have addressed it by providing the ability to control the layout of the timeline view and to disable the automatic vertical scaling, which allows the user to compare plots placed side by side. The feedback also included some complaints related to the tool performance (see the next subsection) as well as a wish for additional functionality related to document set overview (e.g. clustering the documents in aggregation charts by the URL domain). We have also learned that the trend analysis feature is only rarely used since it currently focuses on already-available time-series data. We are therefore planning to extend this feature with predictive trend analysis to increase its utility.
Summary
The experts have stated that uVSAT is a useful addition to their arsenal of stance analysis techniques. They are using it to explore and analyze the social media data, and they complement it with manual stance analysis as well as with processing of the exported data in other software tools, for example, for concordance analysis. They have also started to collect the ML training dataset, thus achieving the general design goals. In general, the domain experts have concluded the following:
For a linguist, uVSAT is a viable tool for working with stance analysis.
Performance and scalability
In this subsection, we discuss certain aspects that affect the user experience when trying to apply uVSAT for the analysis of rather large datasets: data transmission delays, data processing delays, and user interface responsiveness.
We currently store neither time-series data nor document text data on our visualization server. Hence, uVSAT issues requests for time-series data, URIs, and HTML content to external servers on demand. This leads to delays while retrieving the source data. Additional delays occur while transmitting the data between the front-end and back-end components and, finally, while processing the data on the server side.
We address the networking delay by conducting some types of analyses (such as ROI highlighting or trend computations) on the client side. It currently seems, however, that the performance bottleneck is the step of fetching the HTML content from numerous external servers, which may vary in connection speed, performance, access frequency limitations, and even availability. We plan to introduce a local database for caching the external data (as well as some processing results), although it can lead to validity concerns (see subsection “Lessons learned”).
As for the UI responsiveness: D3 and Rickshaw use SVG for rendering, which may require significant computational resources (and lead to UI lags). On a 2013 MacBook Pro computer with an Intel Core i7 processor (2.3 GHz), noticeable UI delays start to occur when re-rendering plots with a total of about 3000 points. This is partially addressed by a workflow in which the user first analyzes the time-series overview and then focuses on selected time intervals.
Lessons learned
Our current visualization approach involves multiple coordinated views based on standard representations. Its main advantage (as opposed to a more complex integrated view) is the ease of user adoption: the primary users of our tool are researchers in linguistics who do not tolerate abundant details or unintuitive visual representations. The corresponding disadvantage, however, is the need for a large display area to lay out all the views in sufficient size. We plan to address this issue in the future by developing novel visual representations for stance-related and time-dependent text data, having the domain particularities in mind.
The fact that our source data originate in online social media also has certain consequences: the text documents may be edited or deleted at any time. This presents us with a trade-off between data validity and performance. By fetching online data on the user’s demand (as uVSAT currently does), every document is analyzed in its up-to-date state (or it is marked as unavailable), but this requires computational resources and incurs inevitable networking delays. Conversely, if the data are cached and the original data are later modified, the detailed analysis of document contents would be invalidated. To address this issue, we plan to involve uncertainty tackling techniques. Another possibility would involve storing the versioned source documents. In practice, this would require significant resources, but in theory, it could provide an analysis opportunity with regard to an additional temporal dimension.
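One way to make the validity and performance trade-off explicit is to store, with every cached document, the time it was fetched and a digest of its content. The sketch below is not the caching design planned for uVSAT, which the article does not describe; the class `DocumentCache`, its methods, and the `max_age_seconds` policy are hypothetical.

```python
import hashlib
import time


class DocumentCache:
    """In-memory cache that reports whether an entry may be outdated."""

    def __init__(self, max_age_seconds):
        self.max_age_seconds = max_age_seconds
        self._entries = {}

    @staticmethod
    def _digest(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def store(self, uri, text, now=None):
        now = time.time() if now is None else now
        self._entries[uri] = {
            "text": text,
            "digest": self._digest(text),
            "fetched_at": now,
        }

    def lookup(self, uri, now=None):
        """Return (text, is_fresh); (None, False) if the URI is not cached."""
        now = time.time() if now is None else now
        entry = self._entries.get(uri)
        if entry is None:
            return None, False
        is_fresh = (now - entry["fetched_at"]) <= self.max_age_seconds
        return entry["text"], is_fresh

    def has_changed(self, uri, new_text):
        """Compare newly fetched text against the cached digest."""
        entry = self._entries.get(uri)
        return entry is None or entry["digest"] != self._digest(new_text)


cache = DocumentCache(max_age_seconds=3600)
cache.store("https://example.org/post", "original text", now=0.0)
cached_text, is_fresh = cache.lookup("https://example.org/post", now=7200.0)
changed = cache.has_changed("https://example.org/post", "edited text")
```

A stale flag of this kind could be surfaced to the user as a form of uncertainty, so that cached documents remain usable while their possible invalidity stays visible.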
Conclusion and future work
In this article, we have introduced the problem of stance analysis of online social media texts that requires a joint multidisciplinary effort of researchers in linguistics, NLP, and VA. We have described an approach to stance analysis based on sentiment or certainty considerations and presented our tool uVSAT for visual stance analysis that supports the interactive exploration of time-series data associated with online social media documents, including the text content of such documents. While uVSAT does not provide completely automatic stance analysis, it supports linguists by complementing manual stance analysis of text documents based on close reading with a VA approach that allows the researchers to use massive datasets originating from social media.
The contributions of this article include the description of a VA tool that contains multiple approaches for analyzing temporal and textual data as well as exporting stance markers in order to prepare a stance-oriented training dataset. We also presented special visualization techniques developed for our tool: the history diagram (for document set query analysis provenance) and the aggregation charts (for document set overview, navigation, and comparison).
We already used uVSAT for the purposes of the StaViCTA project, and we provided feedback from the linguistics experts in this article. Using uVSAT, our researchers in linguistics have been able to collect stance markers that are now being used to define stance categories other than sentiment and certainty or uncertainty (e.g. concessions and judgment). The tool is currently being used for collecting documents that form the training dataset for our researchers in NLP as well as for actual stance analysis conducted by the linguists. We are convinced that our tool will be useful for other interested researchers.
Future work includes additional overview and navigation techniques for document sets, support for local database caching, streaming data, uncertainty tackling (with regard to missing time-series data as well as unavailable web documents), and arbitrary time-series data sources. In order to provide our tool to others, we will develop our own (more lightweight) analysis engine to become independent from Gavagai. We also plan to conduct a larger study to evaluate the effectiveness of individual techniques such as the history diagram and aggregation charts.
