Introduction
The objective of the 2011 i2b2 Challenge in Natural Language Processing for Clinical Data, Track 2 on Sentiment Classification, 1 was to evaluate the performance of natural language processing systems in classifying sentences in suicide notes using a scheme of 15 topics, mostly emotions. The organizers provided the classification scheme and a manually annotated training dataset of 600 suicide notes, in which each sentence was associated with zero or more topics from the scheme. We assembled a set of lexicons and pattern-matching rules to support feature extraction, complementing a set of public resources used for the same purpose, including WordNet 2 and four emotive lexicons. 3–6 The features extracted from the training dataset were used to train a naïve Bayes classifier. The machine learning module of the system was integrated with a rule-based component, which used a set of pattern-matching rules to annotate sentences with the associated topics.
System Overview
Figure 1 illustrates the conceptual architecture of the proposed topic classification approach, which consists of five basic steps: linguistic pre-processing, feature extraction based on lexicon and pattern matching, dimensionality reduction based on principal component analysis, naïve Bayes classification and rule-based classification based on pattern matching. Input documents were supplied as plain ASCII text files with one sentence per line and tokens separated by blank spaces. Each sentence was linguistically pre-processed, including part-of-speech (POS) tagging, 7 lemmatization 8 and spelling correction. 9 The results of such processing were stored in a relational database 10 for easy access and further semantic reasoning. Prior to its distribution, each sentence in the training set had been manually annotated with the associated topics. These annotations were stored together with the text data to be used later to train a machine learning classifier. Lexico-semantic properties of individual tokens, as well as phrases matching pre-specified lexico-syntactic patterns, were aggregated for each sentence in order to map it to its feature vector. All feature vectors were exported into an ARFF (Attribute-Relation File Format) file for use with the Weka machine learning software. 11

Figure 1. Conceptual architecture of the proposed topic classification approach.
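To make the data flow concrete, here is a minimal Java sketch of the final export step; the relation and attribute names are hypothetical placeholders, and the actual export covered the full feature set described later.

import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;

// Minimal ARFF export sketch: writes numeric feature vectors with a
// nominal class attribute, following the standard ARFF layout that
// Weka expects. Feature names here are hypothetical placeholders.
public class ArffExporter {
    public static void export(String path, List<String> featureNames,
                              List<double[]> vectors, List<String> topics,
                              List<String> topicLabels) throws IOException {
        try (PrintWriter out = new PrintWriter(path)) {
            out.println("@RELATION suicide_notes");
            for (String name : featureNames) {
                out.println("@ATTRIBUTE " + name + " NUMERIC");
            }
            // The class attribute enumerates all topic labels.
            out.println("@ATTRIBUTE topic {" + String.join(",", topicLabels) + "}");
            out.println("@DATA");
            for (int i = 0; i < vectors.size(); i++) {
                StringBuilder row = new StringBuilder();
                for (double v : vectors.get(i)) {
                    row.append(v).append(",");
                }
                row.append(topics.get(i));
                out.println(row);
            }
        }
    }
}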
Using Weka, the training ARFF file with feature vectors mapped to topics first underwent principal component analysis (PCA), which transforms the original, possibly correlated features into a set of uncorrelated variables (called principal components) that account for most of the variance in the training data. 12 As a result, the dimensionality of the data space was reduced from 353 original features to 245 principal components. The transformed data were again saved as an ARFF file and used to train a naïve Bayes classifier. In particular, a discriminative multinomial naïve Bayes classifier was chosen over the other varieties offered in Weka, as it provided the best results during cross-validation. The naïve Bayes model produced was used in the testing phase to classify sentences from the test dataset. Each test sentence was classified with the single topic whose probability (according to the given model) was the highest and above a set threshold. Different threshold values were used to produce the three test runs submitted, with lower threshold values used to increase the recall and, conversely, higher threshold values used to increase the precision. A single most probable class was suggested in an attempt to maintain the precision of the method.
Additional classifications were produced using a set of pattern-matching rules extracted manually from the training data. The key criterion in defining these rules was to match multiple true positives while minimizing the number of false positives. The key role of these rules was to improve the recall following the naïve Bayes classification without compromising the precision.
Lexico-Semantic Resources
The majority of features we used combined lexico-semantic properties of individual tokens. These properties were obtained by mapping the tokens to a set of public lexico-semantic resources as well as to the internally assembled lexicons. The most general lexical resource used was WordNet, 2 a large lexical database in which content words (ie, nouns, verbs, adjectives and adverbs) are grouped into synsets (ie, sets of synonyms), which are mapped to 45 lexical domains and interlinked with 13 sense relations.
The rest of the public resources used were all emotive lexicons, in which words (or phrases) are associated with their positive/negative polarity or mapped to specific emotions. WordNet-Affect 3 maps >900 WordNet synsets to affective concepts they explicitly express. The emotional concepts represented in WordNet-Affect are organized into a hierarchy of 306 nodes, including different polarities represented by positive, negative and neutral emotion classifications within the hierarchy. Similarly, SentiWordNet 4 maps >200K WordNet synsets to three numerical scores, describing how objective, positive and negative the meaning associated with a synset is. The remaining two emotive lexicons used in our approach simply classify words or multi-word expressions as positive or negative, without any further structural or quantitative properties attached to them. We used a lexicon of >6K positive and negative opinion or sentiment words. 5 Similarly, the Macquarie Semantic Orientation Lexicon (MSOL) 6 contains >70K expressions that explicitly convey positive or negative meaning or are implicitly associated with such meaning based on the polarity of words found in their thesaurus paragraph.
Apart from emotive lexicons, which are used to identify explicit mentions of certain emotions 3 or to determine the positive/negative polarity of the expressions used, 4–6 other types of information may be useful to classify an emotion implied by a specific sentence. We therefore assembled our own lexicons to model a number of relevant semantic categories (see Table 1).
Table 1. Semantic categories with examples taken from the corresponding lexicons.
We created no explicit links between these semantic categories and the topics from the classification scheme. Instead, any correlations present in the training data were automatically identified as part of the naïve Bayes model created in the training phase. Apart from lexically modeling specific semantic categories, we created 15 further lexicons, one for each topic from the classification scheme, containing useful lexical clues for the given topic, ie, words of various semantic types that are all related to a specific topic (see Table 2). These words were chosen based on human judgment of their relevance.
Table 2. Topics and examples of their lexical clues.
In addition to the manually curated lexicons described previously, whose words did not necessarily originate from the training dataset, we created a lexicon of potentially useful words based on the mutual information between them and the classes considered. Mutual information (MI) is a measure of the mutual dependence between two variables. Given a word $w$ and a class $c$, it is calculated as follows:

$$MI(w, c) = \log \frac{P(w, c)}{P(w)\, P(c)}$$

The class-specific MI values were aggregated as $MI_{max}(w) = \max_{c} MI(w, c)$, and the words with the highest aggregated values were retained.
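As an illustration, the following minimal Java sketch computes MI and its MImax aggregation from sentence-level counts; the maximum-likelihood probability estimates shown here are an assumption, as the exact estimation details are not specified by the formula above.

import java.util.List;
import java.util.Map;

// Minimal sketch of MI-based word selection. Probabilities are estimated
// from sentence-level co-occurrence counts by maximum likelihood; the
// smoothing used in the actual system may differ.
public class MutualInformation {

    // MI(w, c) = log( P(w, c) / (P(w) * P(c)) )
    static double mi(int countWC, int countW, int countC, int total) {
        double pwc = (double) countWC / total;
        double pw = (double) countW / total;
        double pc = (double) countC / total;
        if (pwc == 0.0) return Double.NEGATIVE_INFINITY; // word never seen with class
        return Math.log(pwc / (pw * pc));
    }

    // MImax(w) = max over classes c of MI(w, c)
    static double miMax(String word, List<String> classes,
                        Map<String, Integer> wordCounts,
                        Map<String, Integer> classCounts,
                        Map<String, Integer> jointCounts, // key: word + "|" + class
                        int totalSentences) {
        double best = Double.NEGATIVE_INFINITY;
        for (String c : classes) {
            int joint = jointCounts.getOrDefault(word + "|" + c, 0);
            best = Math.max(best, mi(joint, wordCounts.get(word),
                                     classCounts.get(c), totalSentences));
        }
        return best;
    }
}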
Finally, similarly to lexical clues linked to specific topics, we explored pattern-matching rules implemented as regular expressions in Java. To facilitate rule extraction, we used lexical clues with a simple concordance program 16 to analyze false negatives and add new rules iteratively. All rules were defined and selected manually by observing the ratio between true positives and false positives retrieved. Table 3 shows a sample of the pattern-matching rules associated with topics from the classification scheme.
Table 3. Topics and examples of pattern-matching rules.
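As the rules themselves are only exemplified in Table 3, the following sketch uses a hypothetical rule for the thankfulness topic to show how a rule implemented as a Java regular expression, as in our system, can annotate a matching sentence.

import java.util.regex.Pattern;

// Sketch of a single pattern-matching rule. The regular expression shown
// here is a hypothetical example, not one of the actual rules in Table 3.
public class TopicRule {
    private final String topic;
    private final Pattern pattern;

    public TopicRule(String topic, String regex) {
        this.topic = topic;
        this.pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    // Returns the associated topic if the sentence matches the rule.
    public String classify(String sentence) {
        return pattern.matcher(sentence).find() ? topic : null;
    }

    public static void main(String[] args) {
        TopicRule rule = new TopicRule("thankfulness", "\\bthank(s| you)\\b");
        System.out.println(rule.classify("Thank you for everything you did."));
    }
}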
Feature Selection
The given topic classification task is a supervised learning problem, where each sentence is assigned to one or more predefined topics. The first problem associated with this task, or with text categorization in general, is the high dimensionality of feature space, since most often the features chosen are the frequencies of individual words. Such representation suffers from the curse of dimensionality (or Hughes effect), 17 where given a fixed size of the training dataset, the predictive power of a machine learning algorithm reduces as the dimensionality increases. In order to reduce the number of features we applied two strategies: generalization and mutual information.
As part of generalization, individual words were mapped to different types of categories. At the most general level, words were mapped to their POS, ie, the syntactic information attached to words during linguistic pre-processing. A total of 21 POS classes were used. While this certainly overgeneralizes individual words, some types of POS information do provide useful clues for topic classification purposes. For example, tokens containing numerical information, and thus tagged as CD (ie, cardinal number), vary widely in terms of their content, which itself does not provide information that can improve the classification. However, their general type alone (ie, CD) provides a useful classification clue, as it was mainly associated with the information topic, eg:
All my things are at <CD>
Can be reached by phone State–<CD>
$ <CD>
I am paid up there till <CD>
My Social Security number is <CD>
At a slightly lower level, we generalized individual words into their lexical domains. Namely, WordNet synsets are organized into 45 lexical domains based on syntactic category and logical groupings. 18 Here we provide some examples of sentences in which words from the possession domain were replaced by the tag <possession>:

<possession>
I give my sister power of attorney to <possession>
Don't <possession>
Don't <possession>
I also told Mr. J. Johnson not to <possession>
Where we needed more fine-grained groupings of words not provided by WordNet, we assembled our own lexicons to support lexical representation of relevant semantic categories (see Table 1), in addition to lexicons of words related to the topics considered (see Table 2), eg:
That <health>
Have him notify my <occupation>
May <religious>
<instruction>
You have been <anger>
<love>
Having generalized individual words in the ways described above, we counted the frequencies of their general categories and used these counts as features, instead of counting the frequencies of individual words. This was done for all generic feature types, apart from the occupation type, where we differentiated between the individual words, as they were shown to be associated with different topics, eg:
Fear: I know I should see a <occupation>
Instructions: Have him notify my <occupation>
Anger: They are gang of <occupation>
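A minimal sketch of this counting step, assuming a lexicon lookup that maps a lemma to its general category; the map contents and category names are illustrative.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of frequency counting over generalized categories. The lexicon
// map (lemma -> category, eg, "doctor" -> "occupation") is assumed to be
// loaded from the internally assembled lexicons.
public class CategoryFeatures {
    static Map<String, Integer> countCategories(List<String> lemmas,
                                                Map<String, String> lexicon) {
        Map<String, Integer> counts = new HashMap<>();
        for (String lemma : lemmas) {
            String category = lexicon.get(lemma);
            if (category == null) continue; // lemma not covered by any lexicon
            // Occupation words are kept as individual features; all other
            // categories are counted in aggregate.
            String feature = category.equals("occupation") ? lemma : category;
            counts.merge(feature, 1, Integer::sum);
        }
        return counts;
    }
}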
To address a potential loss of important information due to overgeneralization, we also used as features those individual words selected using mutual information, 15 an additional strategy we applied to reduce the dimensionality of the feature space. Mutual information, one of the most effective feature selection mechanisms, was used to reduce the number of words considered by identifying the most informative ones. In this case, individual words were still used as features and their frequencies were counted. This reduced the number of words considered from 4,506 to the 153 most informative ones.
In order to identify explicit mentions of a range of different emotions, we used the WordNet-Affect 3 lexicon described in the previous section, eg,
I've tasted the last bitter dregs of <despair>
Each word found in the lexicon was mapped to the associated emotion, ie, the emotion most often expressed by the given word. Where a word was mapped to multiple emotions, each of the associated emotions was counted.
To identify the emotional tone of a sentence (ie, its positive or negative polarity), we used SentiWordNet, 4 which maps words to their positive and negative scores. The polarity of a sentence was calculated by aggregating the polarity scores of individual words in three ways: finding the maximum score, summing the scores and averaging them. We used all of the aggregated scores as features, again leaving it to the machine learning algorithm to decide on the most useful features in terms of classification performance. Two other emotive lexicons were used, which simply classify words as being positive or negative. 5,6 The positive or negative polarity of a sentence was quantified as the percentage of positive and negative words found in the sentence. We also counted the occurrences of negation words (eg, no, not, never), as they may invert the polarity of a sentence.
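The three aggregations can be sketched as follows; each word score is assumed to be the word's positive score minus its negative score, with the exact score derivation left to the lexicon lookup.

import java.util.List;

// Sketch of the three aggregated polarity features derived from
// SentiWordNet-style word scores.
public class PolarityFeatures {
    static double[] aggregate(List<Double> wordScores) {
        if (wordScores.isEmpty()) return new double[] { 0.0, 0.0, 0.0 };
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0.0;
        for (double s : wordScores) {
            max = Math.max(max, s);
            sum += s;
        }
        // Three features per sentence: maximum, sum and average polarity.
        return new double[] { max, sum, sum / wordScores.size() };
    }
}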
So far, we have used the bag-of-words model, in which each sentence is represented as an unordered collection of individual words normalized to their lemmas, ignoring the relationships between them. However, such information, eg, represented through the use of bigrams, may substantially improve the quality of features, thus increasing the overall classification performance. 19 Still, matching longer phrases may lead to a decrease in performance due to high dimensionality and low frequency. 20 Therefore, it is essential to optimize the choice of more complex features. Instead of matching exact phrases, we opted for regular expressions as a more flexible way of representing relationships between individual words. Such flexibility results in both lower dimensionality and higher frequency of the features, thus avoiding the degradation in performance associated with the introduction of longer phrases into the feature space. We further addressed the dimensionality and frequency issues by conflating the rules: we associated them with specific topics and aggregated their frequencies. A separate feature was introduced for each topic, and its value was calculated as the number of regular expressions matched from the corresponding set. Table 3 provides examples of the regular expressions used to introduce more complex features on top of those based on individual words.
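A minimal sketch of the conflation step: one numeric feature per topic, valued as the number of that topic's regular expressions matching the sentence. The rule sets themselves are assumed to be given (cf. Table 3), and the PATTERN prefix mirrors the feature naming used in the principal components listed later.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Sketch of conflated pattern features: one numeric feature per topic,
// whose value is the number of that topic's regular expressions matching
// the sentence.
public class PatternFeatures {
    static Map<String, Integer> patternCounts(String sentence,
                                              Map<String, List<Pattern>> rulesByTopic) {
        Map<String, Integer> features = new HashMap<>();
        for (Map.Entry<String, List<Pattern>> entry : rulesByTopic.entrySet()) {
            int matches = 0;
            for (Pattern p : entry.getValue()) {
                if (p.matcher(sentence).find()) matches++;
            }
            features.put("PATTERN." + entry.getKey(), matches);
        }
        return features;
    }
}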
To summarize, each sentence was mapped to its feature vector consisting of different feature types described in Table 4.
Table 4. Features used to represent a sentence.
Principal Component Analysis
Compared to a brute-force bag-of-words approach, we managed to significantly reduce the number of features from over 4,500 to only 353. The selected features combined a set of specific words chosen using mutual information with more general lexico-semantic properties of individual words. This addressed the high dimensionality and data sparsity issues. The next problem to be resolved was to identify dependencies between the selected features, as such dependencies may decrease the performance of a naïve Bayes classifier, whose underlying probability model assumes independence between the features. Therefore, we applied principal component analysis (PCA), 12 a mathematical procedure that transforms a complex feature space into the simplified structure that underlies it, by identifying the underlying, uncorrelated variables (called principal components) that account for most of the variation in the training data. We used Weka, 11 a popular suite of machine learning software, to perform PCA and retain enough principal components to account for 95% of the variance in the training data. As a result, the dimensionality of the data space was reduced from 353 original features to 245 principal components. Here are the five top-ranked principal components: a
a Features written in capital letters only refer to POS tags used as features, eg, VB, NN and CD refer to verb, noun and cardinal number respectively. Qualifiers written in capital letters refer to the feature type, eg, VERB and NOUN refer to lexical domains of verbs and nouns in WordNet, WNA and SWN refer to the emotive lexicons WordNet-Affect and SentiWordNet, whereas CLUE, PATTERN and MI correspond to features based on lexical clues, pattern matching and mutual information.
−0.17 VB − 0.169 SWN.possum − 0.16 SWN.negsum − 0.158 VERB.social − 0.157 VERB.stative …
−0.199 CLUE.information − 0.184 NN − 0.179 NOUN.artifact − 0.174 CD − 0.172 NOUN.communication …
−0.215 CLUE.forgiveness − 0.208 WNA.devotion − 0.205 WNA.affection − 0.189 CLUE.guilt − 0.189 NOUN.person …
−0.299 WNA.distress − 0.299 WNA.anxiety − 0.28 WNA.joy − 0.24 CLUE.happiness_peacefulness − 0.235 PATTERN.hopefulness …
0.261 WNA.defeatism + 0.259 MI.love + 0.232 CLUE.love − 0.227 WNA.devotion − 0.225 WNA.affection …
Obviously, PCA helped identify new meaningful underlying variables. It is interesting to notice how different types of features were grouped together, reflecting their related meanings. For instance, in the second principal component, the manually identified lexical clues related to the information topic (CLUE.information) were grouped together with the NN and CD tags and the artifact and communication lexical domains, all of which tend to signal factual content.
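For reference, this is how the PCA step could be configured using the Weka API; the file name is a placeholder, and the method names follow the public Weka releases, which may differ slightly from the exact version we used.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.PrincipalComponents;

// Sketch of the PCA step using the Weka API.
public class PcaStep {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("training.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // topic is the last attribute

        PrincipalComponents pca = new PrincipalComponents();
        pca.setVarianceCovered(0.95); // retain components covering 95% of variance
        pca.setInputFormat(data);

        Instances transformed = Filter.useFilter(data, pca);
        System.out.println("Components retained: " + (transformed.numAttributes() - 1));
    }
}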
Naïve-Bayesian Classification
Having performed PCA to minimize feature dependence, we applied a naïve Bayes classifier, 21 a probabilistic learning method based on Bayes' theorem, which combines the evidence $e$ observed in the data with the prior probability of a hypothesis $h$:

$$P(h \mid e) = \frac{P(e \mid h)\, P(h)}{P(e)}$$
When multiple hypotheses are considered, Bayes' theorem provides the probability of each hypothesis being true given the evidence. During the training step, a naïve Bayes classifier uses the training data to estimate the parameters of a probability distribution (the right-hand side of the equation). During the testing phase, it uses these estimates to compute the posterior probability (the left-hand side of the equation) for a given test sample. Each test sample is then classified using the hypothesis with the largest posterior probability.
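To make the two steps concrete, here is a toy illustration of the classification step under the naïve independence assumption; the class labels, priors and per-feature likelihoods are hypothetical numbers, not values estimated from our data.

// Toy illustration of naive Bayes classification: the posterior for each
// class is proportional to the prior times the product of per-feature
// likelihoods. All numbers here are hypothetical.
public class NaiveBayesToy {
    public static void main(String[] args) {
        String[] classes = { "love", "guilt" };
        double[] priors = { 0.6, 0.4 };
        // likelihoods[c][f]: probability of feature f given class c
        double[][] likelihoods = { { 0.7, 0.2 }, { 0.3, 0.5 } };

        double[] scores = new double[classes.length];
        double total = 0.0;
        for (int c = 0; c < classes.length; c++) {
            scores[c] = priors[c];
            for (double l : likelihoods[c]) scores[c] *= l; // independence assumption
            total += scores[c];
        }
        for (int c = 0; c < classes.length; c++) {
            // Normalizing by the total implements the division by P(e).
            System.out.printf("P(%s | e) = %.3f%n", classes[c], scores[c] / total);
        }
    }
}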
We again used Weka 11 to train a naïve Bayes classifier on the PCA-transformed dataset. In particular, a discriminative multinomial naïve Bayes classifier was chosen over the other varieties offered in Weka, as it provided the best results during cross-validation. To classify sentences in the previously unseen test dataset, we applied the naïve Bayes model produced in the training phase to obtain the posterior probability for each class. Each sentence was classified with the single topic having the largest posterior probability. Given the heterogeneous nature of sentences not labeled with any topic, and therefore the difficulty of correctly predicting such sentences, a classification was accepted only when its posterior probability exceeded a set threshold; otherwise, the sentence was left unlabeled.
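A minimal sketch of this acceptance rule, assuming a trained Weka classifier; the topics array and threshold value (eg, 0.30, as used in Run 2) are supplied by the caller.

import weka.classifiers.Classifier;
import weka.core.Instance;

// Sketch of threshold-based acceptance of the single most probable topic.
// distributionForInstance is part of the Weka Classifier API.
public class ThresholdClassifier {
    static String classify(Classifier model, Instance sentence,
                           String[] topics, double threshold) throws Exception {
        double[] posterior = model.distributionForInstance(sentence);
        int best = 0;
        for (int c = 1; c < posterior.length; c++) {
            if (posterior[c] > posterior[best]) best = c;
        }
        // Reject less probable predictions: leave the sentence unlabeled.
        return posterior[best] > threshold ? topics[best] : null;
    }
}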
Rule-Based Classification
The given task allowed multiple topics to be associated with a single sentence. To support multiple classifications, we could have accepted more than one highly probable class suggested by the naïve Bayes classifier. However, given the relatively small size of the training dataset, we believed that accepting classes with lower probabilities would significantly reduce the precision of the method, and consequently the overall classification performance as estimated by the F-measure. Instead, we chose to apply rule-based classification on top of the predictions suggested by the naïve Bayes classifier. We relied on a subset of the pattern-matching rules illustrated in Table 3. By using only those rules that exhibited high precision on the training data, we aimed to maximize the precision. On the other hand, by supplementing the existing predictions with other correct classifications, we aimed to increase the recall. By increasing the recall while at least maintaining the precision, we aimed to improve the overall classification performance.
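A minimal sketch of how the two stages could be combined, assuming the naïve Bayes prediction (possibly withheld by the threshold) and the high-precision rule subset are given.

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Sketch of combining the two classification stages: the naive Bayes
// prediction (if any) is supplemented with topics fired by high-precision
// rules.
public class CombinedClassifier {
    static Set<String> classify(String sentence, String nbTopic,
                                Map<String, List<Pattern>> highPrecisionRules) {
        Set<String> topics = new LinkedHashSet<>();
        if (nbTopic != null) topics.add(nbTopic); // may be null below threshold
        for (Map.Entry<String, List<Pattern>> e : highPrecisionRules.entrySet()) {
            for (Pattern p : e.getValue()) {
                if (p.matcher(sentence).find()) {
                    topics.add(e.getKey()); // rule adds a further topic
                    break;
                }
            }
        }
        return topics;
    }
}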
Evaluation
We received two datasets: 600 notes to be used for training and an additional 300 notes to be used for testing. All sentences were annotated with the topics shown in Table 2, with multiple annotations allowed where appropriate. The training dataset consisted of a total of 4,315 distinct sentences, out of which 2,145 sentences were associated with a total of 2,494 annotations. Note that we removed duplicated sentences from the dataset. We also manually corrected some annotations in the training dataset to remove some types of noise, eg, cases where the same sentence was annotated inconsistently across the dataset.
We submitted three system runs with three threshold values used to reject less probable predictions suggested by the naïve Bayes classifier described earlier. The performance was evaluated using recall and precision, as well as their combination into the F-measure, whose values were micro-averaged. 1 Table 5 describes the results provided by the organizers for the three runs, where T, N, P, R and F denote the threshold, number of annotations, precision, recall and F-measure respectively. As expected, the lower the threshold, the better the recall and, conversely, the higher the threshold, the better the precision. The best overall performance was achieved in Run 2, with an F-measure of 53.34%. The results achieved by the 26 teams that took part in the challenge were summarized by the organizers as follows: mean 48.75, standard deviation 0.0742, minimum 29.67, maximum 61.39, and median 50.27.
Table 5. The three test run results.
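For completeness, here is a minimal sketch of the micro-averaged measures computed from per-class counts, as used in the evaluation below.

// Sketch of micro-averaged precision, recall and F-measure computed from
// per-class true positive (tp), false positive (fp) and false negative
// (fn) counts.
public class MicroAverage {
    static double[] prf(int[] tp, int[] fp, int[] fn) {
        int sumTp = 0, sumFp = 0, sumFn = 0;
        for (int c = 0; c < tp.length; c++) {
            sumTp += tp[c];
            sumFp += fp[c];
            sumFn += fn[c];
        }
        double p = (double) sumTp / (sumTp + sumFp);
        double r = (double) sumTp / (sumTp + sumFn);
        double f = 2 * p * r / (p + r);
        return new double[] { p, r, f };
    }
}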
To gain more insight into the results, we ran our own evaluation script to obtain the performance measures across all classes for Run 2. As before, Table 6 provides the values of the three evaluation measures, in addition to the numbers of true positives (TP), false positives (FP) and false negatives (FN), for both the training and testing datasets. The threshold used to obtain the results on the training data was also 0.30. Our evaluation script does not differentiate between documents, as we operated on sentences and removed the duplicate ones, which accounts for a slight difference between these values and the values provided by the organizers. As expected, poor results were achieved for classes with a small number of training examples.
Table 6. The evaluation results achieved during the training and testing phases.
Conclusion
The given topic classification task is a supervised learning problem, which can be approached with machine learning (ML) or a rule-based methodology. Given the subjective nature of the classification task and the variety of sentences in terms of their lexical content and syntactic structure, which would lead to a proliferation of suitable classification rules if such rules could be identified at all, we opted for ML as the primary methodology in our approach. Still, we incorporated pattern-matching rules as part of feature extraction, as well as for rule-based classification of a subset of sentences whose lexico-syntactic properties exhibited high correlation with the associated topics. Another motivation for favoring the ML approach was the provision of annotated data of a decent size to support the training of different ML algorithms. We conducted experiments with a range of algorithms supported by Weka, 11 a popular suite of ML software. We took advantage of the naïve Bayes classifier, as it does not necessarily require a lot of training data to perform well, 21 which was confirmed by our experiments. With the F-measure of 53% achieved, our system outperformed the majority of the competing systems, as indicated by the median F-measure of 50%. Given the large amount of noise noticed by manually inspecting the training data, we believe that the performance can be significantly improved with better quality and a larger size of the training data. More importantly, the core of the system is portable between different domains, with the changes restricted to the lexicons assembled internally to model relevant semantic types and the set of pattern-matching rules associated with the classification scheme.
Disclosures
Author(s) have provided signed confirmations to the publisher of their compliance with all applicable legal and ethical obligations in respect to declaration of conflicts of interest, funding, authorship and contributorship, and compliance with ethical requirements in respect to treatment of human and animal test subjects. If this article contains identifiable human subject(s) author(s) were required to supply signed patient consent prior to publication. Author(s) have confirmed that the published article is unique and not under consideration nor published by any other publication and that they have consent to reproduce any copyrighted material. The peer reviewers declared no conflicts of interest.
