Abstract
Introduction
Suicide is a major public health problem. In 2007, suicide was the tenth leading cause of death in the U.S., accounting for 34,598 deaths, with an overall rate of 11.3 suicide deaths per 100,000 people. 1 The suicide rate for men is four times that of women, and an estimated 11 suicide attempts occur for every suicide death. 1 Suicidal behavior is complex, with biological, psychological, social, and environmental risks and triggers. 2 Some risk factors vary with age, gender, or ethnic group and may occur in combinations or change over time. Risk factors for suicide include depression, prior suicide attempts, a family history of mental disorder or substance abuse, family history of suicide, firearms in the home, and incarceration. 2–8 Men and the elderly are more likely to have fatal attempts than are women and youth. 5
Suicide notes have long been studied as ways to understand the motives and thoughts of those who attempt or complete a suicide effort. 9
Given the impact of suicide and other mental health disorders, the broad goal of organizers from the 2011 Informatics for Integrating Biology and the Bedside (i2b2) Natural Language Processing (NLP) shared task (track two) was to develop methods to analyze subjective neuropsychiatric free-text. To further that goal, this challenge focused on sentiment analysis, predicting the presence or absence of 15 emotions in suicide notes. Our team explored multiple approaches combining regular expression-based rules, statistical text mining (STM), and an approach that applies weights to text while accounting for multiple labels. Overall, our best system achieved a micro-averaged
Background
Sentiment analysis is concerned with identifying emotions, opinions, evaluations, etc. within subjective material. 10 A sizable portion of research in sentiment analysis has focused on business-related tasks such as analyzing product and company reviews,11–14 which are typically coherent and well written. 15 These analyses commonly focus on the polarity of words to classify whether a review is positive or negative. Unfavorable reviews can then be examined to identify and address negative mentions of products or services through a customer support function.
Correctly determining sentiment can be difficult for a number of reasons. First, the polarity of a word from a lexicon may not match when taken in context. 16 For instance, the word “reasonable” in a lexicon is positive, but the word takes a negative meaning in the sentence “It's reasonable to assume the crowd was going to become violent”. Second, words may have multiple senses which change the meaning of a statement. For instance, the word “sad” can mean an experience of sorrow (eg, “I feel sad all the time”), but it can also indicate being in a bad situation (eg, “I'm in such a sad state”). Finally, multiple emotions, opinions, etc. may be contained in a single document, making interpretation at the document level more difficult. Thus, classification may be done at the word 17 or sentence 14 level of analysis, instead of the document 18 level of analysis.
Methods
The subsections below provide a description of the dataset, preprocessing done to the data, modeling techniques used, and finally how the techniques were combined together to create ensemble models.
Dataset
The entire dataset consisted of 900 suicide notes collected over a 70-year period (1940–2010) from people who committed suicide. Of these, 600 notes were made available for training, with the remaining 300 held out for testing submitted systems. All names, dates, and locations in the notes were changed. Everything else was typed as written, retaining all errors in spelling and grammar. The notes were split into sentences and tokenized.
A more in-depth description of the dataset is available in Pestian et al. 19
For the competition, each sentence was reviewed by three annotators and assigned zero or more labels representing emotions/concepts (eg, ABUSE, INFORMATION, LOVE). The sentence-level inter-annotator agreement for the training and test datasets was 0.546. In both datasets, roughly half the sentences were assigned a label, with relatively few of those having multiple labels.
Preprocessing
Both the training and test datasets were preprocessed before training or applying any models. A summary of the changes made to the data is provided below.
Contractions were separated at the apostrophe during the original tokenization process. Thus, the added white space was removed (eg,
A number of contractions used asterisks in place of apostrophes. To standardize, all asterisks were replaced with apostrophes.
A large number of misspellings were encountered while reading through the training notes. A two-step automated approach was used to help correct these errors. First, a custom dictionary was used to ignore and/or correct a small subset of words not present in the standard dictionary used in the second step. For instance, contractions without apostrophes (eg,
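The first two preprocessing steps (replacing asterisks with apostrophes and closing the white space left around apostrophes) can be sketched roughly as follows. This is only an illustration: the exact form of the tokenized contractions is not stated here, so the pattern assumes white space on either side of the apostrophe, and the list of contraction endings in `_SUFFIXES` is a hypothetical one chosen for the example.

```python
import re

# Hypothetical list of contraction endings; the paper does not enumerate them.
_SUFFIXES = r"(?:t|s|m|d|ll|re|ve)"

# A letter, optional white space, an apostrophe, optional white space,
# then a contraction ending that finishes the word.
_GAP = re.compile(r"(?<=[A-Za-z])\s*'\s*(" + _SUFFIXES + r")\b", re.IGNORECASE)


def normalize_contractions(sentence: str) -> str:
    """Standardize asterisks to apostrophes, then rejoin split contractions."""
    sentence = sentence.replace("*", "'")  # all asterisks become apostrophes
    return _GAP.sub(r"'\1", sentence)      # remove white space at the apostrophe


if __name__ == "__main__":
    print(normalize_contractions("I don ' t know"))
    print(normalize_contractions("I can*t go on"))
```

Restricting the match to known endings is a design choice for the sketch: it avoids gluing a closing quotation mark to the following word, at the cost of missing unusual contractions.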
Modeling
In the data, a small but not insubstantial number of sentences had more than one label assigned (302 sentences or 6.51% of all sentences). To allow the use of a wide array of machine learning algorithms and toolkits, the data were transformed from a multi-label to a single-label classification problem, where each label was converted into an independent single-label binary classification. The data were then formatted with each sentence as a row of data, along with the note ID, sentence number, and binary variables representing each of the 15 labels.
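The transformation just described can be illustrated with a minimal sketch. The row layout (note ID, sentence number, one binary variable per label) follows the description above; the field names, the example sentences, and the five-label subset in `LABELS` are assumptions made for the example (the task defined 15 labels).

```python
# Subset of label names mentioned in the text; the full task used 15 labels.
LABELS = ["ABUSE", "ANGER", "INFORMATION", "INSTRUCTIONS", "LOVE"]


def to_binary_rows(sentences):
    """Turn multi-label sentences into rows with one 0/1 column per label.

    `sentences` is a list of (note_id, sentence_no, text, set_of_labels).
    Each label column can then be modeled as its own binary problem.
    """
    rows = []
    for note_id, sentence_no, text, labels in sentences:
        row = {"note_id": note_id, "sentence_no": sentence_no, "text": text}
        for label in LABELS:
            row[label] = 1 if label in labels else 0
        rows.append(row)
    return rows


if __name__ == "__main__":
    example = [
        ("n1", 1, "I love you all .", {"LOVE"}),
        ("n1", 2, "Please call John .", {"INSTRUCTIONS", "INFORMATION"}),
        ("n1", 3, "It is raining .", set()),
    ]
    for row in to_binary_rows(example):
        print(row)
```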
The following subsections describe the three different modeling techniques used with the newly formatted dataset. The purpose of investigating multiple techniques was to create ensemble models of complementary methods. First, rules using regular expressions were created to find generalizable patterns, especially within labels with little data. Second, STM was used to discover more complex patterns of word usage and because classifiers based on machine learning generally perform better than rules on sentiment classification tasks. 13 Finally, a unique method of applying weighting schemes to text while accounting for multiple labels was investigated.
Rules
Rule-based systems have commonly been used for categorization of textual documents. 20 For this competition, rules were an attractive method due to the small sample size for many labels. Relying on machine learning algorithms alone for such labels would likely have resulted in unstable models. Thus, rules were used as a complementary method. The purpose of the rule-based system was to discover phrases (rules) that made intuitive sense, were generalizable to the test data, and limited false positives. The semi-automated process used to generate rules for each label is described below.
Sentences were categorized as either being
Each sentence in the
Any phrases found in the
The list of remaining phrases was then examined manually. Phrases without intuitive meaning for the label were discarded. For instance, the phrase “my oldest boy” was discarded for the ABUSE label, but “abusive behavior” was kept. Variations and expansions of the remaining phrases were created as necessary.
After the entire process, over 4,000 phrases/rules were retained (more than one rule may exist per sentence). Table 1 shows the breakdown of rules by label.
Table 1. Number of rules by label.
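As a rough illustration of how phrase rules of this kind might be applied, the sketch below compiles each phrase into a case-insensitive regular expression and assigns a label whenever any of its phrases matches. Only the phrase “abusive behavior” comes from the text; the LOVE phrase, the function names, and the matching details (word boundaries, flexible white space) are assumptions, not the rule format actually used.

```python
import re

# Hypothetical rule set; only "abusive behavior" is taken from the text.
RULES = {
    "ABUSE": ["abusive behavior"],
    "LOVE": ["i love you"],
}


def compile_rules(rules):
    """Compile each phrase into a regex that tolerates extra white space."""
    compiled = {}
    for label, phrases in rules.items():
        compiled[label] = [
            re.compile(
                r"\b" + r"\s+".join(re.escape(word) for word in phrase.split()) + r"\b",
                re.IGNORECASE,
            )
            for phrase in phrases
        ]
    return compiled


def apply_rules(sentence, compiled):
    """Return the set of labels whose rules match the sentence."""
    return {
        label
        for label, patterns in compiled.items()
        if any(pattern.search(sentence) for pattern in patterns)
    }


if __name__ == "__main__":
    compiled = compile_rules(RULES)
    print(apply_rules("His abusive  behavior never stopped .", compiled))
    print(apply_rules("Tell my oldest boy goodbye .", compiled))
```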
Statistical text mining
Although rules were created for each label, the patterns being matched in the rules were fairly simplistic and prone to overfitting (ie, looking for the exact same word usage). Therefore, STM was used as a complementary method in hopes of discovering more robust models with increased generalizability to the test set, especially among labels with larger sample sizes (eg, INSTRUCTIONS).
For the first step of the STM process, the data (ie, sentences) were transformed into a term-by-document matrix by converting all text to lowercase; tokenizing; removing stopwords and tokens with fewer than three characters; stemming; and finally removing terms that only occurred once in the data. The result was a term-by-document matrix with 1,895 terms and 4,633 documents (sentences).
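The sequence of steps above can be sketched in a few lines. This is a simplified illustration, not the pipeline actually used: the stopword list is a tiny hypothetical one, and `naive_stem` is a crude suffix stripper standing in for whatever stemmer was applied. The matrix is returned with one row per sentence and one column per term.

```python
import re
from collections import Counter

# Tiny hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "and", "you", "for", "that", "this", "with", "have"}


def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_term_document_matrix(sentences):
    """Lowercase, tokenize, filter, stem, and drop terms that occur only once."""
    docs = []
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        tokens = [t for t in tokens if len(t) >= 3 and t not in STOPWORDS]
        docs.append(Counter(naive_stem(t) for t in tokens))

    totals = Counter()
    for doc in docs:
        totals.update(doc)

    vocab = sorted(term for term, count in totals.items() if count > 1)
    matrix = [[doc[term] for term in vocab] for doc in docs]
    return vocab, matrix


if __name__ == "__main__":
    vocab, matrix = build_term_document_matrix(
        ["She loved the kids", "He loved his kids dearly", "Please forgive me"]
    )
    print(vocab)
    print(matrix)
```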
Next, models using three distinctly different machine learning algorithms were trained: Decision Trees (DTs), k-Nearest Neighbor (kNN), and Support Vector Machines (SVMs). Table 2 summarizes the parameters used with each algorithm. Greater detail on the process and parameters used is given in the list below.
Table 2. Statistical text mining modeling parameters.
Decision Trees –
The top
k-Nearest Neighbor –
Three factors were used in weighting the term-by-document matrix: (1) term frequency, (2) collection frequency, and (3) normalization factor. 23 Term frequency and cosine normalization were used for the first and third weighting factors, respectively. The same three term weighting formulas used in DT were used for the second weighting factor. Like in DT, the top
Support Vector Machines–
The same weighting procedure from kNN was used. In addition, Latent Semantic Analysis (LSA) 24 employing Singular Value Decomposition (SVD) was used as a dimension reduction technique. The top
Finally, the performance of each combination of parameters was compared using 10-fold stratified cross-validation, 26 where the weighting methods, selection of top
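The three-factor weighting described for kNN and SVM (term frequency, a collection frequency factor, and cosine normalization) can be sketched as below. The collection frequency formulas actually compared are not reproduced in this section, so the sketch substitutes inverse document frequency purely as an assumed example of such a factor; the function name is also illustrative.

```python
import math


def weight_matrix(matrix):
    """Weight a document-by-term count matrix and normalize each row.

    Factor 1: raw term frequency (the counts themselves).
    Factor 2: collection frequency; inverse document frequency is ASSUMED here.
    Factor 3: cosine normalization (each row scaled to unit length).
    """
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / df) if df else 0.0 for df in doc_freq]

    weighted = []
    for row in matrix:
        vec = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in vec))
        weighted.append([v / norm for v in vec] if norm else vec)
    return weighted


if __name__ == "__main__":
    for row in weight_matrix([[1, 0], [1, 2], [0, 0]]):
        print(row)
```

Rows with no remaining terms are left as zero vectors rather than normalized, which avoids a division by zero.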
Weights
In addition to STM, we also explored a method of applying weights to text while accounting for multiple labels. A total of four formulas based on chi-square 27 and a modified version of the Gini index 28 were used to generate weights. Equation 1 provides the formula for the modified version of the Gini index (
Weight formulas.
Table 4 summarizes the four formulas used to calculate weights along with a short description of their calculation. The formulas were used to create sets of features for input into data mining models (ie, for each formula used, a feature was created for each label). The set notation {} (used in Table 4 below) represents which groups of formulas were used to create features. For instance, {
Weight-based modeling parameters.
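For readers unfamiliar with chi-square as a term weight, the sketch below computes the standard chi-square statistic for one term and one label from a 2×2 contingency table of sentence counts. It illustrates the general statistic only; the four specific formulas used here, and how term weights were combined into per-label sentence features, are not reproduced, and the function name and data layout are assumptions.

```python
def chi_square(term, label, docs):
    """Chi-square association between a term and a label.

    `docs` is a list of (set_of_terms, set_of_labels), one entry per sentence.
    a: term and label, b: term without label,
    c: label without term, d: neither.
    """
    a = b = c = d = 0
    for terms, labels in docs:
        has_term, has_label = term in terms, label in labels
        if has_term and has_label:
            a += 1
        elif has_term:
            b += 1
        elif has_label:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


if __name__ == "__main__":
    docs = [
        ({"love"}, {"LOVE"}),
        ({"love"}, {"LOVE"}),
        ({"call"}, {"INSTRUCTIONS"}),
        ({"call"}, set()),
    ]
    print(chi_square("love", "LOVE", docs))
    print(chi_square("call", "INSTRUCTIONS", docs))
```

Note that the statistic is symmetric in direction: a term that never co-occurs with a label scores as highly as one that always does, so a sign or a separate check is needed if only positive associations are wanted.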
In addition to the weight-based measures of the text, features representing structural elements of the text were also included in all models. The structural features are described in more detail below.
The weight and structural features described above were calculated for all sentences using distinct terms (after removing stop words). Three different machine learning algorithms were used: Decision Trees (DT), Logistic Regression (LR), and Support Vector Machines (SVM). Table 4 summarizes the parameters used with each algorithm. Greater detail on the process and parameters used is given in the list below.
Decision Trees–
C4.5-based decision trees 22 were used. However, unlike the process used in STM, the numeric value of each feature was used instead of the simple presence or absence of a feature. In addition, two further criteria were examined for splitting nodes: accuracy and information gain. As shown in Table 4, five different feature sets were included as inputs to the decision tree.
Logistic Regression–
The same feature sets used in DT were also used. Models were created with logistic model trees, a method that builds trees with logistic regression models in their leaves. 29
Support Vector Machines–
The same feature sets used in DT and LR were also used. The performance using four different kernels was investigated: linear, polynomial, sigmoid, and radial basis function (RBF). 30
Finally, similar to the STM process, the performance of each combination of parameters was compared using 10-fold stratified cross-validation, 26 and the best performing models for each label and algorithm were selected.
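Stratification matters here because many labels have very few positive sentences; without it, some folds could contain no positives at all. A minimal sketch of how stratified folds might be formed for one binary label is shown below. The function name, the round-robin assignment, and the fixed seed are assumptions for illustration, not a description of the toolkit that was used.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split example indices into k folds with similar class proportions.

    Indices of each class are shuffled and then dealt out to the folds in
    round-robin order, so every fold receives a near-equal share of each class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


if __name__ == "__main__":
    labels = [1] * 20 + [0] * 80  # 20 positive and 80 negative sentences
    for fold in stratified_folds(labels):
        print(len(fold), sum(labels[i] for i in fold))
```

Each parameter combination would then be trained on nine folds and scored on the tenth, rotating through all ten, with the best-scoring combination kept per label and algorithm.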
Ensemble models
Ensemble models were used to capitalize on the strengths of different modeling techniques and methods (algorithms). Each method within an ensemble was given an equal vote. A sentence meeting or exceeding a set number of votes was predicted as “positive” for the specified label. A two-stage process determined the makeup of the ensembles.
The first stage focused on methods within a technique. All method combinations from the same technique were evaluated, allowing one, two, or three votes to decide on a positive classification. (Requiring only a single vote would increase recall at the expense of precision, whereas two or three votes would do the opposite.) For instance, STM had three methods for a total of seven combinations: {DT}, {kNN}, {SVM}, {DT, kNN}, …, {DT, kNN, SVM}. All seven combinations were evaluated using one vote, four combinations with two votes, and one combination with three votes, resulting in 12 evaluations. Individual model performance within a method was also investigated. Poor model performance can hurt the micro-averaged
The second stage combined methods from different techniques. The best two ensembles from each technique from the previous stage were selected. All combinations were done again (excluding combinations of only methods from the same technique), allowing one, two, or three votes using the same three cutpoints. For instance, assume
For submission, the best ensembles from four categories were compared and the top three were submitted. The categories were (1) rules only; (2) rules and STM; (3) rules and weights; and (4) rules, STM, and weights. Rules were included in each category because they were more likely to perform well on labels with few examples.
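The equal-vote scheme and the first-stage enumeration described in this section can be sketched as follows. The function names and the toy predictions are assumptions for illustration; the enumeration reproduces the count of 12 evaluations given for the three STM methods.

```python
from itertools import combinations


def vote(predictions, min_votes):
    """Combine 0/1 predictions from several methods with equal votes.

    `predictions` maps a method name to a list of 0/1 predictions, one per
    sentence. A sentence is positive when at least `min_votes` methods agree.
    """
    n = len(next(iter(predictions.values())))
    return [
        1 if sum(preds[i] for preds in predictions.values()) >= min_votes else 0
        for i in range(n)
    ]


def enumerate_ensembles(methods, max_votes=3):
    """List every (method combination, vote threshold) pair to evaluate."""
    configs = []
    for size in range(1, len(methods) + 1):
        for combo in combinations(methods, size):
            # A combination cannot require more votes than it has members.
            for votes in range(1, min(size, max_votes) + 1):
                configs.append((combo, votes))
    return configs


if __name__ == "__main__":
    configs = enumerate_ensembles(["DT", "kNN", "SVM"])
    print(len(configs))
    toy = {"DT": [1, 0, 1], "kNN": [1, 0, 0], "SVM": [0, 0, 1]}
    print(vote(toy, min_votes=2))
```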
Results and Discussion
Table 5 lists the
Training set
Notable is the large discrepancy in performance between the rules and the other two techniques. Due to small sample sizes and time constraints, the rules were built using the entire training dataset. Thus, their performance on the training dataset was expected to be optimistic relative to what would be seen on the test dataset. In contrast, the other two techniques both used stratified cross-validation to train and test models. Thus, the training results of the STM and weight-based models were assumed to be more in line with the performance that could be expected on test data.
After finding the best models per technique and method, a variety of ensemble models were created and tested. The ensemble models selected from the training set and submitted for the test set are shown in Tables 6 and 7. Table 6 shows what methods were included in the ensemble, the cutpoint used, and overall performance measures, whereas Table 7 breaks down
Training and testing performance by submission.
Since the rules were known to be overfit, the last two ensemble models were also calculated without including rules to get a more realistic performance estimate on the test set. Without rules, the submissions had
The first submission for the test set demonstrated the rules were overfit, dropping almost 0.50 in
The second and third submissions fared better than rules alone, increasing the
While the third submission did not result in any additional labels finding true positives, it did perform the best overall. Submission 3 had the highest
The results of the third submission were analyzed for errors. A random sample of up to 50 false positives and 50 false negatives was examined for each label. Overall, a few common themes emerged.
A clear delineation between various labels was difficult to discern. For instance, sentences were incorrectly classified as INFORMATION instead of INSTRUCTIONS and vice versa.
Complex language usage was not accounted for because our techniques employed shallow text analysis. For instance, errors were found in sentences with sarcasm (eg, “also am sorry you never cared” → ANGER), negation (eg, “… she doesn't love me …” → not LOVE), and emotions stated in a general sense rather than expressed by the writer (eg, “… all us good men expect from the woman we love …”→ not LOVE).
Wide variability in word usage and meaning made uncovering robust and generalizable patterns challenging, especially for rules. Having a document collection that spanned a 70-year period and included writers of heterogeneous backgrounds contributed to the variation.
Finally, it was unclear why some sentences were or were not assigned to certain labels in the gold standard. It appeared some assignments were based on context from surrounding sentences, but others were not as apparent.
Conclusion
This paper described our team's submissions to the 2011 i2b2 NLP shared task competition (track two). Our submissions used individual and ensemble systems consisting of regular expression-based rules, STM models, and weight-based models. Our three submissions obtained micro-averaged
Disclosures
Author(s) have provided signed confirmations to the publisher of their compliance with all applicable legal and ethical obligations in respect to declaration of conflicts of interest, funding, authorship and contributorship, and compliance with ethical requirements in respect to treatment of human and animal test subjects. If this article contains identifiable human subject(s) author(s) were required to supply signed patient consent prior to publication. Author(s) have confirmed that the published article is unique and not under consideration nor published by any other publication and that they have consent to reproduce any copyrighted material. The peer reviewers declared no conflicts of interest.
