Abstract
Text data are pervasive in organizations. Digitization (Cardie & Wilkerson, 2008) and the ease of creating online information (e.g., e-mail messages; Berry & Castellanos, 2008) contribute to the vast quantities of text generated each day. Embedded in these texts is information that may improve our understanding of organizational processes. Thus, organizational researchers increasingly seek ways to organize, classify, label, and extract opinions, experiences, and sentiments from text (Pang & Lee, 2008; Wiebe, Wilson, & Cardie, 2005). Until recently, the majority of text analyses in organizations relied on time-consuming and labor-intensive manual procedures, which are impractical and less effective for voluminous collections of text.
Similar to content analysis (Duriau, Reger, & Pfarrer, 2007; Hsieh & Shannon, 2005; Scharkow, 2013) and template analysis (Brooks, McCluskey, Turley, & King, 2015), a common objective of text analysis is to assign text to predefined categories. Manually assigning large collections of text to categories is costly and may become inaccurate and unreliable due to cognitive overload. Furthermore, idiosyncrasies among human coders may creep into the labeling process, resulting in coding errors. One workaround is to code only part of the text.
This article focuses on automatic text classification for several reasons. First, although text classification (henceforth TC) has been applied in various fields, such as in political science (Atteveldt, Kleinnijenhuis, Ruigrok, & Schlobach, 2008; B. Yu, Kaufmann, & Diermeier, 2008), occupational fraud (Holton, 2009), law (Gonçalves & Quaresma, 2005), finance (Chan & Chong, 2017; Chan & Franklin, 2011; Kloptchenko et al., 2004), and personality research (Shen, Brdiczka, & Liu, 2013), so far its uptake in organizational research is limited. Second, the use of TC is economical both in terms of time and cost (Duriau et al., 2007). Third, many of the techniques that have been developed in TC, such as sentiment analysis (Pang & Lee, 2008), genre classification (Finn & Kushmerick, 2006), and sentence classification (Khoo, Marom, & Albrecht, 2006) seem particularly well suited to address contemporary organizational research questions. Fourth, the acceptance and broader use of TC within the organizational research community can stimulate the development of novel TC techniques.
Tutorials or review-tutorials on TC that have been published so far (Harish, Guru, & Manjunath, 2010; Li & Jain, 1998; Sebastiani, 2002) were targeted mainly toward researchers in the field of machine learning and data mining. This has resulted in a skewed focus on technical and methodological details. In this article our goal is to balance the discussion among techniques, theoretical concepts, and validity concerns to increase the accessibility of TC to organizational researchers.
Below we first discuss the TC process, by pointing out key concerns and providing concrete recommendations at each step. Previous studies are cited to enrich the discussion and to illustrate different use cases. The second part is a hands-on tutorial using part of our own work as a running example. We applied TC to automatically extract nursing job tasks from nursing vacancies to augment nursing job analysis (Kobayashi, Mol, Kismihók, & Hesterberg, 2016). The findings from this study were used in the EU-funded Pro-Nursing (http://pro-nursing.eu) project which aimed to understand, among others, how nursing tasks are embedded in the nursing process. We also address validity assessment because the ability to demonstrate the validity of TC outcomes will likely be critical to its uptake by organizational researchers. Thus, we discuss and illustrate how to establish validity for TC outcomes. Specifically, we address assessing the predictive validity of the classifier and triangulating the output of the classification with other data sources (e.g., expert input and output from alternative analyses).
Text Classification
TC is defined as the automatic assignment of text to one or more predefined classes (Li & Jain, 1998; Sebastiani, 2002). Formally, the task of TC is stated as follows. Given a set of text and a set of categories, construct a model of the form f : X → C, which maps the representation X of a text onto one or more of the predefined categories in C.
An ideal classifier would mimic how humans process and deduce meaning from text. However, there are still many challenges before this becomes a reality. Natural languages contain high-level semantics and abstract concepts (Harish et al., 2010; Popping, 2012) that are difficult to articulate in computer language. For instance, the meaning of a word may change depending on the context in which it is used (Landauer, Foltz, & Laham, 1998). Also, lexical, syntactic, and structural ambiguities in text are continuing challenges that need to be addressed (Hindle & Rooth, 1993; Popping, 2012). Another issue is dealing with typographical errors or misspellings, abbreviations, and new lexicons. Strategies for dealing with these ambiguities need to be explicated during classifier development. Before a classifier is deployed it thus needs several rounds of training, testing, fine-tuning (of parameters), and repeated evaluation until acceptable levels of performance and validity are reached. The resulting classifier is expected to approximate the performance of human experts in classification tasks (Cardie & Wilkerson, 2008), but for a large corpus its advantage is that it will be able to do so in a faster, cheaper, and more reliable manner.
TC: The Process
The TC process consists of six interrelated steps, namely (a) text preprocessing, (b) text representation or transformation, (c) dimensionality reduction, (d) selection and application of classification techniques, (e) classifier evaluation, and (f) classifier validation. As with any research activity, before starting the TC process, we begin by formulating the research question and identifying text of interest. Here, we assume that classes are predefined and that the researcher has access to, or can gather, documents with known classes, that is, the training data.
Text Preprocessing for Classification
The purpose of preprocessing is to remove irrelevant bits of text as these may obscure meaningful patterns and lead to poor classification performance and redundancy in the analysis (Uysal & Gunal, 2014). During preprocessing we first apply tokenization, which splits the running text into basic units such as words (tokens).
Punctuation and numbers, if deemed irrelevant to the classification task at hand, are removed, although in some cases these may be informative and thus retained (exclamation marks or emoticons, for instance, may be indicative of sentiment). Dictionaries or lexicons are used to correct spelling and to resolve typos and abbreviations. Words that are known to have low information content, such as conjunctions and prepositions, are typically deleted. These words are called stopwords.
During preprocessing one may also apply word normalization, such as stemming or lemmatization, which reduces inflected word forms to a common base form.
A practical question is: which preprocessing techniques should be applied for a given text? The answer is largely determined by the nature of the text (e.g., language and genre), the problem that we want to address, and the application domain (Uysal & Gunal, 2014). Any given preprocessing procedure may be useful for a specific domain of application or language but not for others. Several empirical studies have demonstrated the effect of preprocessing on classification performance. For example, stemming in the Turkish language does not seem to make a difference in classification performance when the size of the training data set is large (Torunoğlu, Çakırman, Ganiz, Akyokuş, & Gürbüz, 2011). In some applications stemming even appears to degrade classification performance, particularly in the English and Czech languages (Toman et al., 2006). In the classification of English online news, the impact of both stemming and stopword removal is negligible (Song, Liu, & Yang, 2005). In general, the classification of English and Czech documents benefits from stopword removal but may suffer from word normalization (Toman et al., 2006). For the Arabic language, certain classifiers benefit from stemming (Kanaan, Al-Shalabi, Ghwanmeh, & Al-Ma’adeed, 2009). In spam email filtering, some words typically seen as stopwords (e.g., “however” or “therefore”) were found to be particularly rare in spam email; hence, in this application, they should not be removed (Méndez, Iglesias, Fdez-Riverola, Díaz, & Corchado, 2006).
Recommendation
For English documents, our general recommendation is to apply word tokenization, convert uppercase letters to lowercase, and apply stopword removal (except for short text such as email messages and product titles; Méndez et al., 2006; H.-F. Yu, Ho, Arunachalam, Somaiya, & Lin, 2012). Since the effects of normalization have been mixed, our suggestion is to apply it only when there is no substantial degradation in classification performance, since it can increase classification efficiency by reducing the number of terms. When in doubt whether to remove numbers or punctuation (or other symbols), our advice is to retain them and apply the dimensionality reduction techniques discussed in the section on text transformation below.
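To make these recommendations concrete, here is a minimal Python sketch (the hands-on tutorial later in this article uses R; the stopword list below is a tiny illustrative subset, not a complete list):

```python
import re

# Illustrative stopword subset; real applications would use a fuller,
# language-specific list (and, per the text, skip removal for short texts).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for", "with"}

def preprocess(text):
    """Tokenize, lowercase, and remove stopwords from a raw string."""
    tokens = re.findall(r"[a-z']+", text.lower())  # word tokenization
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The nurse administers medication to patients in the ward."))
# ['nurse', 'administers', 'medication', 'patients', 'ward']
```

The output of this step, a list of tokens per document, is the input for the text transformation step that follows.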
Text Transformation (X )
Text transformation is about representing documents so that they form a suitable input to a classification algorithm. In essence, this comprises imposing structure on a previously unstructured text. Most classification algorithms accept vectors or matrices as input. Thus the most straightforward way is to represent a document as a vector and the corpus as a matrix.
The most common way to transform text is to use the so-called vector space model (VSM), in which each document is represented as a vector of term weights. In the basic version, each weight is simply the count of the term's occurrences in the document.
Other weighting options can be derived from basic count weighting. One can take the logarithm of the counts to dampen the effect of highly frequent terms. Here we need to add 1 to the counts so that we avoid taking the logarithm of zero counts. It is also possible to normalize with respect to document length by dividing each count by the maximum term count in a given document. This ensures that frequent terms in long documents are not overrepresented. Apart from the weights of the terms in each document, terms can also be weighted with respect to the corpus. Common corpus-based weights include the inverse document frequency (IDF), which assesses the specificity of terms in a corpus (Algarni & Tairan, 2014). Terms that occur in too few (large IDF) or in too many (IDF close to zero) documents have low discriminatory power and are therefore not useful for classification purposes. The formula for IDF is IDF(t) = log(N/DF(t)), where N is the total number of documents in the corpus and DF(t) is the number of documents containing term t.
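The weighting schemes just described can be sketched in a few lines of Python (the toy documents are invented for illustration):

```python
import math
from collections import Counter

# Three toy documents, already tokenized and preprocessed
docs = [["nurse", "patient", "care"],
        ["patient", "record", "care"],
        ["budget", "report"]]

N = len(docs)

def idf(term):
    """Inverse document frequency: log(N / number of docs containing term)."""
    df = sum(term in d for d in docs)
    return math.log(N / df)

counts = Counter(docs[0])                                      # basic counts
log_counts = {t: math.log(1 + c) for t, c in counts.items()}   # damped counts
tfidf = {t: c * idf(t) for t, c in counts.items()}             # count x IDF

print(round(idf("care"), 3))    # in 2 of 3 docs: log(3/2), about 0.405
print(round(idf("budget"), 3))  # in 1 of 3 docs: log(3), about 1.099
```

Note how the rarer term receives the larger IDF, reflecting its greater specificity in the corpus.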
Although the VSM ignores word order information, it is popular due to its simplicity and effectiveness. Ignoring word order means losing some information regarding the semantic relationships between words. Also, words alone may not always express true atomic units of meaning. Some researchers improve the VSM by adding adjacent word pairs or trios (bigrams and trigrams) as additional features.
Text transformation plays a critical role in determining classification performance. Inevitably some aspects of the text are lost in the transformation phase. Thus, when resulting classification performance is poor, we recommend that the researcher reexamines this step. For example, while term-based features are popular, if performance is poor one could also consider developing features derived from linguistic information (e.g., parts of speech) contained in text (Gonçalves & Quaresma, 2005; Kobayashi et al., 2017; Moschitti & Basili, 2004) or using consecutive characters instead of whole words (e.g., n-grams; Cavnar & Trenkle, 1994).
Reducing dimensionality
Even after preprocessing, transformation through the VSM is still likely to result in a large feature set. Too large a number of features is undesirable because it may increase computational time and may degrade classification performance, especially when there are many redundant and noisy features (Forman, 2003; Guyon & Elisseeff, 2003; Joachims, 1998). The size of the vector, and hence the size of the feature set, is referred to as the dimensionality of the representation.
One way to eliminate features is to first assign scores to each feature and then remove features by setting a cutoff value. This is called feature selection by scoring. Scoring methods that do not use class membership information, such as the document frequency (DF) of a term, are known as unsupervised.
Another group of strategies to score features is to make use of class membership information in the training data. These methods are called supervised scoring methods; common examples are the chi-square statistic (CHI) and information gain (IG).
An alternative to scoring methods is to create latent orthogonal features by combining existing features. Methods that construct new features from existing ones are known as feature transformation methods; well-known examples are latent semantic analysis (LSA) and nonnegative matrix factorization.
Recommendation
Our recommendation is to start with the traditional VSM, that is, transform the documents into vectors using single terms as features. For the unsupervised scoring, compute the DF of each term and filter out terms with very low and very high DF, customarily those terms belonging to the lower 5th and upper 99th percentiles. For the supervised scoring try CHI and IG, and for the feature transformation try LSA and nonnegative matrix factorization. Compare the effect on classification performance of the different feature sets generated by the methods and choose the feature set that yields the highest performance (e.g., accuracy). We also suggest trying combinations of scoring and transformation methods. For example, one can first run CHI and then perform LSA on the terms selected by CHI. Note that the quality of the feature set (and that of the representation) is assessed based on its resulting classification performance (Forman, 2003).
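The CHI-then-LSA combination can be sketched with scikit-learn in Python (an illustrative sketch with invented toy sentences and labels; the article's own tutorial uses R, where analogous facilities exist):

```python
# Sketch of supervised scoring (chi-square) followed by feature
# transformation (LSA as truncated SVD of the reduced DTM).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import TruncatedSVD

docs = ["wound care and patient hygiene",
        "administer medication to the patient",
        "quarterly budget report for management",
        "management review of the annual report"]
labels = [1, 1, 0, 0]  # 1 = task sentence, 0 = non-task (toy labels)

X = CountVectorizer().fit_transform(docs)        # DTM: 4 docs x terms

# Supervised scoring: keep the 5 terms with the highest chi-square score
X_chi = SelectKBest(chi2, k=5).fit_transform(X, labels)

# Feature transformation: LSA on the terms selected by CHI
X_lsa = TruncatedSVD(n_components=2).fit_transform(X_chi)

print(X.shape[0], X_chi.shape, X_lsa.shape)  # 4 (4, 5) (4, 2)
```

Each resulting feature set (full DTM, CHI-selected, LSA-transformed) would then be compared on classification performance, as recommended above.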
For LSA and nonnegative matrix factorization, we need to decide how many dimensions to retain. For LSA, Fernandes, Artífice, and Fonseca (2017) offered this formula as a rough guide
Application of TC Algorithms (f )
The transformed text, usually the original DTM or the dimensionality reduced DTM, serves as input to one or more classification techniques. Most techniques are from the fields of machine learning and statistics. There are three general types of techniques: (a) geometric, (b) probabilistic, and (c) logical (Flach, 2012).
Geometric algorithms assume that the documents can be represented as points in a hyperspace, the dimensions of which are the features. This means that distances between documents and lengths of the documents can be defined as well. In this representation, nearness implies similarity. An example of a geometric classifier is the support vector machine (SVM), which separates the classes by a hyperplane with the largest possible margin.
Probabilistic algorithms compute a joint probability distribution between the observations (e.g., documents) and their classes. Each document is assumed to be an independent random draw from this joint probability distribution. The key point in this case is to estimate the posterior probability of each class given the observed features; a well-known example of a probabilistic classifier is naive Bayes.
The third type of algorithm is the logical classifier, which accomplishes classification by means of logical rules (Dumais, Platt, Heckerman, & Sahami, 1998; Rokach & Maimon, 2005). An example of such a rule in online news categorization is: if an article contains any of the stemmed terms “vs”, “earn”, or “loss” and not the words “money”, “market open”, or “tonn”, then classify the article under category “earn” (Rullo, Cumbo, & Policicchio, 2007). The rules in logical models are readable and thus facilitate revision and, if necessary, correction of how the classification works. An example of a logical classifier is a decision tree.
Naive Bayes and support vector machines are popular choices (Ikonomakis, Kotsiantis, & Tampakas, 2005; Joachims, 1998; Li & Jain, 1998; Sebastiani, 2002). Both can efficiently deal with high dimensionality and data sparsity, though in naive Bayes appropriate smoothing will need to be applied to adjust for terms which are rare in the training data. The method of K-nearest neighbor works well when the amount of training data is large. Both logistic regression and discriminant analysis yield high performance if the features are transformed using LSA. The performance of decision trees has been unsatisfactory. A number of researchers therefore recommend the strategy of training and combining several classifiers to increase classification performance, which is known as ensemble learning (Breiman, 1996; Dietterich, 1997; Dong & Han, 2004; Polikar, 2012). This kind of classification can be achieved in three ways. The first is using a single method and training it on different subsets of the data. Examples include bagging and boosting, which both rely on resampling. Random forest is a combination of bagging and random selection of features that uses decision trees as base learners. Gradient boosted trees, a technique that combines several decision trees, has been shown to significantly increase performance as compared with that of individual decision trees (Ferreira & Figueiredo, 2012). The second is using a single method but varying the training parameters such as, for example, using different initial weights in neural networks (Kolen & Pollack, 1990). The third is using different classification techniques (naive Bayes, decision trees, or SVM; Li & Jain, 1998) and combining their predictions using, for instance, the majority vote.
Recommendation
Rather than using a single technique, we suggest applying different methods, pairing different algorithms and feature sets (including those obtained from feature selection and transformation) and choosing the pair with the lowest error rate. For example, using the DTM, apply SVM, naive Bayes, random forest, bagging, and gradient boosted trees. When feature transformation has been applied (e.g., LSA and nonnegative matrix factorization), use logistic regression or discriminant analysis. When the training data are large (e.g., hundreds of thousands of cases), use K-nearest neighbors. Rule-based algorithms are seldom used in TC; however, if readability and efficiency are desired in a classifier, then these can be trialed as well.
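A minimal sketch of this compare-and-choose strategy, in Python with scikit-learn (the sentences and labels are invented toy data; a real application would use far more labeled text):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy labeled sentences: 1 = task sentence, 0 = non-task
train_docs = ["dress wounds daily", "monitor patient vital signs",
              "prepare the annual budget", "chair the management meeting",
              "administer prescribed medication", "file the quarterly report"]
train_labels = [1, 1, 0, 0, 1, 0]

vec = CountVectorizer()
X_train = vec.fit_transform(train_docs)

# Train two of the recommended techniques on the same DTM and compare
# their predictions on a new sentence.
for name, clf in [("naive Bayes", MultinomialNB()),
                  ("linear SVM", LinearSVC())]:
    clf.fit(X_train, train_labels)
    pred = clf.predict(vec.transform(["monitor medication daily"]))
    print(name, "->", pred[0])
```

In practice, each classifier would be evaluated with a held-out test set or cross-validation rather than eyeballed, and the best-performing pairing of algorithm and feature set retained.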
Evaluation Measures
Crucial to any classification task is the assessment of the performance of classifiers using evaluation measures (Powers, 2011; Yang, 1999). These measures indicate whether a classifier models the relationship between features and class membership well, and may thus be used to indicate the extent to which the classifier is able to emulate a human coder. The most straightforward evaluation measure is accuracy, which is calculated as the proportion of correct classifications. Accuracy ranges from 0 to 1 (or 0 to 100 when expressed as a percentage). The higher the accuracy the better the classifier (1 corresponds to perfect classification). However, in the case of imbalanced classification (i.e., when there is one class with only a few documents) and/or unequal costs of misclassification, accuracy may not be appropriate. An example is detecting career shocks (cf. Seibert, Kraimer, Holtom, & Pierotti, 2013) in job forums. Since it is likely that only a small fraction of these postings pertain to career shocks (suppose .05), a classifier can still have a high accuracy (equal to .95) even if it classifies all discussions as containing no career shock content.
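A few lines of Python reproduce this pitfall with made-up labels (5% positives, mirroring the career-shock example above):

```python
# With 5% positives, a degenerate classifier that always predicts
# "no career shock" still attains 95% accuracy.
true_labels = [1] * 5 + [0] * 95   # 5 career-shock posts among 100
predictions = [0] * 100            # "always negative" model

accuracy = sum(t == p for t, p in zip(true_labels, predictions)) / len(true_labels)
print(accuracy)  # 0.95, despite missing every positive case
```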
Alternative measures to accuracy are precision, recall, F-measure (Powers, 2011), specificity, breakeven point, and balanced accuracy (Ogura, Amano, & Kondo, 2011). In binary classification, classes are commonly referred to as positive and negative. Classifiers aim to correctly identify observations in the positive class. A summary table which can be used as a reference for computing these measures is presented in Figure 1. The entries of the table are as follows: TP stands for true positives, TN for true negatives, FP for false positives (i.e., negative cases incorrectly classified into the positive class), and FN for false negatives (i.e., positive cases incorrectly classified into the negative class). Hence the five evaluation measures are computed as follows:

Precision = TP/(TP + FP); Recall = TP/(TP + FN); F-measure = 2 × Precision × Recall/(Precision + Recall); Specificity = TN/(TN + FP); Balanced accuracy = (Recall + Specificity)/2.
Figure 1. Confusion matrix as a reference to compute the evaluation measures.
The breakeven point is the value at which precision and recall are equal.
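These measures can be computed directly from the confusion-matrix entries; the following Python function is an illustrative sketch (the matrix entries here are invented for the example):

```python
def evaluation_measures(tp, tn, fp, fn):
    """Compute the five measures from confusion-matrix entries."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # also called sensitivity
    f_measure = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    balanced_accuracy = (recall + specificity) / 2
    return precision, recall, f_measure, specificity, balanced_accuracy

# Hypothetical confusion matrix: 40 TP, 30 TN, 10 FP, 20 FN
p, r, f, s, b = evaluation_measures(tp=40, tn=30, fp=10, fn=20)
print(round(p, 2), round(r, 2), round(f, 2), round(s, 2), round(b, 2))
# 0.8 0.67 0.73 0.75 0.71
```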
Evaluation measures are useful to compare the performance of several classifiers (Alpaydin, 2014). Thus, one can probe different combinations of feature sets and classification techniques to determine the best combination (i.e., the one which gives the optimal value for the evaluation measure). Apart from classification performance, one can also take the parsimony of the trained classifier into account by examining the relative size of the different feature sets, since they determine the complexity of the trained classifier. In line with Occam’s razor, when two classifiers have the same classification performance, the one with the lower number of features is to be preferred (Shreve, Schneider, & Soysal, 2011).
Evaluation measures are computed from the labeled data. It is not advisable to use all labeled data to train the classifier since this might result in overfitting: the classifier then captures idiosyncrasies of the training data and performs poorly on new documents.
Cross-validation can be applied by computing not only one value for the evaluation measure but several values corresponding to different splits of the data. A systematic strategy to evaluate a classifier is to use K-fold cross-validation: the labeled data are partitioned into K parts, each part in turn serves as the test set while the classifier is trained on the remaining parts, and the K resulting values of the evaluation measure are averaged.
Recommendation
Since accuracy may give misleading results when classes are imbalanced we recommend using measures sensitive to this, such as the F-measure or balanced accuracy (Powers, 2011). For the systematic evaluation of the classifier we advise using K-fold cross-validation.
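As an illustrative sketch (Python with scikit-learn, using invented toy sentences and labels; the article's own tutorial uses R), K-fold cross-validation with the F-measure looks like:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

docs = ["dress wounds daily", "monitor patient vital signs",
        "administer prescribed medication", "prepare the annual budget",
        "chair the management meeting", "file the quarterly report"]
labels = [1, 1, 1, 0, 0, 0]

X = CountVectorizer().fit_transform(docs)

# 3-fold cross-validation with the F-measure as evaluation metric;
# each fold in turn is held out for testing, the rest used for training.
scores = cross_val_score(MultinomialNB(), X, labels, cv=3, scoring="f1")
print(len(scores))  # one F-measure value per fold
```

The per-fold values are then averaged (and their spread inspected) to obtain a more stable estimate of classifier performance than a single train/test split gives.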
Model Validity
Figure 2 illustrates that a classification model consists of features and the generic classification algorithm (Domingos, 2012). Thus the validity of the classification model depends both on the choice of features and the algorithm.

Figure 2. Diagrammatic depiction of the text classification process.
Many TC applications use the set of unique words as the feature set (i.e., VSM). For organizational researchers this way of specifying the initial set of features may seem counterintuitive since features are constructed in an ad hoc and inductive manner, that is, without reference to theory. Indeed, specifying the initial set of features, scoring features, transforming features, evaluating features, and modifying the set of features in light of the evaluation constitutes a data-driven approach to feature construction and selection (Guyon, Gunn, Nikravesh, & Zadeh, 2008). The validity of the features is ultimately judged in terms of the classification performance of the resulting classification model. But this does not mean that researchers should abandon theory-based approaches. If there is prior knowledge or theory that supports the choice of features then this can be incorporated (Liu & Motoda, 1998). Theory can also be used as a basis for assigning scores to features, such as using theory to rank features according to importance. Our recommendation, however, would be to have theory complement, as opposed to restrict, feature construction, because powerful features (that may even be relevant to subsequent theory building and refinement) may emerge inductively.
The second component, the classification algorithm, models the relationship between features and class membership. Similar to the features, the validity of the algorithm is ultimately determined from the classification performance and is also for the most part data driven. The validity of both the features and the classification algorithm establishes the validity of the classification model.
A useful strategy to further assess the validity of the classification model is to compare the classifications made by the model with the classification of an independent (group of) human expert(s). Usually agreement between the model and the human expert(s) is quantified using measures of concordance or measures of how close the classifications of the two correspond to one another (such as Cohen’s kappa for interrater agreement, where one “rater” is the classifier). Using expert knowledge, labels can also be checked against standards. For example, in job task extraction from a specific set of job vacancies one can check with experts or job incumbents to verify whether the extracted tasks correspond to those tasks actually carried out on the job and whether specific types of tasks are under- or overrepresented.
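Cohen's kappa can be computed directly from the two label sequences; the following Python sketch (with invented labels for ten documents) illustrates the calculation:

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two label sequences (e.g., classifier vs. expert)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from the marginal label proportions of each rater
    expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n)
        for c in set(rater_a) | set(rater_b)
    )
    return (observed - expected) / (1 - expected)

model  = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]  # classifier output
expert = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]  # independent human labels
print(round(cohens_kappa(model, expert), 2))  # 0.6
```

A kappa of 1 indicates perfect agreement and 0 indicates agreement no better than chance, so the value quantifies how far the classifier's output concords with the expert beyond what chance alone would produce.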
Once model validity is established one may start applying the classification model to unlabeled data. However, the model will still need to be reevaluated from time to time. When the performance drops below an acceptability threshold, there are four possible solutions: (a) add more features or change existing features, (b) try other classification algorithms, (c) do both, and/or (d) collect more data or label additional observations.
Other Issues in TC
In this section we discuss how to deal with multiclass classification, where there is an increased likelihood of classes being imbalanced, and provide some suggestions on determining training size and what to do when obtaining labeled data is both expensive and difficult.
Multiclass classification
Multiclass classification pertains to dealing with more than two categories. The preprocessing and representation parts are the same as in the binary case. The only changes are in the choices of supervised feature selection techniques, classification techniques, and evaluation measures. Most supervised feature selection techniques can be easily generalized to more than two categories. For example, when calculating CHI, we just need to add an extra column to the two-way contingency table. Most techniques for classification we discussed previously have been extended to multiclass classification. For example, techniques suited for binary classification problems (e.g., SVM) are extended to the multiclass case by breaking the multiclass problem into several binary classification problems in either one-against-all or one-against-one approaches. In the former approach we build binary classifiers by taking each category as the positive class and merging the others into the negative class. Hence, if there are K categories, the one-against-all approach yields K binary classifiers, whereas the one-against-one approach builds one classifier for each pair of categories, K(K - 1)/2 in total.
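The one-against-all decomposition can be sketched as follows (Python with scikit-learn and invented toy data; the category names are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["dress wounds daily", "monitor vital signs",
        "prepare the budget", "file the quarterly report",
        "schedule staff shifts", "approve holiday requests"]
labels = ["clinical", "clinical", "finance", "finance", "admin", "admin"]
classes = sorted(set(labels))

X = CountVectorizer().fit_transform(docs)

# One-against-all: one binary SVM per category, each treating its own
# category as positive and all other categories merged into the negative class.
classifiers = {}
for c in classes:
    binary = [1 if label == c else 0 for label in labels]
    classifiers[c] = LinearSVC().fit(X, binary)

print(len(classifiers))  # K categories -> K binary classifiers (here 3)
```

A new document is then scored by every binary classifier and assigned to the category whose classifier gives the strongest positive response.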
These evaluation measures can also be extended to classifications with more than two classes by computing them per category, in the same way as in one-against-all, and averaging the results. An example is the extension of the F-measure called the weighted F-measure, in which the per-category F-measures are averaged with weights reflecting the relative size of each category.
Imbalanced classification
By and large, in binary classification, when the number of observations in one class represents less than 20% of the total number of observations then the data can be seen as imbalanced. The main danger of imbalanced classification is that we may train a classifier with a high accuracy even if it fails to correctly classify the observations in the minority class. In some cases, we are more interested in detecting the observations in the minority class. At the same time however, we also want to avoid many false detections.
Obvious fixes are to label more observations until the classes are balanced, as was done by Holton (2009), or to disregard some observations in the majority class. In cases where classification problems are inherently imbalanced and labeling additional data is costly and difficult, another approach is to oversample the minority class or to undersample the majority class during classifier training and evaluation. A strategy called the synthetic minority oversampling technique (SMOTE) is based on oversampling, but instead of selecting existing observations in the minority class it creates synthetic samples to increase the number of observations in the minority class (Chawla, Bowyer, Hall, & Kegelmeyer, 2002). Preprocessing and representation remain the same as with balanced classes. The parts that make use of class membership need to be adjusted for imbalanced data.
There are options for supervised dimensionality reduction for imbalanced classification such as those provided by Ogura et al. (2011). For the choice of classification techniques, those discussed previously can be used with minor variations such as adjusting the costs of misclassification, which is known as cost-sensitive classification (Elkan, 2001). Traditional techniques apply equal costs of misclassification to all categories, whereas for cost-sensitive classification we can assign a large cost to incorrect classification of observations in the minority class. For the choice of evaluation measures, we suggest using the weighted F-measure or balanced accuracy. One last suggestion is to treat imbalanced classification as an anomaly or outlier detection problem where the observations in the minority class are the outliers (Chandola, Banerjee, & Kumar, 2009).
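A minimal sketch of random oversampling (plain Python with invented data; SMOTE itself would instead interpolate synthetic points between minority-class neighbors):

```python
import random

random.seed(0)

# Toy imbalanced data: 90 majority ("neg") vs. 10 minority ("pos") cases
data = [("doc%d" % i, "neg") for i in range(90)] + \
       [("doc%d" % i, "pos") for i in range(90, 100)]

def oversample_minority(data, minority_label):
    """Randomly duplicate minority cases until the classes are balanced."""
    minority = [d for d in data if d[1] == minority_label]
    majority = [d for d in data if d[1] != minority_label]
    extra = random.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

balanced = oversample_minority(data, "pos")
pos = sum(1 for _, label in balanced if label == "pos")
print(pos, len(balanced) - pos)  # 90 90
```

Note that oversampling should be applied only to the training portion of the data; the test set must keep the original class distribution so the evaluation reflects the real problem.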
Size of the training data
A practical question that often arises is how many documents one should label to ensure a valid classifier. The size of the training dataset depends on many considerations, such as the cost and limitations associated with acquiring prelabeled documents (e.g., ethical and legal impediments) and the kind of learning framework we are using. In the probably approximately correct (PAC) learning framework, which is perhaps the most popular framework for learning concepts (such as the concept of spam emails or party affiliation), training size is determined by the type of classification technique, the representation size, the maximum error rate one is willing to tolerate, and the probability of not exceeding the maximum error rate. Under the PAC learning framework, formulae have been developed to determine the lower bound for the training size, an example being the one by Goldman (2010):
Although formulae provide theoretical guarantees, determining training size is largely empirically driven and involves a good deal of training, evaluation, and validation. To give readers an idea of training sizes as typically found in practice, Table 1 provides information about the training data sizes for some existing TC studies.
Training Sizes, Number of Categories, Evaluation Measures, and Evaluation Procedures Used in Various Text Classification Studies.
Suggestions when labeled data are scarce
In many classification problems, labeled data are costly or difficult to obtain. Fortunately, even in this case, principled approaches can be applied. In practice, unlabeled data are plentiful and we can apply techniques to make use of the structure and patterns in the unlabeled data. This approach of using unlabeled data in classification is called semisupervised learning.
Another approach is to use classification output to help us determine which observations to label. In this way, we take a targeted approach to labeling by labeling those observations which are most likely to generate better classifiers. This is called active learning.
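Uncertainty sampling, a common active learning strategy, can be sketched as follows (Python with scikit-learn; the documents and labels are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

labeled_docs = ["dress wounds daily", "monitor patient vital signs",
                "prepare the annual budget", "file the quarterly report"]
labeled_y = [1, 1, 0, 0]  # 1 = task sentence, 0 = non-task
unlabeled_docs = ["administer medication", "budget meeting agenda",
                  "monitor the budget report"]

vec = CountVectorizer().fit(labeled_docs + unlabeled_docs)
clf = LogisticRegression().fit(vec.transform(labeled_docs), labeled_y)

# Uncertainty sampling: ask a human to label the document whose predicted
# class probability is closest to 0.5, i.e., where the model is least sure.
probs = clf.predict_proba(vec.transform(unlabeled_docs))[:, 1]
most_uncertain = min(range(len(unlabeled_docs)),
                     key=lambda i: abs(probs[i] - 0.5))
print(unlabeled_docs[most_uncertain])
```

After the selected document is labeled it is added to the training set, the classifier is retrained, and the cycle repeats, so the labeling budget is spent where it improves the classifier most.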
Tutorial
We developed the following tutorial to provide a concrete treatment of TC. Here we demonstrate TC using actual data and code. Our intended audience is researchers who have little or no experience with TC. This tutorial is a scaled-down version of our work on using TC to automatically extract job tasks from job vacancies. Our objective is to build a classifier that automatically classifies sentences into task or nontask categories. The sentences were obtained from German-language nursing job vacancies.
We set out to automate the process of classification because one can then deal with huge numbers (i.e., millions) of vacancies. The output of the text classifier can be used as input to other research or tasks such as job analysis or the development of tools to facilitate personnel decision making. We used the R software since it has many ready-to-use facilities that automate most TC procedures. We provide the annotated R scripts and data to run each procedure. Both code and data can be downloaded as a Zip file from GitHub; the URL is https://github.com/vkobayashi/textclassificationtutorial. The R scripts are named in the following format: CodeListing (CL) <number>.R, and in this tutorial we reference them as CL <number>. Thus, CL 1 refers to the script CodeListing_1.R. Note that the CL files contain detailed descriptions of each command, and that each command should be run sequentially.
All the scripts were tested and are expected to work on any computer (PC or Mac) with R, RStudio, and the required libraries installed. However, basic knowledge of how to start R, open R projects, run R commands, and install packages in RStudio is needed to run and understand the code. For those new to R we recommend following an introductory R tutorial (see, for example, DataCamp [www.datacamp.com/courses/free-introduction-to-r] or tutorialspoint [www.tutorialspoint.com/index.htm] for free R tutorials).
This tutorial covers each of the previously enumerated TC steps in sequence. For each step we first explain the input, then elaborate the process, and finally present the output, which is often the input for the subsequent step. Table 2 summarizes the input, process, and output for each step in this tutorial. Finally, after downloading the code and data, open the text_classification_tutorial.Rproj file. The reader should then run the code for every step as we go along, so as to be able to examine the input and the corresponding output.
Text Classification Based on the Input-Process-Output Approach.
Preparing Text
The input for this step consists of the raw German job vacancies. These vacancies were obtained from Monsterboard (www.monsterboard.nl). Since the vacancies are webpages, they are in hypertext markup language (HTML), the standard markup language for representing content in web documents (Graham, 1995). Apart from the relevant text (i.e., content), raw HTML pages also contain elements used for layout. Therefore, a technique known as HTML parsing is used to separate the content from the layout.
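CL 1 performs this parsing in R. To make the idea concrete for readers outside R, the following Python sketch (standard library only; the sample HTML snippet and function names are made up for illustration) shows the essence of separating content from layout:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style elements."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1
    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

page = "<html><head><style>p{color:red}</style></head><body><p>Pflege der Patienten.</p></body></html>"
print(html_to_text(page))  # Pflege der Patienten.
```

Real-world pages are messier (nested layout elements, character entities, malformed markup), which is why dedicated parsing libraries are preferred over hand-rolled extraction.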
In R, parsing HTML pages can be done using the
To extract text from several HTML files, the codes in CL 1 are put in a loop in CL 2. The function
Preprocessing Text
The preprocessing step consists of two stages. The first identifies sentences in the vacancies, since the sentence is our unit of analysis, and the second applies text preprocessing operations to the sentences. We used sentences as our unit of analysis because our assumption is that the sentence is the right resolution level to detect job task information. We did not use the vacancy as our unit of analysis since a vacancy may contain more than one task. In fact, even if we chose to treat the vacancy as the unit of analysis, it would still be important to identify which of the sentences contain task information. Another reason to select the sentence as the unit of analysis is to minimize variance in document length. Input for the first stage are the text files generated in the previous step, and the output sentences from this stage serve as input to the second stage. CL 3 contains functions that can detect sentences in the parsed HTML file from the previous section (i.e.,
The code loads the
For multiple text files, the code should again be run in a loop. One large text file will then be generated containing the sentences from all parsed vacancies. Since we put all sentences from all vacancies in a single file, we attached the name of the corresponding vacancy text file to each sentence to facilitate tracing each sentence back to its source vacancy. Thus, the resulting text file has two columns: the first column contains the file names of the vacancies from which the sentences in the second column were derived.
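The sentence detection itself is done by the functions in CL 3; conceptually, a minimal (and deliberately naive) splitter can be sketched in Python as follows. A production annotator additionally handles abbreviations, numbers, and other edge cases; the file name below is hypothetical.

```python
import re

def split_sentences(text):
    """Naive splitter: break after ., !, or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

# Tag each sentence with its source file, mirroring the two-column
# output file described above (file name, sentence).
vacancy_file = "vacancy_001.txt"  # hypothetical file name
text = "Sie pflegen unsere Patienten. Sie dokumentieren die Pflege. Bewerben Sie sich jetzt!"
rows = [(vacancy_file, s) for s in split_sentences(text)]
for fname, sentence in rows:
    print(fname, "\t", sentence)
```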
After applying sentence segmentation on the parsed vacancy in
The sentences are imported as a data frame in R (see CL 4). Since the sentence is our unit of analysis, hereafter we refer to these sentences as documents. The first column is temporarily ignored since it contains only the names of the vacancy files. Since the sentences are now stored in a vector (the second column of the data frame), the VectorSource() function is used. The source determines where to find the documents; in this case the documents are in mysentences[,2]. If the documents are stored in another source, for example in a directory rather than in a vector, one can use DirSource(). For a list of supported sources, invoke the function getSources(). Once the source has been set, the next step is to create a corpus from this source using the VCorpus() function. In the tm package, the corpus is the main structure for managing documents. Several preprocessing procedures can be applied to the documents once they are collected in the corpus. Many popular preprocessing procedures are available in this package. Apart from the existing procedures, users can also specify their own via user-defined functions. The procedures we applied are encapsulated in the
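The tm pipeline applies operations such as lowercasing, punctuation removal, and stopword removal to every document in the corpus. A minimal Python sketch of such a cleaning function follows; the stopword list here is a tiny illustrative sample, not the full list shipped with tm.

```python
import re

# Tiny illustrative German stopword list (the tm package ships a full one).
STOPWORDS = {"der", "die", "das", "und", "sie", "wir", "von", "mit", "im", "in"}

def preprocess(sentence):
    """Lowercase, strip punctuation and digits, drop stopwords."""
    sentence = sentence.lower()
    sentence = re.sub(r"[^a-zäöüß\s]", " ", sentence)
    return [t for t in sentence.split() if t not in STOPWORDS]

print(preprocess("Sie pflegen und betreuen die Patienten."))
# ['pflegen', 'betreuen', 'patienten']
```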
Text Transformation
CodeListing 5 details the structural transformation of the documents. The input in this step is the output from the preceding step (i.e., the cleaned sentences in the training data). To quantify text characteristics, we use the VSM because this is the simplest and perhaps most straightforward approach to quantify text and thus forms an appropriate starting point in the application of TC (Frakes & Baeza-Yates, 1992; Salton et al., 1975). For this transformation, the
The
We mentioned previously that for word features one can use raw counts as weights. The idea of using raw counts is that the higher the count of a term in a document, the more important the term is in that document. The
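As a concrete illustration of a DTM with raw counts as weights, consider three toy (already preprocessed) documents; the sketch below is in Python rather than R purely for compactness:

```python
from collections import Counter

docs = [
    ["pflege", "patienten"],
    ["dokumentation", "pflege"],
    ["patienten", "betreuen", "patienten"],
]

# Vocabulary: one column per distinct term, sorted for a stable order.
vocab = sorted({t for doc in docs for t in doc})

# Document-term matrix: one row per document, raw term counts as weights.
dtm = [[Counter(doc)[term] for term in vocab] for doc in docs]

print(vocab)
for row in dtm:
    print(row)
```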
The DF can also be used to assign a “weight” to a feature that reflects its importance with respect to the entire corpus. Another useful property of the DF is that it gives us an idea of what the corpus is about. For our example, the word with the highest DF (excluding stopwords) is
Another common text analysis strategy is to find keywords in documents. The keywords may be used as a heuristic to determine the most likely topic of each document. For this we can use the TF-IDF measure. The keyword for each document is the word with the maximum TF-IDF weight (ties are resolved through random selection). The code in CL 6 computes the keyword for each document. For example, the German keyword for Document 4 is
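CL 6 computes this in R; the underlying arithmetic can be sketched as follows (toy documents, TF-IDF with a plain logarithmic inverse document frequency; other weighting variants exist):

```python
import math

docs = [
    ["pflege", "patienten", "pflege"],
    ["dokumentation", "pflege"],
    ["patienten", "betreuen"],
]

N = len(docs)
vocab = sorted({t for d in docs for t in d})

# DF: number of documents containing the term.
df = {t: sum(t in d for d in docs) for t in vocab}

def tfidf(term, doc):
    tf = doc.count(term)            # raw term frequency
    idf = math.log(N / df[term])    # rarer terms get higher weight
    return tf * idf

# Keyword per document: the term with the maximum TF-IDF weight.
keywords = [max(set(d), key=lambda t: tfidf(t, d)) for d in docs]
print(keywords)  # ['pflege', 'dokumentation', 'betreuen']
```

Note how "pflege" wins in the first document through its raw frequency, while "dokumentation" and "betreuen" win in theirs through rarity across the corpus.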
The final DTM can be used as input to dimensionality reduction techniques or directly to the classification algorithms. The process from text preprocessing to text transformation culminated in the DTM that is depicted in Figure 3.

Illustration of text preprocessing from raw HTML file to document-by-term matrix.
Dimensionality Reduction
Before running classification algorithms on the data, we first investigate which of the features are likely to be most useful for classification. Since the initial features were selected in an ad hoc manner, that is, without reference to specific background knowledge or theory, some of the features may be irrelevant. We therefore applied dimensionality reduction to the DTM.
LSA is commonly applied to reduce the size of the feature set (Landauer et al., 1998). The output of LSA yields new dimensions that reveal underlying patterns in the original features. The new features can be interpreted as new terms that summarize the contextual similarity of the original terms. Thus, LSA partly addresses issues of synonymy and, in some circumstances, polysemy (i.e., when a single meaning of a word is used predominantly in a corpus). In R, the
To illustrate LSA we need additional vacancies. For illustrative purposes we used 11 job vacancies (see the
Documents and terms are projected onto the constructed LSA space in the
The German word
Since our aim is to reduce dimensionality, we project the documents onto the new dimensions. This is accomplished through the corresponding code in CL 8. From the LSA, we obtain a total of 107 new dimensions from the original 1,079 features. It is usually not easy to attach natural language interpretations to the new dimensions. In some scenarios, we can interpret a new dimension by examining the scaled coefficients of the terms on that dimension (much as in PCA). Terms with higher loadings on a dimension have a greater impact on that dimension. Figure 4 visualizes the terms with high numerical coefficients on the first 6 LSA dimensions (see CL 8 for the relevant code). Here we distinguish between terms found to occur in a task sentence (red) or not (blue). In this way, an indication is provided of which dimensions are indicative of each class (note that distinguishing between tasks and nontasks requires the training data, which is discussed in greater detail below).

Loadings of the terms on the first 6 LSA dimensions using 422 sentences from 11 vacancies.
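Under the hood, LSA amounts to a truncated singular value decomposition of the DTM. The following sketch (Python with NumPy, on an invented 4x4 toy matrix; the real example has 1,079 features reduced to 107) shows how documents are projected onto the retained dimensions:

```python
import numpy as np

# Toy document-term matrix (rows = documents, columns = terms).
# Documents 1-2 and 3-4 each share terms only within their pair.
dtm = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 3.0],
])

# LSA = truncated singular value decomposition of the DTM.
U, s, Vt = np.linalg.svd(dtm, full_matrices=False)
k = 2  # number of retained dimensions

# Project documents onto the first k latent dimensions.
doc_coords = U[:, :k] * s[:k]
print(np.round(doc_coords, 2))
```

Documents 1 and 2 (and likewise 3 and 4) land on identical coordinates in the retained dimensions even though their term counts differ, which is exactly the kind of contextual similarity LSA is meant to expose.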
Another approach is to downsize the feature set by eliminating those features that are not (or are less) relevant. Such techniques are collectively called filter methods (Guyon & Elisseeff, 2003). They work by assigning scores to features and setting a threshold whereby features with scores below the threshold are filtered out. Both the DF and the IDF can be used as scoring methods. However, one main disadvantage of the DF and IDF is that they do not use the class membership information in the training data. Including class membership (i.e., through supervised scoring methods) ought to be preferred, as it capitalizes on the discriminatory potential of features (Lan et al., 2009).
For supervised scoring methods, we need to rely on the labels of the training data. In this example, the labels indicate whether a sentence expresses task information (1) or not (0). These labels were obtained by having experts manually label each sentence; for our example, experts manually assigned labels to the 425 sentences. We applied three scoring methods, namely, Information Gain, Gain Ratio, and Symmetric Uncertainty (see CL 12). Due to the limited number of labeled documents, these scoring methods yielded less than optimal results. However, they still managed to detect one feature that may be useful for identifying the class of task sentences, that is, the word sicherstellung.
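For readers who want to see the arithmetic behind one of these scores, here is a small Python sketch of Information Gain for a binary term-presence feature; the toy data are invented and much smaller than the 425 labeled sentences.

```python
import math

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(feature, labels):
    """IG = H(class) - H(class | term present/absent)."""
    base = entropy(labels)
    cond = 0.0
    for value in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == value]
        cond += len(subset) / len(labels) * entropy(subset)
    return base - cond

# 1 = sentence contains the term; labels: 1 = task sentence.
contains_term = [1, 1, 1, 0, 0, 0, 0, 0]
labels        = [1, 1, 0, 0, 0, 0, 0, 0]
print(round(information_gain(contains_term, labels), 3))
```

A feature whose presence perfectly separates the classes would score the full entropy of the class distribution; an uninformative feature scores zero.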
Classification
The reduced matrix from the preceding section can be used as input for the classification algorithms. The output of this step is a classification model that we can then use to automatically classify sentences in new vacancies. We mentioned earlier that reducing dimensionality is an empirically driven decision rather than one guided by specific rules of thumb. Thus, we test whether the new dimensions lead to an improvement in performance over the original set by running separate classification algorithms, namely support vector machines (SVMs), naive Bayes, and random forest, on each set. These three have been shown to work well on text data (Dong & Han, 2004; Eyheramendy et al., 2003; Joachims, 1998).
Accuracy is not a good performance metric in this case since the proportion of task sentences in our example data is low (less than 10%). The baseline accuracy (computed from the model that assigns all sentences to the dominant class) would be over 90%, which is high and thus difficult to improve upon. More suitable performance metrics are the F-measure (Ogura et al., 2011; Powers, 2011) and balanced accuracy (Brodersen, Ong, Stephan, & Buhmann, 2010). We use these two measures here since the main focus is on the correct classification of task sentences and we also want to control for misclassifications (nontask sentences put into the task class or task sentences put into the nontask class).
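To make the two measures concrete, here is a short numerical sketch in Python; the counts are invented for illustration and are not from the actual study.

```python
def scores(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called sensitivity
    specificity = tn / (tn + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    balanced_accuracy = (recall + specificity) / 2
    return f_measure, balanced_accuracy

# Imbalanced example: 10 task sentences among 100, the classifier
# finds 6 of them and raises 2 false alarms.
f, ba = scores(tp=6, fp=2, fn=4, tn=88)
print(round(f, 2), round(ba, 2))
```

Plain accuracy here would be (6 + 88)/100 = 0.94, barely above the 0.90 all-nontask baseline, while the F-measure of 0.67 and balanced accuracy of 0.79 expose the classifier's difficulty with the rare task class.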
To assess the generalizability of the classifiers, we employed 10 times 10-fold cross-validation; we repeated 10-fold cross-validation 10 times because of the limited training data. We use one part of the data to train a classifier and test its performance by applying the classifier to the remaining part and computing the F-measure and balanced accuracy. For the 10 times 10-fold cross-validation, we performed 100 runs for each classifier using the reduced and original feature sets. Hence, for the example we ran about 600 trainings, since we trained 6 classifier configurations in total. All performance results reported are computed on the test sets (see CL 10).
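The splitting logic of 10 times 10-fold cross-validation can be sketched as follows (plain Python; CL 10 uses R facilities for the same purpose). With 3 algorithms and 2 feature sets, the 100 splits yield the roughly 600 trainings mentioned above.

```python
import random

def repeated_kfold(n, k=10, repeats=10, seed=1):
    """Yield (train_indices, test_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)          # fresh random partition per repeat
        for fold in range(k):
            test = idx[fold::k]   # every k-th shuffled index forms one fold
            test_set = set(test)
            train = [i for i in idx if i not in test_set]
            yield train, test

splits = list(repeated_kfold(n=425, k=10, repeats=10))
print(len(splits))  # 100 train/test splits per classifier configuration
```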
From the results we see how classification performance varies with the choice of features, classification algorithms, and evaluation measures. Figure 5 presents the results of the cross-validation. Based on the F-measure, random forest yielded the best performance using the LSA-reduced feature set. The highest F-measure obtained is 1.00 and the highest average F-measure is 0.40, both from random forest. SVM and naive Bayes have roughly the same performance. This suggests that, among the three classifiers, random forest is the best classifier when using the LSA-reduced feature set and the F-measure as the evaluation metric. If we favor the correct detection of task sentences and want a relatively small dimensionality, then random forest should thus be favored over the other methods. When using the original features, SVM and random forest exhibit comparable performance. Hence, with the F-measure and the original feature set, either SVM or random forest would be the preferred classifier. The low values of the F-measures can be accounted for by the limited amount of training data: each fold contains only about 3-4 task sentences, so a single misclassification of a task sentence leads to a sizeable reduction in precision and recall, which in turn results in a low F-measure value.

Comparison of classification performance among three classifiers and between the term-based and LSA-based features.
When balanced accuracy is the evaluation measure, SVM and random forest consistently yield similar performance using either the LSA-reduced feature set or the original feature set, although random forest yielded slightly higher performance than SVM on the LSA-reduced feature set. This suggests that for balanced accuracy with the original features one can choose between SVM and random forest, whereas with the LSA feature set random forest is to be preferred. Moreover, notice that the numerical values for balanced accuracy are higher than those for the F-measure. Balanced accuracy can be inflated by the accuracy of the dominant class, in this case the nontask class.
This classification example reveals the many issues that one may face in building a suitable classification model. First is the central role of features in classification. Second is how to model the relationship between the features and the class membership. Third is the crucial role of choosing an appropriate evaluation measure or performance metric. This choice should be guided by the nature of the problem, the objectives of the study, and the amount of error we are willing to tolerate. In our example, we assign equal importance to both classes, and we therefore have slight preference for balanced accuracy. In applications where the misclassification cost for the positive class is greater than that for the other class, the F-measure may be preferred. For a discussion of alternative evaluation measures see Powers (2011).
Other issues include the question of how to set a cutoff value for the evaluation measure to judge whether a model is good enough. A related question is how much training data are needed for the classification model to generalize well (i.e., how to avoid overfitting). These questions are best answered empirically through systematic model evaluation, such as by trying different training sizes and varying the threshold, and then observing the effect on classifier performance. One strategy is to treat this as a factorial experiment where the choices of training size and evaluation measures are considered as factor combinations. In addition, one has to perform repeated evaluation (e.g., cross-validation) and validation. Aside from modeling issues there are also practical concerns such as the cost of acquiring training data and the interpretability of the resulting model. Classification models with high predictive performance are not always the ones that yield the greatest insight. Insofar as the algorithm is to be used to support decision making, the onus is on the researcher to be able to explain and justify its workings.
Classification for Job Information Extraction
For our work on job task information extraction, three people hand labeled a total of 2,072 out of 60,000 sentences. It took a total of 3 days to label, verify, and relabel the 2,072 sentences. Of this total, 132 sentences were identified as task sentences (note that the task sentences were not unique). The proportion of task sentences in the vacancy texts was thus only about 6%, which means that the resulting training data are imbalanced. One reason is that not all tasks that are part of a particular job are written in the vacancies, likely only the essential and more general ones, which partly explains their low proportion.
Since labeling additional sentences would be costly and time-consuming, we employed a semisupervised learning approach called label propagation (Zhu & Ghahramani, 2002). For the transformation and dimensionality reduction we respectively constructed the DTM and applied LSA. Once additional task sentences were obtained via semisupervised learning, we ran three classification algorithms, namely, SVM, random forest, and naive Bayes. Instead of choosing a single classifier, we combined the predictions of the three in a simple majority vote. As the evaluation measure we used recall, since we wanted to obtain as many task sentences as possible. Cross-validation was used to assess the generalization property of the model. The application of classification resulted in the identification of 1,179 new task sentences. We further clustered these sentences to obtain unique nursing tasks, since some sentences pointed to the same tasks.
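The majority vote over the three classifiers is straightforward to implement; a minimal Python sketch follows, with invented prediction vectors for illustration.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier prediction lists by simple majority vote."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

# Hypothetical predictions (1 = task sentence) for five sentences.
svm_pred = [1, 0, 0, 1, 0]
rf_pred  = [1, 1, 0, 0, 0]
nb_pred  = [0, 1, 0, 1, 0]
print(majority_vote([svm_pred, rf_pred, nb_pred]))  # [1, 1, 0, 1, 0]
```

With an odd number of classifiers and two classes, ties cannot occur, which is one reason simple voting works well for three-member ensembles.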
Model Reliability and Validity
We set out to build a classification model that can extract sentences containing nursing tasks from job vacancies. Naturally, a subsequent step is to determine whether the extracted task sentences correspond to real tasks performed by nurses. One approach to establishing construct validity is to use an independent source to examine the validity of the classification. Independent means that the source should be blind to the data collection activity, the initial labeling procedure, and the model building process. Moreover, in case ratings are obtained, these should be provided by subject matter experts (SMEs), that is, individuals who have specialist knowledge of the application domain. If found to be sufficiently valid, the extracted sentences containing job tasks may then be used for other purposes, such as job analysis, identifying training needs, or developing selection instruments.
We enlisted the help of SMEs, presented them with the task sentences predicted by the text classifier, and asked them to check whether the sentences are actual nursing tasks or not, so as to be able to compute the precision measure. Specifically, we compute precision as the ratio of the number of task sentences confirmed as actual nursing tasks to the total number of task sentences predicted by the model. We then reran the classification algorithm in light of the input from the experts, where the input is data containing the correct labels of sentences that were misclassified by the classifier. We performed this in several iterations until there was no significant improvement in precision. This is necessarily an asymmetric approach since we use the expert knowledge as the “ground truth.”
A more elaborate approach would be to compare the extracted tasks from vacancies to tasks collected using a more traditional job analysis method, namely a task inventory. The task inventory would consist of interviews and observations with SMEs to collect a list of tasks performed by nurses. Based on this comparison, a percentage of tasks would be found in both lists, a percentage of unique tasks would only be found in the task inventory, and a percentage of unique tasks would only be found in the online vacancies. A high correspondence between the list of tasks collected by text mining and the list of tasks collected in the task inventory (which would be considered to accurately reflect the nursing job) could be taken as evidence for convergent validity. Conversely, one could establish discriminant validity through a very low correspondence with so-called bogus tasks that are completely unrelated to the nursing job.
We applied the less elaborate approach by first training a classification model, making predictions using the model, and presenting the task sentences to an SME. The expert judged whether the sentences are actual nursing tasks or not. The precision measure was used to give an indication of the validity of the model. The first round of validation resulted in a precision of 65% (precision range: 0% to 100%), and we found that some of the initial labels we assigned did not match the labels provided by the independent expert (that is, some of the initial labels were judged to be erroneous by the expert). In light of this, we adjusted the labels and conducted a second round of validation, in which precision increased to 89%. This indicates that the validity of the classification model improved. A total of 91 core tasks were validated. Table 3 contains validated tasks under the basic care and medical care clusters. In practice, it is difficult to obtain 100% precision since forcing a model to achieve high precision comes at the expense of sacrificing its recall. High precision and low recall imply that many task sentences may be dismissed, though we can put more confidence in the sentences that are labeled as tasks. As a last note, TC models are seldom static; as new documents arrive, we have to continually assess the performance of the model on new observations and adjust the model if there is significant degradation in performance.
Basic Care and Medical Care Core Nursing Tasks Extracted From Nursing Vacancies by Applying Text Classification.
Conclusion
This article provided an overview of TC and a tutorial on how to conduct actual TC on the problem of job task information extraction from vacancies. We discussed and demonstrated the different steps in TC and highlighted issues surrounding the choices of features, classification algorithms, and evaluation metrics. We also outlined ways to evaluate and validate the resulting classification models and the predictions from these models. TC is an empirical enterprise where experimentation with choices of representation, dimensionality reduction, and classification techniques is standard practice. By building several classifiers and comparing them, the final classifier is chosen based on repeated evaluation and validation. Thus TC is not a linear process; one has to revisit each step iteratively to examine how the choices in each step affect succeeding steps. Moreover, classifiers evolve in the presence of new data. TC is a wide research field and there are many other techniques that were not covered here. An exciting new area is the application of deep learning techniques for text understanding (for more on this we refer the reader to Maas et al., 2011; Mikolov, Chen, Corrado, & Dean, 2013; X. Zhang & LeCun, 2015).
TC models are often descriptive as opposed to explanatory in nature, in the sense that they capture patterns of features and inductively relate these to class membership (Bird, Klein, & Loper, 2009). This contrasts with explanatory models, whose aim is to explain why the pattern in the features leads to the prediction of a class. Nevertheless, the descriptive work can be of use for further theory building too, as the knowledge of patterns can serve as a basis for the development of explanatory models. For example, in the part about feature selection we found that the word sicherstellung (to guarantee or to make sure) is useful in detecting sentences containing nursing tasks. Based on this we can define the concept of a “task verb,” that is, a verb that is indicative of a task in the context of a job vacancy. We could then compile a list of verbs that are “task verbs” and postulate that task verbs pair with noun or verb phrases to form task sentences. Further studies could then be designed to validate this concept and establish the relationship between features and patterns. In this way, we not only detect patterns but also attempt to infer their properties and their relationship to class membership.
Whether a descriptive model suffices or whether an explanatory model is needed depends on the objectives of a specific study. If the objective is accurate and reliable categorization (e.g., when one is interested in using the categorized text as input to other systems) then a descriptive model will suffice although the outcomes still need to be validated. On the other hand, if the objective is to explain how patterns lead to categorization or how structure and form lead to meaning then an explanatory model is required.
In this article we tried to present TC in such a manner that organizational researchers can understand the underlying process. However, in practice, organizational researchers will often work with technical experts to make choices on the algorithms and assist in tweaking and tuning the parameters of the resulting model. The role of organizational researchers then is to provide the research questions, help select the relevant features, and provide insights in light of the classification output. These insights might lead to further investigation and ultimately to theory development and testing.
Finally, we conclude that TC offers great potential to make text-based organizational research fast, reliable, and effective. The utility of TC is most evident when there is a need to analyze massive text data; in some cases TC is even able to recover patterns that are difficult for humans to detect. Otherwise, manual qualitative text analysis procedures may suffice. As noted, the increased use of TC in organizational research will likely not only contribute to organizational research but also advance TC research itself, because real problems and existing theory can further stimulate the development of new techniques.
