Introduction
Interest in the problem of classifying and coding occupations can be traced back at least as far as the 1850 population census, and in subsequent decades it became a major focus of attention by census officials on both sides of the Atlantic (Conk, 1978; Woollard, 1999). Reasons for this are not hard to find. Whether it was tracking the overall progress of industrial society, urban industrial specialization, the social mobility of immigrants or finding surrogate measures of wealth and social class, occupational information was one of the most valuable tools available to census statisticians and officials in other government agencies (Edwards, 1933). However, the quest for a satisfactory system of classification for the U.S. census was still in progress at the start of the 20th century (Hunt, 1909).
Much more recently, the growing availability in digital form of large historical population data sets containing individual-level data, anonymized or otherwise, has rekindled interest within several academic disciplines in the seemingly rather dry topic of occupational classification. However, this same availability of large data collections, which are now open to systematic evaluation using powerful database technologies in ways that were infeasible until relatively recently, has begun to raise a number of questions. These relate not only to 19th century census enumeration practices and the published statistics based on the resulting figures but also to the validity, reliability, and “fitness-for-purpose” of electronic coding, both of occupational data transcribed from manuscript census schedules and of other measures derived partly or wholly from these data. It is important that such questions are examined sooner rather than later, as funding bodies are increasingly relying on the availability of large secondary data sets, as part of a drive for efficient use of public monies for research. Yet, in most cases, much less effort has been expended to date in determining the quality of these data sets and their suitability for different types of analyses than would ideally be the case, given that they are intended to form an accepted part of general research infrastructure internationally.
It is important to stress at the outset that this is not primarily a criticism of the leading U.S. and European data archives and centers, such as the member organizations of the North Atlantic Population Project (NAPP), which have undertaken sterling work in making large demographic data sets accessible to researchers (Minnesota Population Center, 2008). Rather, a distinction needs to be made between four potential sources of problems in these data sets, two of which relate to the original manual collection and processing and two to the much later phase of conversion into digital form. The first derives from inherent shortcomings in the collection of the original data, while the second relates to the methodology used for the subsequent manual classification/tabulation, whose end products were the published census tabulations. The third involves transcribing and data entry errors in the digitization process and the fourth is the digital equivalent of the manual classification problem, namely, how data fields within the digital census records are to be coded accurately and consistently. A further question, which has been examined in the course of several large projects, is how coding consistency can be extended beyond national boundaries to encompass international comparisons, although this topic goes beyond the scope of the present discussion.
As there are numerous data fields, even in late 19th century census records (the digital version of the U.S. 1880 census has about 90 fields, for example), a detailed examination of these four types of problems in relation to each data element in turn would be a major undertaking. The focus of attention here will therefore be restricted to an examination of quality issues surrounding the coding and classification of occupational and related industrial categories, and how they can be addressed, as these are among the key types of information utilized by researchers (e.g., Hirschman & Mogford, 2009; Sarkar, 2009). The topic will be further restricted mainly to consideration of the U.S. 1880 census because it is the only complete count census available for that country in digital form and thus it is becoming widely used as a reference point for all types of historical demographic analysis, even when earlier and later sample census data sets are also used in combination with it (Ruggles et al., 2010; Sobek & Dillon, 1995). That said, questions have been raised about the possible manipulation of occupational data, at the enumeration and processing stages, for the young, the elderly, and married women in this census (Carter & Sutch, 1996). It is also recognized that the 1880 time point was part of an extended process of “learning by doing” in the planning and execution of decennial census-taking, so it must also be set within this broader context.
This article begins by identifying a number of problems inherent in the approach to coding of occupations adopted by NAPP for the 1880 census, and by extension for the earlier and later census samples that project has also made available. These problems point to the requirements for a new coding system that removes the limitations identified. The main body of the article explains the design and implementation of this system, which provides an operational basis for commencing the long-term and difficult process of re-coding industry sector codes in historical population censuses, a process that must necessarily be undertaken using non-census sources. In the process, it will also be made clear that the new system can be used to classify and standardize employment and occupational data from non-census sources independently of, and in addition to, its deployment in support of future work on re-coding of industry sector information in large census data sets.
Occupational and Industrial Sector Coding in the 1880 Census
At the outset, some examples serve to indicate why such a study needs to be undertaken. The first relates to the U.S. railroad sector, a very important contributor to the processes of 19th century industrialization (Chandler, 1965; Vance, 1995). For 1880, there are two independent sources of railroad employment data. The first is the person-level records from the population census, where individuals could identify themselves as working in a railroad-related occupation (U.S. Census Office, 1883a). The second is a quite different special report on transportation, where each railroad company was asked to notify the Census Bureau of the total number of people in its employment (U.S. Census Office, 1883b). Although a few local short-line railroads in isolated areas doubtless escaped enumeration, all lines of any substance could be identified relatively easily, so data coverage was quite comprehensive. As the companies are all named in the report, it is also easy now, as it was then, to check the figures against other reports made to State Railroad Commissioners in the different states. Such checks suggest that considerable confidence can be placed in the reported figures, as state officials with local knowledge would likely have been able to identify any attempts at systematic misrepresentation in the data. The final employment total from the Census special report gives 418,957 workers in the railroad sector.
In the digital 1880 population census from NAPP, the original transcribed text strings describing occupations associated with individual records have been standardized and coded using a variant of the standard Historical International Standard Classification of Occupations (HISCO) scheme (explained further below) into many hundreds of numerical categories represented by the US80A_OCC variable. A second coding of occupations places them on a 1950 basis using the variable US80A_OCC50US, to try to provide a consistent classification across multiple censuses, although this particular variable will not be examined further here. Of the US80A_OCC variable values, 18 codes refer directly to different aspects of steam railroad work and a further three doubtless include railroad employees, but may also overlap with horse-drawn street railroad employment, for example, code 36010 (unspecified conductors). Counting the records assigned to the 18 codes across the entire census yields a total of 237,480 workers. Adding in the three less specific categories raises the total to 251,490. The first figure very closely matches that of 236,058 for 1880 given by Edwards (1943, p. 109) in his classic article on occupational trends in the U.S. census, though there is no documentation on exactly how that specific figure was obtained. This does, however, suggest that the recently developed digital coding system for occupations closely reproduces earlier manual findings. This would support the view that present-day transcription and coding have not introduced any significant new sources of error, a similar finding to that reported by Woollard for work on the historical censuses of the United Kingdom (Woollard, 1999). That said, it is apparent that the railroad employment total based on the 18 codes is only 56.7% of the total from the 1880 Special Report and adding the other three codes has relatively little effect.
Despite the substantial difference in these two totals, there is no clear evidence from the literature that this rather important disparity has ever been noticed or made the subject of further investigation.
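The size of the gap can be checked directly from the totals quoted above. The short Python sketch below is illustrative only: the function and variable names are assumptions introduced here, while the figures are those given in the text.

```python
# Benchmark: total railroad employment from the 1880 special report on
# transportation (U.S. Census Office, 1883b), as quoted in the text.
SPECIAL_REPORT_TOTAL = 418_957

# Counts derived from the NAPP occupation variable, as quoted in the text.
census_counts = {
    "18 railroad-specific occupation codes": 237_480,
    "18 codes plus 3 less specific codes": 251_490,
}


def coverage_pct(count: int, benchmark: int = SPECIAL_REPORT_TOTAL) -> float:
    """Return `count` as a percentage of the benchmark total."""
    return 100 * count / benchmark


for label, count in census_counts.items():
    shortfall = SPECIAL_REPORT_TOTAL - count
    print(f"{label}: {count:,} workers, "
          f"{coverage_pct(count):.1f}% of benchmark, shortfall {shortfall:,}")
```

Expressing each census-based count as a share of the establishment-based total makes the comparison independent of the absolute size of the sector, so the same check could be repeated for other sectors where an independent benchmark exists.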
One possible approach to resolving the problem would initially appear to be to use another variable from the 1880 NAPP data set, namely, the industry classification (variable US80A_IND50US), though this classification is also on a 1950 basis for comparative purposes (cf. Ronnander, 1999). This variable has a code (506) for “railroads and railway express service,” which has already been used in the published literature as part of a comparison of employment changes in industrial sectors over time, although in this case the IPUMS census samples, which use the same industrial codes as the full NAPP data set, were used (Hirschman & Mogford, 2009; Ruggles et al., 2010). The industry code only identifies 266,659 railroad employees in 1880. This is a modest increase over the occupation code count, but it is less apparent what the derivation of this figure is, as it does not correspond to the earlier calculations by Edwards noted above. Although neither the NAPP documentation nor the standard reference on NAPP occupational coding makes this clear (Roberts, Woollard, Ronnander, Dillon, & Thorvaldsen, 2003), the industry code is necessarily very largely imputed from the occupation data by the NAPP project and is not an independent and additional source of data on individuals, as there is no column in the manuscript census schedules for industrial sector. The effect of this imputation can be seen by cross-referencing the occupational and industrial codes attached to individual records. Take the example of five industrial states (Pennsylvania, etc.), which have a total of 61,616 individuals given industry code 506. Of these, almost 88% have one of the 18 railroad occupation codes, a figure that rises to 91% if the three less specific categories are included. Working in the other direction, nearly 99% of individuals with one of the 18 occupation codes have an industrial code of 506, or nearly 97% if the wider definition is used.
Thus, in the vast majority of cases, the industry code provides no additional information over the occupation code. Judging by detailed examination of the original transcribed occupation text strings (which somewhat negates the value of having a code), the limited number of cases where the industry code does provide additional information reflect situations where additional non-standard text in the string in question allowed a more precise industrial sector attribution to be made. For example, a worker might be described as “boilermaker in the B&O shops,” which would identify him as a Baltimore and Ohio Railroad employee for the industry code, but under occupation he would be standardized to just “boilermaker.” The data coders have thus endeavored to make maximum use of any data present in the census. Despite this painstaking work, still only 63.6% of the railroad workforce (266,659 of 418,957) can be identified on an individual basis, leaving in excess of 150,000 workers unaccounted for in this industrial sector alone. Similarly problematic findings have been reported for the anthracite coal mining sector, a large employer in Pennsylvania, though in this case, use of occupational data gave better results than the industry variable (Healey, 2011).
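The two-way cross-referencing of occupation and industry codes described above can be sketched as a pair of conditional proportions. The records below are invented toy data, not NAPP records: "RR1" and "RR2" are placeholders standing in for the 18 railroad-specific occupation codes, and the function names are assumptions made for illustration. Only the industry codes 506 and 817 are taken from the text.

```python
RAILROAD_INDUSTRY = 506                 # "railroads and railway express service"
RAILROAD_OCCUPATIONS = {"RR1", "RR2"}   # hypothetical stand-ins for the 18 codes

# Toy (occupation_code, industry_code) pairs, invented for illustration.
records = [
    ("RR1", 506), ("RR1", 506), ("RR1", 506), ("RR2", 506),
    ("RR2", 817),           # railroad occupation, non-railroad industry code
    ("BOILERMAKER", 506),   # generic trade whose text string named a railroad
    ("BLACKSMITH", 817), ("BLACKSMITH", 817),
]


def has_rail_occupation(row):
    return row[0] in RAILROAD_OCCUPATIONS


def has_rail_industry(row):
    return row[1] == RAILROAD_INDUSTRY


def share(rows, condition, given):
    """Proportion of the rows satisfying `given` that also satisfy `condition`."""
    base = [row for row in rows if given(row)]
    return sum(1 for row in base if condition(row)) / len(base)


# Of records with the railroad industry code, how many have a railroad occupation?
occ_given_ind = share(records, has_rail_occupation, has_rail_industry)
# Of records with a railroad occupation, how many have the railroad industry code?
ind_given_occ = share(records, has_rail_industry, has_rail_occupation)
print(f"{occ_given_ind:.0%} / {ind_given_occ:.0%}")
```

When both proportions are close to 100%, as they are in the five-state example, the two variables carry almost the same information, which is the signature of one having been imputed from the other.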
This earlier study traced the source of the discrepancies in the employment counts to the distinction between workers in generic occupations, such as blacksmiths and machinists, and those in industry-specific occupations, such as coal miners or railroad brakemen. The large numerical impact of these discrepancies does not appear to have been recognized in previous studies devoted to the problem of occupational coding. In general, industry-specific occupations were quite accurately recorded in the population census, so industry sector can usually be imputed correctly for these workers. However, the vast majority of generic workers in 1880 did not give census enumerators details of the industrial sector in which they worked, so it is not possible to impute the industry correctly for these individuals.
A brief analysis of the occupational/industry code combinations in the entire 1880 data set makes clear how serious a problem this is. Taking the case of blacksmiths, 99% of the 177,193 individuals are given an industry code of 817 for “miscellaneous repair services.” Only 373 blacksmiths are coded as railroad employees, 142 to different iron and steel related codes, and a mere five to coal mining. Such figures are entirely incorrect and extremely misleading. For example, a single anthracite mine (the Diamond) out of more than 100 in one of the four anthracite coalfields, employed four blacksmiths and three blacksmith’s helpers in mid-1880 (Diamond Payroll, 1880), so across both the anthracite and bituminous mining sectors, thousands of blacksmiths would have been employed and the same would have applied to the other sectors named above. For machinists, the situation is similar. Of 97,424 workers in this occupation, 97.3% were coded to the mysterious category of “miscellaneous machinery” (code 358) and only 1,177 or 1.2% to the railroad sector. However, the 1880 census special report has a specific breakdown of numbers of machinists (unlike blacksmiths), and it states that 22,766 of them worked for the railroads across the United States (U.S. Census Office, 1883b). This means that more than 21,000 of this sub-group are misclassified. Likewise, the Mine Inspectors’ reports for 1880 record 813 “outside mechanics” employed at mines in the northern anthracite field (calculated from data in Inspectors of Mines, 1880). While this definition may not exactly equate to machinists, according to the NAPP data set there were no machinists at all who worked in coal mining.
Prior to 1880, the enumerators’ instructions were no more specific than in 1880, despite a minor caution against using “machinist” if a more precise description could be given (U.S. Census Office, 1870, p. 14). After 1880, the instructions for 1890 and 1900, which are largely the same in both years, suggested in places the need for accurate qualification of job titles, by means of isolated examples such as “railroad laborer” or “carriage blacksmith,” but there is no clear recognition of the systematic need to distinguish generic from industry-specific occupations. This can be seen in the complete omission of generic trades (as opposed to laborers) from the list of steam railroad occupations (U.S. Census Office, 1900, pp. 32, 36). As a result, the same problems with underreporting are to be expected both in the earlier and later censuses. This is unequivocally demonstrable in the case of 1890, as there was also a special report on transportation in this year (though not in 1900). The special report gives a figure of 750,017 employees, but the Edwards Report only counts 462,213 (Edwards, 1943, p. 109; U.S. Census Office, 1895, p. 130). This special report figure for 1890 is also greatly in excess of Edwards’s 1900 figure of 582,150, at a time when railroad employment was still expanding, so the underreporting issue was still not resolved by this date. Although attention has been focused on generic skilled tradesmen in the previous examples, unskilled general laborers also make a substantial contribution to the overall problem, because of their comparatively large numbers, and the likelihood that their industrial sector was also not recorded by the enumerators. Further to this, laborers in irregular employment, say in railroad construction, may well not have identified themselves as part of the railroad industry, even if specifically questioned to that effect.
The much wider implication of these problems is that the NAPP/IPUMS 1% sample census data sets for other census years after 1850 (excluding 1890) are also subject to the same difficulties of interpretation and inaccurate assignment of sample individuals to industrial sectors. Any research findings based on these specific industrial codes may therefore be very much in error, and these errors are unlikely to be consistent between different industrial sectors. The potential impact on studies of inter-sectoral mobility is substantial. The same applies to inter-censal analyses of changing occupational/industrial structure based on aggregate statistics, or of detailed occupational mobility, based on linked samples derived from NAPP or IPUMS data sets. This is apparent, because there is no means of determining from census records alone, whether occupational information about given linked individuals was recorded in the same way in successive censuses, so workers may appear to be railroad employees in one census but not the next, when their employment status did not actually change. Further to this, sampling from undifferentiated occupational groupings, when those same groupings actually contain different sub-populations of individuals in different industrial sectors, may be a source of concealed bias in statistical studies. For example, it has already been shown that railroad machinists and blacksmiths in Baltimore in 1860 had different socio-demographic characteristics than their non-railroad counterparts (Healey, Thomas, & Lahman, 2013). It is therefore most important that these coding issues are more widely discussed and analyzed, to prevent inappropriate analyses being undertaken that generate misleading or false results. 
A further inference is that historical census data sets, standardly viewed as “givens” for secondary data analysis, should more accurately be viewed as “works in progress,” resources whose data quality needs to be enhanced progressively over time, by means of comparison with other sources, to increase the confidence that can be placed in analytical results derived from them. This is not a welcome finding for research funding bodies, who would doubtless have wished that researchers could capitalize on their past investments in large data sets without the need for ongoing expenditure on quality improvement. It also leaves some individual researchers in a quandary, as it is now clear that the coded data presently available cannot support certain types of analyses that would previously have been deemed viable. They can either restrict the scope of their work (e.g., by avoiding use of industrial sector codes) or shoulder the rigorous additional burden of making the required data quality enhancements using non-census sources. While the latter may be a feasible strategy for well-resourced work with limited geographical coverage, it is infeasible for individuals wishing to engage in larger-scale studies. Also, in the absence of any agreed approach to the use of non-census sources or how any re-coding might be undertaken, there is serious risk of incompatibilities quickly arising between studies, which will greatly hinder future comparative work. Where studies only make use of very broad occupational categories (e.g., Ferrie, 2005), the impact of these detailed problems may be lessened, but it can no longer be assumed that they do not exist.
Requirements for a New Coding System
To address this unwelcome situation in a systematic manner, a new approach is required to the problem of quality enhancement of existing historical census data sets, such as the NAPP 1880 census. This involves five steps. The first is to provide an overview of the main types of non-census sources that may eventually contribute to the re-coding process. The second is to evaluate what new developments, in terms of coding capabilities, are required to mesh together census and non-census sources. The third is to identify a suitable computational methodology or methodologies that will support these new capabilities. The fourth is to identify operational considerations that could facilitate the take-up of new coding system capabilities. The fifth is to outline future possibilities for systematic re-coding projects (e.g., of specific industrial sectors) of sufficient substance to demonstrate unambiguously the full nature and extent of the data quality problems for the sectors in question, and to provide guides to assist subsequent projects aimed at other sectors. The main emphasis of the present discussion will be on the first three of these steps, followed by a brief commentary on the remaining two, the implementation of which lies, at least in part, in the future.
The first question to address is which other non-census sources are available to assist with census (re-)coding. A range of these can be identified in the U.S. context, but they vary widely in their temporal, geographical, and sectoral coverage and indeed their degree of comprehensiveness, even for specific locations and time points. Among the most obvious candidates are city directories, company payrolls, marriage and death records, and naturalization records. Less obvious candidates would include the harrowing industrial accident records found in state railroad commission reports and mine inspectors’ reports. While space precludes a detailed survey of these sources, several brief comments serve to highlight relevant issues. The census has the enormous advantage of relative geographical comprehensiveness over a broadly comparable time interval (the concept of a precise census date was not well-developed in earlier years), and provides information on age, family, and household status, occupation and birthplace. Marriage and death records will provide a subset of this information possibly with links to parental names. Company records, such as payrolls, being employment-focused, lack much of this information, including age-related data (though this may be found in employee card indexes). However, this is offset by the detailed work history information they contain. Directories, though largely confined to urban areas, have varying degrees of comprehensiveness for the populations they served, lack age or family data, but provide addresses and often contain valuable employment-related information for multiple time-slices falling between census years. The potential research benefits of being able to combine data about individuals over time and space from these and other relevant sources are easy to see, though the practical problems of achieving the required data linkage in a reliable manner may be quite another matter.
To examine some of these sources in more detail, experience with city directories, for example, suggests they are most informative for occupational purposes in the 1850s to 1870s, rather than in later years, and that the larger the city, the less informative they are, owing to pressure on space in individual volumes. “More informative” here means more likely to provide not only an occupation for each individual, but also an industrial sector or even a specific manufacturing establishment or company department (e.g., foreman of the car repair shop of a specific named railroad). Comprehensiveness of population coverage probably increased over time, as directory compilers became more organized and better funded, though systematic studies of this are largely lacking (Goldstein, 1954). Payrolls are usually much more detailed, though far more sporadic in space and time. Thus, only a small fraction of 19th century anthracite mines have surviving payrolls, and regrettably even fewer railroads, but the documents that do survive reveal much finer job sub-divisions than “coal miner” or “railroad hand.” They may also indicate the department of the company in which employees worked, and provide information on how they were paid (piece-work or hourly) and the regularity of work over shorter or longer periods, depending on the length of surviving records. The clear advantage of payroll records, and indeed industrial accident records, is that, because the information is firm specific, they are guaranteed to address the problem of identifying generic workers in specific industrial sectors at particular dates. This is not standardly the case for city directories, although some early volumes do contain a good deal of the requisite information.
Setting aside questions about relative ease of processing of printed versus manuscript sources, and the major topic of nominal record linkage between different sources (for a review, see Winkler, 2006), which are beyond the present scope, key requirements for an occupational coding system that facilitates re-coding of census records using non-census data can now be identified, based on the range of information that may be available in different types of non-census sources. First, and most importantly, the system must enable workers in generic occupations to be “tagged” with their specific industrial sector, where known. Second, it should extend beyond the rather general occupational categories favored by census enumerators and “genealogical” sources, such as marriage records, to encapsulate the greater range of employment information provided by payrolls, industrial accident records, and many early directories. This information includes detailed job titles, major and minor sub-divisions within companies and whether employees were engaged in construction work or activity related to production/operation. Thus, the system should be able to distinguish blacksmiths involved in railroad construction from those employed in the operation of rolling mills in the iron industry. Such distinctions are impossible to make with the version of the HISCO coding system used for the U.S. 1880 census and the samples from earlier and later censuses. This is the main reason why a new system is required. Further to this, however, if non-census sources are to be used, it is sensible to abstract as much relevant information on employment structure from them in a single pass as possible, to avoid the need to keep referring back to them for more detailed information. In this sense, the resulting occupational/industrial code then serves as a kind of employment structure index to the archival source (e.g., a payroll), in addition to its main function as a classificatory device. 
This proves to have wider implications, as will be seen below. At the same time, the wide usage of the HISCO system means that backward compatibility with it should also be provided by the new system. Relating detailed job titles to the more general HISCO categories also obviates the necessity for a separate look-up table of individual titles. Finally, unlike some of the older systems developed in the pre-Internet era, it will be assumed that the system can take full advantage of a range of readily available digital and database-related technologies, including Web connectivity.
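The first two requirements set out in this section, tagging generic occupations with an industrial sector where known and recording finer employment detail from non-census sources, can be illustrated with a minimal record structure. This is a sketch under stated assumptions, not the coding system described in this article: the class, field names, and sector labels are hypothetical and chosen only to mirror the blacksmith example in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EmploymentRecord:
    """Hypothetical record layout; all field names are illustrative."""
    job_title: str                         # title as written in the source
    generic_occupation: str                # census-style occupational category
    industry_sector: Optional[str] = None  # None when the source is silent
    division: Optional[str] = None         # company sub-division, if recorded
    activity: Optional[str] = None         # "construction" or "operation"


def distinguishable(a: EmploymentRecord, b: EmploymentRecord) -> bool:
    """True if two workers in the same generic trade can be told apart
    by industrial sector, division, or type of activity."""
    same_trade = a.generic_occupation == b.generic_occupation
    context_a = (a.industry_sector, a.division, a.activity)
    context_b = (b.industry_sector, b.division, b.activity)
    return same_trade and context_a != context_b


# A census-style entry carries only the generic trade.
census_entry = EmploymentRecord("blacksmith", "blacksmith")
# Entries enriched from firm-specific sources such as payrolls.
rail_smith = EmploymentRecord("blacksmith", "blacksmith",
                              industry_sector="railroad",
                              activity="construction")
mill_smith = EmploymentRecord("blacksmith", "blacksmith",
                              industry_sector="iron",
                              division="rolling mill",
                              activity="operation")

print(distinguishable(rail_smith, mill_smith))    # the two contexts differ
print(distinguishable(census_entry, census_entry))
```

Keeping the sector, division, and activity as optional fields reflects the point that census entries for generic workers usually leave them empty, whereas payrolls and accident records can fill them in.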
Focus of the New Coding System
As Wrigley has correctly observed, there is no right or wrong in terms of coding systems, but each will have a particular class of problems to which it is especially well suited (Wrigley, n.d.). In his case, Wrigley adopted a focus on the distinction between primary, secondary, and tertiary (PST) sectors in the economy, because of a particular interest in the changing relative importance of these sectors over the long term as the Industrial Revolution progressed. Herschberg, in contrast, seems to have sought an all-encompassing census coding system, with a certain focus on industrial sectors, though his precise aims are not very clearly articulated, and there is no recognition of the problems caused by lack of specificity in the occupational/industrial sector enumeration of generic workers (Herschberg, 1976). In the present system, a particular, though not exclusive, focus is on the question of occupational and geographical mobility of heavy industrial workers. This has many facets, as workers can change jobs within and between industrial concerns in the same sector or utilize existing generic skills in new ways by changing industrial sector, for example, a machinist moving from the mining to the railroad industry. Such occupational movements, which may be upward, horizontal, or occasionally downwards, in terms of the job and remuneration hierarchy, may or may not be accompanied by geographical mobility. Changes of location in pursuit of career advancement have been characterized as a major feature of the “American Dream” (cf. Ferrie, 1995, 1999) and thus the problem of mobility touches on a wide range of debates about immigration and 19th century economic growth (Thomas, 1973).
A focus on occupational mobility has also been the preferred approach in relation to the U.S. Department of Labor Dictionary of Occupational Titles (Miller, Trieman, Cain, & Roos, 1980). A substantial body of work on coding systems, both for the U.S. census and for inter-agency work within the U.S. Government, undertaken in the first half of the 20th century and summarized by Palmer (1939), concluded that rigorous and consistent classification based on a distinction between skilled and unskilled work was infeasible. In contrast, both Herschberg (1976) and Morris (1990), in the U.K. context, have stressed the need for any coding system to reflect both occupational characteristics and industrial sector affiliation. This is particularly significant, as the NAPP-modified HISCO system (Roberts et al., 2003) does not do this as part of its structure, though some specialized occupations will tend to be associated with particular industrial categories.
General Design Criteria for the New System
There are four general design criteria for the system that need to be explained prior to detailed treatment of the individual components of the overall structure. The first derives from the requirement stated above that the system must be able to code both census/vital registration records and occupational data from industrial/company records. The HISCO system was
Hierarchy of Levels in the New Coding System.
Individual-Level Sub-Codes
From Table 1, it is apparent that the numbering of levels runs from the highest at 8 (the most general) to 1 (the lowest and most disaggregated).
The lowest three levels (1-3) are designed to provide compatibility with the NAPP-modified HISCO coding system, while also enabling this system to be extended substantially to include the much wider range of specific occupations found in non-census records. The maximum possible compatibility is sought, subject to the constraints of a strictly hierarchical coding system. While HISCO very largely meets these requirements, there are occasions where lack of consistency or other considerations required slight modifications to be made or minor re-naming to take place. The effects of these changes are limited, however, so the overwhelming majority of HISCO codes can be readily identified within the new system. One modification that does require highlighting at the outset is that the codes as they appear on the NAPP list gain one additional digit to the left and two to the right in the present system. Partly as a consequence of this, the division of the code into levels for hierarchical decomposition differs slightly from that in the original HISCO scheme. In the latter, codes have the form 9.99.99, where 9 is the placeholder. The highest level (one digit) represents major groups, the middle level (two digits) minor groups, and the lowest level (two digits) the individual occupational categories (Van Leeuwen et al., 2004). In the present scheme, the structure is 09.9.99900, using zeroes to show the new digits, but, as can be seen, the five HISCO digits remain, though divided slightly differently.
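The relationship between the two layouts can be made concrete with a small Python sketch. It assumes, purely for illustration, that an unmodified five-digit code is carried over mechanically and that the three new digit positions are filled with zeroes, as in the 09.9.99900 pattern above. The function names are hypothetical, and the values the new digits actually take in the system are not specified here.

```python
def split_hisco(code: str) -> tuple[str, str, str]:
    """Split a five-digit HISCO-style code into the 9.99.99 groups:
    major group (1 digit), minor group (2 digits), occupation (2 digits)."""
    if len(code) != 5 or not code.isdigit():
        raise ValueError(f"expected five digits, got {code!r}")
    return code[0], code[1:3], code[3:]


def to_extended_layout(code: str, left: str = "0", right: str = "00") -> str:
    """Place a five-digit code in the eight-digit 09.9.99900 layout.

    `left` and `right` are the new digit positions; zero-filling them is an
    assumption made here to show placement only.
    """
    split_hisco(code)  # reuse the validation above
    digits = left + code + right
    return f"{digits[:2]}.{digits[2]}.{digits[3:]}"


hisco_form = ".".join(split_hisco("71120"))   # NAPP/HISCO code for "miners"
extended_form = to_extended_layout("71120")
print(hisco_form, extended_form)
```

The five original digits appear in the same order in both forms; only the positions of the separators change, which is what allows existing codes to be recognized within the longer code.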
Examining the new expanded structure above,
Example Occupational Sub-Tree Leading to Detailed Categories of Rolling Mill Workers.
Expanding the Range of Occupational Titles—Coding Issues and Data Sources
Within the coding system, the occupational description that accompanies each code at the lowest level may be derived from a variety of sources. The more generic job titles correspond to those in the NAPP/HISCO system and are readily identifiable as a result. However, as previous studies have pointed out, the NAPP/HISCO coverage of occupational types found within different industrial sectors is very variable. Thus, there are 18 different NAPP/HISCO codes specifically associated with railroads, but only two codes for the mining industry (71120 “miners,” which unhelpfully includes both coal and ore miners, and the general and very rarely used 71190 “others working in mines and quarries”). The rather unclear code 71200 “mineral or stone treaters” might also be partially relevant, but it is not obvious from the code list whether this relates to mining activity as normally understood (see below). This comprehensive failure to recognize the occupational complexity of the mining industry was particularly striking to the present writer, because earlier unrelated analysis of anthracite mine payrolls in the 1880s and 1890s had already shown that about 200 distinct types of work could be identified without any difficulty. More recent work on payrolls from the 1860s, as this coding system was developed, and mine inspectors’ reports from the 1870s onward has further extended this list of job types to around 300 in total, not all of which are yet incorporated in the system (Healey, 2013).
Also, in an attempt to clarify the use of the code 71200 in the NAPP 1880 census data set, the detailed occupational transcriptions for workers with this code were examined on a sample basis. This revealed interesting and important lessons in the present context, both for those engaged in the coding of census occupations, and subsequent users of the coded data. As noted earlier, a unique characteristic of anthracite mines was the requirement to prepare the coal for market in large “breakers” on the surface (Hudson Coal Company, 1932). This meant that large numbers of boys (and some elderly ex-miners) were employed at each anthracite mine to help with this preparation process, which was only partially mechanized in the 19th century. Many thousands of these “breaker boys” or “slate pickers” (so-called because they removed rock or “slate” from the coal before it was loaded into railroad cars) were employed in the anthracite coalfields, but not in the bituminous mining regions, where breakers were not required. In the NAPP data set, these boys have almost all been coded to 71200. This separates them from “miners” per se and their NAPP/HISCO occupational description tends to mislead rather than inform, as they are never described as “mineral treaters” in the mining literature. The rationale for this coding decision is also much more apparent following the above explanation of the work of breaker boys, than it is when examining a code list to identify potential mining industry employees, as one group among many chosen for analysis. Put another way, if it is necessary to utilize detailed industry knowledge and the original occupational transcriptions to understand the use of a code, then it is probably not a very effective numerical shorthand. 
Further to this, from an industrial sector perspective, use of 71200 confounds coal mining-related activity with unrelated stone-dressing in quarries or mineral ore processing to an unknown degree in any county data set, though it is fortunate that anthracite is only mined in a very limited number of U.S. counties. In contrast, these breaker boys are always reported as an integral part of the anthracite mining industry, both in mine inspectors’ reports and in company payrolls themselves. The present system therefore identifies them under the code 10112209900220, which specifies that they are anthracite industry employees paid by the day to work outside the mine in the coal breaker. To avoid perpetuating the confusion generated by the 71200 code, a different code not used by NAPP/HISCO, namely 99000, has been used here, to provide the basis for sub-codes to match the lengthy list of job types found in coal breakers.
Although the NAPP/HISCO codes are more informative and differentiated for the railroad than the mining sector, the 18 codes still only represent a small fraction of the job types actually found in railroad employment. They are excellent for trainmen, who would fall under the “conducting transportation” heading, but not for the generic trades more prominently found in railroad shops, which were classified under “motive power.” To remedy this deficiency and provide a more balanced coverage of occupational types across the sector, two main sources were used to provide a list of about 300 job titles. These are the published payroll lists of the Baltimore and Ohio Railroad, which cover the years 1842 to 1857 (Baltimore and Ohio Railroad Company, 1842, 1852, 1858) and the reports of the Pennsylvania Bureau of Industrial Statistics (PBIS; Secretary of Internal Affairs, 1877, 1881). The latter are especially important, as they contain details of the occupational structure, including job titles, of large numbers of different railroads, large and small, within the state during the 1870s and 1880s. Although no set of listings can be considered exhaustive, comparison of the returns for the different railroads in Pennsylvania with those of the Baltimore and Ohio in Maryland and Virginia provides an excellent basis for the railroad job titles to be coded in the present system. Most importantly, unlike NAPP/HISCO, this list is not biased toward the trainmen. It provides good coverage across all three main departments of railroad operation and also extends to categories of railroad construction workers, as these were recorded in the large payroll list of the Baltimore and Ohio in 1857.
The PBIS returns are not limited to railroad reports, and they also provide details of employment in bituminous coal mines, in primary iron and steel manufacturing concerns, and in rolling mills. These have been utilized within the coding system, and, as would be expected, a distinction is made between workers with otherwise similar titles, depending on whether they were employed at blast furnaces or in foundries and so on.
Overall, across the various sectors, the system currently has 2,372 occupational entries, each of which has the eight levels of the hierarchical code structure attached, making nearly 19,000 code values, though the number of different job titles is much smaller (814), because generic occupations, such as machinist, will be found under several sector and sub-sector headings. Any user would only use the 2,372 Level 1 codes; the others are used by the data warehouse in which the code system is embedded. In any primary data source, a number of individuals will have incomplete employment attribution. For example, some blacksmiths may be known to be employed by a given iron and steel works, but it is not stated whether they worked at the blast furnace or the foundry. To allow for this, every level of breakdown enables workers without a more precise lower-level classification to be coded as “unspecified” and the numbering convention is standardized for this (codes end in “95”). Examples would be “Inside Mine Worker Unspecified Occupation” or “Bloomery Worker Unspecified Occupation.” Further to this, there is a group of codes for generic occupations in industries other than those currently handled in detail. These codes still convey slightly more information than their HISCO equivalents, as identified workers in these occupations in the main heavy industry categories have already been separated out. For example, they allow coding of individuals who are recorded in city directories as carpenters in furniture manufacturing plants. This flags the existence of some additional information in the original sources about these workers, should that be needed. If such industries as furniture are specifically coded in future, these workers can be retrieved from the data set and given more precise codes at that time.
Where no industrial sector attribution is available, separate codes again are available to cover individuals who are simply recorded as “carpenters” or “blacksmiths,” and there are the usual “catch-all” codes for individuals who lack the information to enable them to be otherwise usefully classified.
While the hierarchical structure closely guides the coding process, the opportunity has been taken to standardize certain code components to facilitate the retrieval of individuals with particular employment characteristics that may span multiple occupations. For example, all titles that include the word “helper” or “assistant” end in the digit “1” at the lowest level. An assistant master of machinery nevertheless holds a very different level of job from a blacksmith’s helper, so the two need to be separable. Because each code has a fixed number of digits, the simple application of a format mask to the code allows assistants in supervisory posts to be distinguished easily from helpers in standard trades. A similar convention applies to apprentices, all the codes for which end with the digit “2.” This approach equates to that used by Herschberg, but differs from HISCO, where such qualifying information would have to be found in the separate STATUS code.
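The retrieval convention just described can be sketched in a few lines of Python. The sketch assumes 14-digit codes, matching the length of the breaker-boy example code quoted earlier, and uses “?” as a single-character wildcard in the format mask. All sample codes and the leading-digit mask below are invented placeholders for illustration; they are not entries from the published code list.

```python
from fnmatch import fnmatchcase

CODE_LENGTH = 14  # assumption: length of the example code in the text

# Hypothetical sample entries; the digits are placeholders only.
SAMPLE = {
    "20000000000010": "Blacksmith",
    "20000000000011": "Blacksmith's Helper",
    "20000000000012": "Blacksmith's Apprentice",
    "30000000000011": "Assistant Master of Machinery",
}

# Masks built on the fixed code length: "?" matches any one digit.
HELPER_MASK = "?" * (CODE_LENGTH - 1) + "1"
APPRENTICE_MASK = "?" * (CODE_LENGTH - 1) + "2"
# Hypothetical: leading "3" stands in for a supervisory branch.
SUPERVISORY_ASSISTANT_MASK = "3" + "?" * (CODE_LENGTH - 2) + "1"


def select(entries: dict[str, str], mask: str) -> list[str]:
    """Return the titles whose codes match the mask, sorted by title."""
    return sorted(
        title
        for code, title in entries.items()
        if len(code) == len(mask) and fnmatchcase(code, mask)
    )


if __name__ == "__main__":
    print(select(SAMPLE, HELPER_MASK))
    print(select(SAMPLE, APPRENTICE_MASK))
    print(select(SAMPLE, SUPERVISORY_ASSISTANT_MASK))
```

In a relational database the same effect could be obtained with a pattern-matching predicate on the code column; the point is only that a fixed code length makes positional masks reliable.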
Operational Use of the New Coding System
As noted at the outset, the new coding system is a key component of a future process of what might be termed
Preliminary findings indicate the feasibility of abstracting many thousands of new data records from city directories, though accurate matching to census records is a resource-intensive process. It has also been found, as was anticipated, that managing lengthy numerical codes in the course of manual coding of directory and other data is problematic for the personnel involved, though the subsequent computational use of the numerical codes is very straightforward and effective. To bridge this operational gap, while retaining the considerable database/data warehouse benefits of the numerical codes, a structured set of mnemonic character abbreviations has been devised, and the railroad sector is being used as a first test of their effectiveness. These abbreviations correspond to the relevant sections of the numerical codes, so automated conversion can be undertaken, but unlike the numerical codes, early experience shows they are finding ready acceptance for manual coding purposes. This is facilitated by the fact that a limited group of occupations, such as brakeman (mnemonic = rrb) and conductor (mnemonic = rrcn) in the railroad sector, account for a significant proportion of all employees, so the most frequently used mnemonic codes can be memorized quickly through repetition. After testing is completed, these mnemonics and the conversion tables will be made publicly available in the same way as has already been done for the numerical codes, so other research groups can utilize them if desired.
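As a rough illustration of how such automated conversion might work, the following Python sketch uses a simple two-way lookup table. The mnemonics rrb and rrcn are those given above, but the numerical codes paired with them are invented placeholders, since the conversion tables have not yet been released. The sketch also maps a whole mnemonic to a whole code, whereas the actual abbreviations correspond to sections of the numerical codes.

```python
# Placeholder numerical codes: NOT taken from the published code list.
MNEMONIC_TO_CODE = {
    "rrb": "00000000000001",   # brakeman (placeholder digits)
    "rrcn": "00000000000002",  # conductor (placeholder digits)
}

# Build the reverse table and confirm the mapping is one-to-one,
# so that conversion is lossless in both directions.
CODE_TO_MNEMONIC = {code: m for m, code in MNEMONIC_TO_CODE.items()}
if len(CODE_TO_MNEMONIC) != len(MNEMONIC_TO_CODE):
    raise ValueError("two mnemonics share the same numerical code")


def to_numeric(mnemonic: str) -> str:
    """Convert a manually entered mnemonic to its numerical code."""
    key = mnemonic.strip().lower()  # tolerate stray spaces and capitals
    if key not in MNEMONIC_TO_CODE:
        raise KeyError(f"unknown mnemonic: {mnemonic!r}")
    return MNEMONIC_TO_CODE[key]


def to_mnemonic(code: str) -> str:
    """Convert a numerical code back to its mnemonic."""
    if code not in CODE_TO_MNEMONIC:
        raise KeyError(f"no mnemonic for code: {code!r}")
    return CODE_TO_MNEMONIC[code]


if __name__ == "__main__":
    code = to_numeric(" RRB ")
    print(code, to_mnemonic(code))
```

Rejecting unknown mnemonics at entry time, rather than silently storing them, is one way of catching keying errors before records reach the data warehouse.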
Conclusion
The system resolves the key shortcomings of the HISCO coding system, by encompassing it within a much more sophisticated structure that allows comprehensive coding of data from both census and non-census sources, to a level of detail compatible with that provided in the original source documents. As it includes the HISCO codes, it maintains a very high level of compatibility with that approach, yet avoids the necessity for separate look-up tables, as provided by Wrigley for the PST system (Wrigley, n.d.). However, as Wrigley has helpfully provided such tables, a high degree of compatibility with the PST system can also be achieved by deploying these intermediate tables in conjunction with standard database queries. The new system has the important ability to standardize employment data from company payrolls and other industrial archives, as well as coding census and vital registration records. The fine breakdown of employment characteristics that it provides offers a much more nuanced approach to the analysis of inter-departmental, inter-sectoral, and geographical mobility than is possible using other coding systems. A further important motivation for its original development was to allow comparison of the demographic characteristics of sub-populations of generic workers in different industrial sectors, and this capability has already been demonstrated in a small case study in Baltimore (Healey et al., 2013). Further work on re-coding selected data from the 1880 census in a data warehouse context is planned to develop this approach on a larger scale.
Future Development of the Coding System
While an exhaustive list of occupational titles for different U.S. industrial sectors in the 19th century is probably an unattainable goal, a very comprehensive list can eventually be arrived at through comparison of multiple sources, both printed and archival. While much of the groundwork for this has been laid for the heavy industrial sectors, more can still be achieved by incorporating data from two late 19th century sources. The first of these is the report of the Commissioner of Labor (1890), on railroad labor, which contains complete lists of job titles for a small sample of major railroad systems across the country. While the vast majority of the common titles listed in this source are already in the system, some of the less common ones are not. The second is the Weeks Report and the associated database (Meyer, 2004; Weeks, 1884), which contains a large number of job titles in different industrial sectors, though it makes no claim of completeness. Ideally, more payroll information would also be incorporated, although payrolls both for railroads and large iron/steel works are surprisingly difficult to locate in any quantity for the latter part of the 19th century (see Knowles (2013) for examples of the use of earlier iron company records).
Another issue is that, over time, certain titles fell into disuse, or persisted in some regions but not others, or the nature of the work activity that they represented changed quite significantly, as technology moved forward. The original largely European focus of the HISCO system raises further questions, as many U.S. occupational titles differed from their British equivalents (measured in terms of the tasks involved), and the English translation of French or German terms may not correspond to U.S. usage of the word in question. Hence, there is a research agenda, not well articulated in the literature, that is wider than the field of comparability between occupations recorded in censuses internationally. As the new coding system aims to include original job titles, not a subset of standardized categories, it also has the potential to act as an index to more extensive textual resources that describe the “task lists” of apparently equivalent jobs in different companies, sectors, and locations and how these evolve or change over time. This would be a useful extension to the helpful occupational descriptions and illustrations already provided online for HISCO codes (History of Work Information System, 2013). Some progress in this direction has already been made, in terms of identifying published descriptions of railroad jobs in a range of companies from the 1860s onward. While such a textual resource is time-consuming to develop, it is straightforward to link it directly to the online version of the coding system. This would also serve to encourage contributions from the wider scholarly community to extend the resource, which would be of considerable value for future studies of work and labor in the United States during the 19th century.
