Abstract
What is already known?
Online texts are exponentially increasing in real time, providing new ways and spaces to conduct ethnographic research. The relatively small body of ethnographic literature focused on online spaces that have emerged to date has tended to employ highly specific and nonsystematic sampling techniques to identify and study sites and users of interest. Research into blogging, in particular, has focused on nonrandom, small-scale, longitudinal studies of bloggers across time and has not yet reached out to incorporate larger, comparative, cross-sectional analyses.
What this paper adds?
Online research is an area of increasing interest to qualitative social science researchers, but is still underexplored, especially given its importance in everyday life for billions of people. One possible reason for the underdevelopment of online research to date is an absence of clear methodological protocols upon which to rely when beginning research. Put more simply, ethnographic researchers, with a focus on in-depth knowledge of people and data, often feel overwhelmed by the sheer quantity of online resources and therefore choose to focus on very small online samples. In this article, we expand research into online qualitative methods by discussing methods for systematically sampling online blogs, with an emphasis on the technological barriers encountered during initial investigations and potential solutions to these barriers.
Methodological and Ethical Challenges in Blog-Based Text Analysis
The use of ethnographic techniques to explore online environments is a method of increasing interest to social science researchers, largely because of the increasing importance such spaces play in everyday life worldwide (e.g., Barratt & Maddox, 2016; Bonilla & Rosa, 2015; Bortree, 2005; Gehl, 2016; Graffigna & Bosio, 2006; Hookway, 2008; Horst & Miller, 2013; Huffaker & Calvert, 2005; Karlsson, 2007; Kaun, 2010; Lopez, 2009; McCullagh, 2008; Olive, 2013; Pink et al., 2016; Pitts, 2004; Postill & Pink, 2012; Qian & Scott, 2007; Sade-Beck, 2004; Steinmetz, 2012; Wilson, Kenny, & Dickson-Swift, 2015). Online environments are used to produce meaningful and complex interactions of all different kinds including providing places for people to connect with others for support and to share thoughts, ideas, and stories. In doing so, online environments provide new spaces in which people are able to perform diverse physical, emotional, and social identities (Boellstorff, 2012; de Laat, 2008; Dumova & Fiordo, 2012; Karlsson, 2007; Mautner, 2005; O’Brien & Clark, 2012; Pitts, 2004; Reed, 2005; Siles, 2011; Wilkinson & Thelwall, 2011). Virtual texts, like discussion boards or blogs, proliferate and would appear to be a rich source of material for qualitative analysis. The peculiarities associated with online interactions and presentations of self, however, mean traditional ethnography—with its reliance on in-person participant observation and interviewing—needs to be rethought. At the same time, however, we argue that ethnography, in particular its emphasis on exploring individuals’ discourse and actions within the context of their own chosen milieus via detailed observations and nuanced analysis, has huge potential to illuminate online interactions and identity making. Here we focus on ethnographic approaches to online data collection and analysis.
In this article, we address three main research questions. These are (1) What techniques can be used for systematic sampling in qualitative online research? (2) What technological barriers exist to implementing qualitative systematic sampling and how can they be overcome? and (3) What ethical dilemmas arise in qualitative online research and what strategies can researchers use to deal with them?
To explore these questions, we offer a case study of our own ethnographic study of U.S.-based online weight-loss weblogs (“blogs”). We also outline potential guidelines for ethnographic researchers interested in online research. This is particularly important given that ethnographic researchers, who focus on in-depth analysis of people and culture, often feel overwhelmed by the sheer quantity of online resources, spaces, and users and therefore do not often pursue collection and analysis of larger cross-sectional samples of online texts. We pay particular attention to the problems that arise when designing systematic sampling methods in an online environment formed in part by the technological constraints and economic incentives of blog hosting services and search engines. We also reflect on the ethical implications of deploying qualitative data analysis using online sources that are neither intended for private, diary-like usage nor directed at a specific public audience (McCullagh, 2008; Siles, 2011; Wilson et al., 2015), and we discuss potential ethical problems stemming from researchers’ identification of online identities across multiple online platforms including social media.
We suggest that an ethnographic approach to web-based research is generally successful in documenting and analyzing data extracted from blogs, while taking account of the different types of cultural contexts in which such data are constructed, but that an ethnographic approach on its own does not account for many of the other underlying factors structuring data that are specific to online environments. In the case of our analysis of blogs oriented around weight and weight loss, these factors shaped both the ways blogs were selected for inclusion, and the content of such blogs, which can be more or less targeted at meeting criteria to increase visibility on the web based on the bloggers’ familiarity with search engine optimization (SEO) and other technologies. We suggest the need for greater research attention to the trade-offs between privacy and identity in online environments and for greater interdisciplinary collaboration between qualitative researchers and computer scientists. In doing this, our goal is to contribute to what is still a relatively small body of online ethnographic research, outlining practical suggestions for social scientists who are interested in engaging with the burgeoning texts generated by virtual fora.
Online Research: Three Prior Approaches
Online spaces have already proven to be rich sources of research material for social scientists (e.g., Boellstorff, 2012; Bortree, 2005; Dickens, Thomas, King, Lewis, & Holland, 2011; Hookway, 2008; Huffaker & Calvert, 2005; Karlsson, 2007; Kaun, 2010; Lopez, 2009; McCullagh, 2008; Olive, 2013; Pitts, 2004; Sade-Beck, 2004; Steinmetz, 2012; Wilson et al., 2015). With the possible exception of virtual world-based ethnographies, however, systematic research methods flexible enough to account for researchers’ varied agendas, classic ethnographic emphases on understanding individuals in context, and the fluid nature of online content and identities have not yet been fully developed.
We have found it helpful to conceptualize research conducted in online environments into three separate categories: (1) the “big data approach,” (2) the “quantitative approach,” and (3) the “qualitative approach” (cf. Bernard, Wutich, & Ryan, 2016). Although these categories are a far from complete representation of the literature, they represent a useful tool for considering the methodological and disciplinary implications of online research focused on social patterns as it currently stands. Given the nature of our own research discussed here, we will pay particular attention to the third approach, but the other two remain vitally important for understanding the contours of current web-based research.
The big data approach is typified by initiatives pursued by corporations, government agencies, and universities, all of which employ quantitative methods to mine large data sets to look for patterns or to predict trends (Boyd & Crawford, 2012). Often, these research projects utilize metadata
To date, the qualitative approach to online research has most often been used by anthropologists and other qualitatively oriented social scientists in the context of ethnographic explorations of geographically bounded communities that have an online component (e.g., Coleman, 2010; Malaby, 2009; Miller & Slater, 2000; Ryman, Burrell, Hardham, Richardson, & Ross, 2009; Williams & Jacobs, 2004). The frequent use of geographically bounded communities (like school groups) as online study subjects may reflect qualitative researchers’ desire for greater context in which to situate online interactions, but it is not the only strategy for exploring these digital spaces. The qualitative approach has also been deployed to ethnographically explore immersive and self-contained virtual worlds such as
A more recent trend in online qualitative research is an increasing interest in users’ behavior on social network sites/social media platforms (e.g., Lijadi & van Schalkwyk, 2015; Murthy, 2008; Snelson, 2016; Tufekci, 2008). Boyd and Ellison (2007) define social network sites as web-based services that allow individuals to (1) construct a public or semi-public profile within a bounded system, (2) articulate a list of other users with whom they share a connection, and (3) view and traverse their list of connections…within the system. (p. 211)
Blogs as Ethnographic Texts
Despite the increasing interest in online ethnographies, blogs remain fundamentally understudied as ethnographic texts in the qualitative approach to online research. Here we define an ethnographic text as a piece of writing that explores cultural phenomena from the point of view of community “insiders.” A blog is typically defined as a website with a series of frequently updated, reverse chronologically ordered posts containing original content written by one or more authors (bloggers), where each post includes opportunities for comments and links intended to promote a virtual community (Blogpulse, 2015; Blood, 2000; Garden, 2011; Hine, 2008; Hookway, 2008). Blogs have been spaces for both personal use and community building since their original inception in the earliest days of the Internet as compilations of links to other sites of interest (Herring, Scheidt, Bonus, & Wright, 2004). In contrast to these first collections of ephemera, however, modern blogs generally serve as personalized narratives, performative spaces, and self-reflective commentaries for both the blog writers themselves and the readers with whom they establish relationships (Karlsson, 2007; Konovalov, Scotch, Post, & Brandt, 2010; de Laat, 2008; Monaghan, 2005; Pitts, 2004; Reed, 2005; Serfaty, 2004).
Social scientists have long studied the effects that different kinds of physical spaces have on the particular types of performativity individuals choose to exhibit and the implications for identity projects (e.g., Bourdieu, 1986, 1984; Butler, 1997, 1993, 1990). In the context of virtual spaces, it has been argued (e.g., Nardi, Schiano, Gumbrecht, & Swartz, 2004) that blogs are imbued with such a strong sense of the author’s personality and attitudes that they constitute online diaries, but blogs are also written for an audience and are thus designed to elicit feedback and social networking (de Laat, 2008; Herring et al., 2004; Karlsson, 2007; Pitts, 2004; Rausch, 2006; Reed, 2005; Wilson et al., 2015). This mix of heavily individualized function transposed upon the inherently public space of published online content makes blogs surprisingly difficult to compare to other ethnographic texts such as field notes, interview transcripts, and traditional diaries.
Blogs, situated at the intersection of public and private, can therefore be considered a type of social media, although the practice of blogging predates many other forms of social media associated with particular corporations (e.g., Facebook, Twitter, etc.; Wilson et al., 2015). Furthermore, unlike more bounded social networking sites where all users share a single service provider’s infrastructure, online hosting for blogs is not provided by a single organization. Instead, many possible blogging platforms (WordPress, Medium, Blogger, etc.) are available to bloggers. Blogging is also not restricted to a single type of content: Blogs can be a means for presenting introspective thinking, a record of daily events, a tool for political mobilization, a journalistic project, an open-ended literary experiment, a constant exhibition of images and videos and, in many cases, a combination of all of the above. (Siles, 2011, p. 738)
Prior qualitative research that has analyzed blogs—which are neither physically bounded nor considered “virtual worlds”—has tended to (1) employ small sample sizes, (2) use other forms of data collection to supplement the data extracted from blogs, (3) focus on longitudinal analysis of posts by the same blogger at different points in time, and (4) adopt convenience, snowball, or other nonrepresentative sampling methods (e.g., Bortree, 2005; Davis, 2010; Dickens et al., 2011; Hookway, 2008; Huffaker & Calvert, 2005; Karlsson, 2007; Lopez, 2009; Miura & Yamashita, 2007; Olive, 2013; Pitts, 2004; Qian & Scott, 2007; Sanderson, 2008; Wilson et al., 2015). While each of these trends represents a legitimate choice for approaching research questions in online environments, we argue that they may also reflect mitigation strategies for navigating the difficulties of conducting such research, although they are seldom explicitly framed as such.
Blogging as a Weight-Loss Narrative
Our interest in blogging as a space for public and private identity creation made blogs oriented around weight and weight-loss concerns an appealing source of data. Health, weight, weight loss, and body projects are common topics of conversation across multiple arenas in the United States today and are simultaneously intensely personal and open to public debate and comment (Becker, 1995; Boero, 2012, 2010; Bordo, 1993; Brewis, 2014, 2011; Campos, 2004; Casper & Moore, 2009; Granberg, 2006; Greenhalgh & Carney, 2014; Nichter, 2001; Puhl & Heuer, 2010, 2009). It is within this larger discourse on weight and health that the so-called weight-loss blogosphere has developed. Although this term has been used by a few researchers thus far (e.g., Lynch, 2010; Rausch, 2006), it remains vague. Leggatt-Cook and Chamberlain (2011), for instance, in their study of female bloggers, based their definition and selection criteria on whether or not a blogger explicitly stated that she was blogging to support her weight-loss efforts, but they also point out the diversity of approaches and attitudes contained within these parameters. Other researchers (e.g., Boepple & Thompson, 2016) have focused specifically on the more controversial phenomenon of “fitspiration” and “thinspiration” websites (websites that promote fitness and showcase fit bodies and websites that promote and showcase thinness and thin bodies, respectively), but the websites in question are not exclusively blogs, nor representative of the wide array of attitudes toward weight and weight loss present in online environments. These prior attempts illustrate the heterogeneity of data available for analysis: Clearly, online environments offer rich opportunities for qualitative researchers interested in weight loss, but gathering and analyzing these data in a systematic and representative way remain a challenge for researchers.
Leggatt-Cook and Chamberlain (2011) note that weight-loss bloggers are diverse in terms of their chosen method (diet, exercise, surgery, etc.) and express a range of motivations for choosing to blog about weight loss, but most express interest in creating a community that will not only help them in their attempts to lose weight but also support them during periods of doubt and failure and provide affirmation if and when they successfully lose weight. Thus, public documentation of private endeavors to lose weight is both a strategy for being held accountable and a way to garner less critical support of the self being put on display. As Leggatt-Cook and Chamberlain (2011) also discuss, in a larger environment in which fat bodies are routinely stigmatized, the fact that the blogger has more control over how his or her self is constructed for the online audience (via witty narratives and self-deprecating humor, for instance) is also important. We found that the diversity of approaches toward weight loss, as well as the ability of bloggers to manipulate their presentation of self to readers, produced complicated and sometimes conflicting methodological and ethical challenges. The latter issue is certainly something that other researchers interested in social media users more broadly must contend with.
The Case of the Weight-Focused Blogger Study
As previously discussed, social scientists have approached online research in general with a wide array of qualitative methods, but fewer studies have exclusively used blogs in their analysis. The aim of the project was to collect narratives from individually authored, public access blogs that discussed body and weight loss and were produced by individuals of any gender living in the United States, highlighting the potential ways in which blogging about weight loss projects influences health. The challenge was to ensure that we obtained a representative and systematic sample of these U.S.-based weight-loss blogs without having a clearly defined sampling frame, while also staying true to basic ethnographic principles that emphasize and contextualize individual lived experiences and perspectives. Here we present our process for a systematic sampling method for user-generated online texts that are outside the boundaries of a single website (e.g., a message board) or content host (e.g., Facebook).
We began by establishing preliminary inclusion and exclusion criteria for the sample. In order to limit possible variation in cultural norms and messaging, we sampled only blogs written by persons who explicitly disclosed in their narratives or “About Me” pages that they were residing in the United States. Even if bloggers disclosed little else about their life outside virtual reality, every writer we sampled did make some reference at some point to their physical location within the United States. Bloggers in the United States are generally exposed to a broadly obesogenic but heavily fat-stigmatizing environment (Boero, 2012, 2010; Brewis, 2014, 2011; Campos, 2004; McCullough & Hardin, 2013; Puhl & Heuer, 2010, 2009). Multiauthored blogs were likewise excluded, as one of our research questions about weight loss explored changes in attitude toward weight loss in a single individual over time (see Trainer, Brewis, Wutich, Kurtz, & Niesluchowski, 2016). Additionally, we sampled only bloggers who were directly pursuing weight-loss attempts, discussing weight loss or fat stigma, or otherwise engaging in an online space defined by the topic of weight. Sampling was further complicated by our choice to use individual blog posts, rather than the blog itself, as the unit of analysis. Thus, the sample included both blogs that self-identified as weight-loss blogs and blogs that were a mix of topics.
Data collection took place in four stages (see Table 1). In Stage 1, a nonsystematic random sample of blogs was selected using Google for the purpose of generating pilot research. Highly general search terms including weight-loss blogs and “weight discrimination” were used to generate a broad selection of results; these terms were selected based on prior research indicating they would yield a wealth of results relating to weight and weight loss (see Bair, Kelly, Serdar, & Mazzeo, 2012; Ballantine & Stephenson, 2011; Boepple & Thompson, 2016; Das & Faxvaag, 2014; Dickens et al., 2011; Harding & Kirby, 2009; Hwang et al., 2010; Leggatt-Cook & Chamberlain, 2011; Manikonda, Pon-Barry, Kambhampati, Hekler, & McDonald, 2014; Pitts, 2004; Rausch, 2006; Saperstein, Atkinson, & Gold, 2007; Tiggemann & Miller, 2010; Walstrom, 2000). The results provided by these terms were not included in the final sample, as the search terms that produced them were not systematically selected (i.e., we selected them ourselves, based on prior research) but were instead analyzed to refine search terms and phrases in the second phase of data collection.
Table 1. Cross-Sectional Data Collection Using Weight-Oriented Blog Posts.
In Stage 2, 12 new search terms based on frequently repeated and/or highly salient words or phrases found in Stage 1 results were used to query the three most popular search engines in the United States: Google, Bing, and Yahoo. These terms also diversified the sample in terms of blogger self-identified gender because of the nature of many of the salient phrases (e.g., “fat guy on a diet” vs. “fat girl on a diet”). We systematically tested the effects of adjacency and proximity operators like quotation marks with each search engine using each search term or phrase. As all three search engines present ranked results based on a website’s relevance to the search terms used, only the first 50 returns meeting the previously mentioned inclusion and exclusion criteria on each search engine were considered, based on the assumption that these results would be most salient. Duplicate results within or across search engines were included only once. For queries that returned more than one blog post from the same blogger, we included whichever blog post was returned first by the search engine. The sample was then screened to ensure blogs met the previously listed inclusion criteria. A total of 112 blog posts were found eligible.
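The Stage 2 screening rules described above can be sketched in code. The following Python fragment is an illustration under stated assumptions, not the procedure the study actually ran: the `Hit` record, the `screen_results` function, the `blogger` identifier, and the eligibility callback are all hypothetical names, and the order in which hits are supplied stands in for the order in which results were reviewed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Hit:
    """One search result (hypothetical record; all fields are assumptions)."""
    engine: str   # label such as "google"; no request is made
    rank: int     # 1-based position in that engine's ranked results
    url: str      # address of the blog post
    blogger: str  # identifier assigned by the researcher


def screen_results(
    hits: Iterable[Hit],
    is_eligible: Callable[[Hit], bool],
    per_engine_cap: int = 50,
) -> List[Hit]:
    """Apply the screening rules to hits supplied in review order."""
    kept: List[Hit] = []
    seen_urls = set()
    seen_bloggers = set()
    eligible_per_engine = {}

    for hit in hits:
        # Rule 1: only results meeting the inclusion criteria are considered.
        if not is_eligible(hit):
            continue
        # Rule 2: consider only the first N eligible returns per engine.
        count = eligible_per_engine.get(hit.engine, 0)
        if count >= per_engine_cap:
            continue
        eligible_per_engine[hit.engine] = count + 1
        # Rule 3: duplicates within or across engines are included once.
        # Rule 4: keep only the first-returned post for each blogger.
        if hit.url in seen_urls or hit.blogger in seen_bloggers:
            continue
        seen_urls.add(hit.url)
        seen_bloggers.add(hit.blogger)
        kept.append(hit)
    return kept


example_hits = [
    Hit("google", 1, "a.example/post-1", "ann"),
    Hit("google", 2, "a.example/post-2", "ann"),   # same blogger, later rank
    Hit("google", 3, "spam.example/page", "bot"),  # fails eligibility
    Hit("bing", 1, "a.example/post-1", "ann"),     # duplicate across engines
    Hit("bing", 2, "b.example/post-1", "bob"),
]
screened = screen_results(example_hits, lambda h: h.blogger != "bot")
```

In this toy input, only the first post by "ann" and the post by "bob" survive screening. Treating a duplicate as counting toward an engine's cap of 50 is one possible reading of the rules; a study could reasonably decide otherwise.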
Stage 3 of data collection used purposive sampling to expand the diversity of blogs represented, particularly in terms of blogger sexual orientation, geographic location within the United States, religious affiliation, and race, all demographic factors that previous research on fat and obesity in the United States has posited as influencing attitudes toward fatness and weight loss (e.g., Boero, 2012, 2010; Brewis, 2014, 2011; Granberg, Simons, & Simons, 2009; Greenhalgh & Carney, 2014; McCullough & Hardin, 2013; Puhl & Heuer, 2010, 2009). Frequently occurring phrases from the 112 blogs selected in Stage 2 were used to seed this new round of queries. Otherwise, identical inclusion criteria to those used in Stage 2 were used to screen results in Stage 3. Research centered on weight-loss attitudes and practices in offline environments indicates that although weight loss is a nearly universal concern in the United States, the ways this concern is expressed and acted upon differ markedly, which is why we adopted purposive sampling (Becker, 1995; Boero, 2012, 2010; Bordo, 1993; Brewis, 2014, 2011; Campos, 2004; Casper & Moore, 2009; Granberg, 2006; Greenhalgh & Carney, 2014; McCullough & Hardin, 2013; Nichter, 2001; Puhl & Heuer, 2010, 2009). Ultimately, an additional 86 blog posts were added in this stage.
In Stage 4, all previously selected search terms and phrases were queried using DuckDuckGo, a search engine that does not optimize results based on the collection of users’ demographic information or prior search history. This final set of queries helped to validate the representativeness of our sampling results by replicating and expanding the sampling frame extracted from Google, Bing, and Yahoo. This final stage yielded 36 additional blog posts that had not appeared in previous search engine queries.
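One way to make the replication check in Stage 4 concrete is to compare the set of posts found through the earlier engines with the set returned by the non-personalizing engine. The sketch below is a hypothetical illustration: the function name `replication_summary` and the toy URLs are assumptions, and the study does not report an overlap figure.

```python
def replication_summary(earlier_urls, check_urls):
    """Compare a validation query set against previously sampled posts."""
    earlier = set(earlier_urls)
    check = set(check_urls)
    replicated = earlier & check   # posts found by both sources
    new = check - earlier          # posts that expand the sampling frame
    share = len(replicated) / len(check) if check else 0.0
    return {
        "replicated": len(replicated),
        "new": len(new),
        "share_replicated": share,
    }


# Toy data only; these are not results from the study.
summary = replication_summary(
    earlier_urls=["a.example/1", "b.example/1", "c.example/1"],
    check_urls=["b.example/1", "c.example/1", "d.example/1"],
)
```

A high replicated share would suggest that personalization had little effect on which posts surfaced, while the new posts are the ones eligible to be added to the sample.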
By the end of data collection, therefore, we had collected 234 blog posts from 234 different bloggers. For further contextualization of blog posts and to ensure the diversity of our sample, we also collected demographic and background information on the bloggers themselves using their About Me pages and/or personal information contained within the blog narratives. We then compiled the information on age, gender, education level, socioeconomic status, weight-loss history, health concerns, and attitudes toward blogging. The constructed nature of a blog did not allow us to verify background information without a serious breach of bloggers’ privacy (just as a background check to “verify” an interviewee’s chosen self-presentation could easily violate privacy in an in-person interview-based study). Although previous studies indicate that image management strategies are common on blogs (Bortree, 2005) and bloggers may reduce the amount of personal information shared if privacy is a concern (Krasnova, Günther, Spiekermann, & Koroleva, 2009), deception does not appear to be common. Age was difficult to establish, for example, but all but three bloggers disclosed their gender status.
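The sample total follows directly from the stage counts reported above (Stage 1 results were excluded from the final sample). A short check, using variable names chosen only for illustration:

```python
# Posts added at each stage, as reported in the text.
stage_counts = {"Stage 2": 112, "Stage 3": 86, "Stage 4": 36}

total_posts = sum(stage_counts.values())
print(total_posts)  # 234
```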
Data analysis of the 234 blog posts followed qualitative thematic coding methods as described in Bernard, Wutich, and Ryan (2016). Thematic coding is a technique, often used in ethnographic research, designed to draw out salient themes, in this case related to the issues of weight and weight loss, and allows comparisons both across blogs and within the same blog through time. We developed our codes inductively through an iterative process that drew out repeated patterns present in the data (Bernard et al., 2016). Our coding schema addressed the following areas: overall writing style and tone of each blog post, expressed attitudes toward weight and weight loss within the post, and bloggers’ portrayals of their own approaches to weight and weight loss.
Using the same sample of 234 bloggers, we then looked longitudinally at their posts across a 10-year period, 2005–2015 (the end date of our data collection). Given the variation in post frequency across many of these bloggers, as well as the sheer volume of narrative material some of the bloggers generated during the period, we established specific time parameters on our search, examining blog entries that had been written in January and June only, following Rausch’s (2006) suggestion that these months witness particular cycles of preoccupation with weight gain and loss in the U.S. context (the former because of New Year’s resolutions made after holiday feasting and the latter because of the incoming swimsuit season). For authors who did not begin blogging until after 2005, the earliest available post that fit the selection criteria was used. In the longitudinal sample, we switched from a consideration of the blog post itself (the one collected via search engine) as the unit of analysis to a consideration of the blogger as the unit of analysis, and we explored shifts in attitudes toward weight and weight loss over time, as well as actual reported changes (or not) in blogger weights (for a more detailed discussion of the methods and results of our longitudinal thematic coding analysis, see Trainer et al., 2016).
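The time parameters for the longitudinal sample amount to a simple date filter. The sketch below is illustrative only: the function name, the constants, and the choice of December 31, 2015 as the cutoff are assumptions, since the text gives only the year in which data collection ended.

```python
from datetime import date

STUDY_START = date(2005, 1, 1)
STUDY_END = date(2015, 12, 31)   # assumed cutoff; only the year is reported
TARGET_MONTHS = {1, 6}           # January and June


def longitudinal_posts(post_dates):
    """Return a blogger's in-window January and June post dates, oldest first.

    For a blogger who started after 2005, the first element is simply the
    earliest available post that fits the month and window criteria.
    """
    return sorted(
        d for d in post_dates
        if STUDY_START <= d <= STUDY_END and d.month in TARGET_MONTHS
    )


# Toy dates for a hypothetical blogger who began posting in 2007.
selected = longitudinal_posts([
    date(2007, 3, 2),    # wrong month
    date(2008, 1, 5),
    date(2007, 6, 10),
    date(2016, 1, 1),    # after the study window
])
```

Here `selected` contains the June 2007 and January 2008 posts, with June 2007 serving as that blogger's starting point.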
Producing a rigorous method for sampling and analyzing textual data collected from blogs proved a complex endeavor. At the project’s inception, we anticipated that the opportunity to use machine-driven sampling methods like search engines would accelerate and simplify data collection, while blog narratives would be well suited for text analysis because of their self-contained nature. These assumptions were repeatedly challenged during the course of our study because (a) key characteristics of search engine technology were structurally contradictory to the principles of systematic sampling and (b) bloggers did not always construct a single identity contained on a single blog but often built multiple, shifting personas across multiple, shifting types of social media.
Sampling Challenges in Blog-Based Narrative Data Analysis
Although scientists are increasingly utilizing online environments as sites for qualitative research, approaches to sampling in these environments have largely reproduced the strategies used for research in offline contexts, while acknowledging the difficulties of sampling in sometimes highly fragmented digital spaces (Davis, 2010; Stern, 2004). Sampling in offline spaces generally uses a list from which some elements are chosen based on certain criteria (randomness, representation of specific types of cases, etc.; Bernard, 2012). Numerous sampling and recruitment errors and biases have been previously discussed in the literature, particularly with regard to surveys (Dillman, 1991; Groves et al., 2009; Groves & Lyberg, 2010; Wright, 2005) and qualitative methods (Bernard, 2012; Guest, Bunce, & Johnson, 2006; Marshall, 1996). Sampling and recruitment errors can affect the generalizability of results in representative samples and prevent researchers from reaching thematic saturation in ethnographic research using purposive or convenience samples (Bernard, 2012; Groves et al., 2009). In considering the possibilities for similar errors in online environments, considerable attention has been devoted to sampling methodologies in online survey and focus group research (Bethlehem, 2010; Boydell, Fergie, McDaid, & Hilton, 2014; Couper, 2000; Dillman & Bowker, 2001; Fan & Yan, 2010; Sinclair, O’Toole, Malawaraarachchi, & Leder, 2012). The recruitment and sampling errors potentially present in online
We attempted to answer our first research question (what techniques can be used for systematic sampling in qualitative online research?) through the adaptation of nondigital sampling methods to create a novel method for sampling online texts through the use of “seed” search terms, as described in Stage 1 of our sample. We then utilized an iterative sampling strategy (Stages 2–4) in an attempt to ensure a systematic and representative sample. We encountered several technological impediments to these goals. The sampling strategy outlined in our study is predicated upon the use of search engines (Google, Bing, Yahoo, and DuckDuckGo) to identify and extract potentially eligible blog posts. Employing a search engine such as Google to access this type of online content is intuitive and natural to anyone who has spent a significant amount of time on the Internet, whether in an academic or lay context. Several aspects of search engine-based sampling frames may be problematic from a methodological standpoint, however, and it is worth discussing each of these in detail.
The first issue stems from the fact that search engine query results are presented to the user hierarchically based on certain criteria. These criteria can include attempts by search engines to determine a result’s salience to a user’s query. Google’s
Another issue arises from the fact that search engines further enhance their results by considering a user’s previous search history and which results received a click-through in order to personalize the results received by a user. To track this information, many websites use individualized text files called browser “cookies” that are placed on personal computers when the website is visited. Each unique file contains information that notifies the website of the user’s previous visits. Functionally, cookies act as a website’s “memories” of users. In the case of search engines, cookies help track queries from specific computers, and this information is then used to predict which results are most desirable to that user in the future (Google, 2015). Search engines that do not employ cookies or other tracking mechanisms are available—again, our use of DuckDuckGo to check sample results in this case study was important. Here we wish to draw attention to the fact that cookies and related tracking technologies are a potential source of error when constructing samples in online environments. They are also largely invisible to the user. A researcher, therefore, must have prior familiarity with the underlying technologies and strategies used to produce search engine results before this source of error can be accounted for.
Rankings can also be manipulated, however, by blogger or other content producer-driven strategies, which are collectively often referred to as SEO. SEO is now an industry of its own, with consultants frequently hired by online businesses to increase their rankings in search engine results in order to receive greater visibility or to increase income from advertising hosted on their webpages (Furnell & Evans, 2007; van Couvering, 2004, 2007). “Cloaking,” or designing a separate page optimized for discovery by search engine indexing algorithms, is an example of SEO (Malaga, 2008). Cloaked pages are inaccessible using standard web browsing methods and will never be seen by web users; their only function is to increase the ranking of the website in search query results by meeting search engine criteria. Factors such as ranking and SEO strongly affect search engine results but are not discussed in search engine results pages or otherwise made obvious to users. Because these techniques are also invisible to the user, we cannot even detail the ways in which they may have affected our sample, but we certainly noticed that three or four bloggers appeared in every single one of our search phrase queries.
A final problematic aspect of search engines that we wish to bring up in this article is that search terms commensurate with search engine algorithms are difficult to generate empirically and systematically. Wilkinson and Thelwall (2011) point out that the construction of relevant key words is surprisingly difficult, due to a range of factors including polysemy, synonymy, and outright spam (messages sent via automated messaging systems). As previously discussed, we used an unsystematic and informal list of key words (e.g., weight-loss blogs and weight discrimination) as seed terms to locate and analyze a first round of blog posts. The content of these blog posts was then used to create a combination of natural language and key word search phrases (e.g., “my struggles with weight loss,” fat girl on a diet, and “dieting body acceptance”). These phrases were based on empirical data (blog posts) rather than heuristics (what the researchers thought would yield good results), but they may also be out of sync with search algorithms, which attempt to interpret natural language in a machine-compatible way (Google, 2015). Since search engines make certain assumptions about topic words and allow particular operators, the phrase “my struggles with weight loss” yields different results from the collection of key words “weight-loss struggles.” Furthermore, over the course of our research, we found major disparities in bloggers’ comfort and expertise with online platforms and technologies. Bloggers with greater Internet savvy are more likely to be oriented toward key word rather than natural language terms in order to facilitate high rankings on search engines (Furnell & Evans, 2007), while bloggers with less familiarity with key word optimization may be listed lower in the query results or even omitted altogether.
Using empirical data to produce search terms, as we have done here, pushes forward online sampling methods but produces contradictions between systematic term generation and the underlying technological and economic structures shaping online content.
Ultimately, researchers working in online environments must, at minimum, document errors and biases in standard methodological techniques, but they can only do so when they are aware of them. Prior research indicates that search engine returns are biased in favor of commercial sites, popular sites, and U.S.-based sites (Croteau & Hoynes, 2006; van Couvering, 2004). By focusing on U.S.-based sites only, we eliminated one source of bias; by focusing only on private-authored blogs, we eliminated another. We noticed a trend in our results, however, whereby more widely read blogs, blogs written by individuals with more technological savvy, and blogs written by individuals who had successfully attracted commercial advertising were more likely to appear multiple times across many of our different search terms. These trends are possible consequences of the influence of search engine rankings, SEO, cookies, and the varying levels of Internet expertise across bloggers. The development of systematic methods documenting and correcting for these possible sources of error would be an important next step in the development of online research methods.
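As a first step toward such documentation, a researcher could tabulate how many distinct search phrases return each blog, so that blogs recurring across most or all queries are flagged explicitly in the sampling record. The sketch below illustrates this under stated assumptions and is not the procedure used in this study; the function names (`host`, `recurrence_by_blog`), the search phrases, and the URLs are hypothetical placeholders.

```python
from collections import Counter
from urllib.parse import urlparse


def host(url):
    """Return the lower-cased host name of a URL, without a leading 'www.'."""
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc


def recurrence_by_blog(results_by_query):
    """Count the number of distinct queries in which each blog host appears."""
    counts = Counter()
    for urls in results_by_query.values():
        # Use a set so a blog is counted at most once per query.
        counts.update({host(u) for u in urls})
    return counts


# Hypothetical, hand-recorded results keyed by search phrase.
results_by_query = {
    "phrase 1": ["https://blog-a.example/p1", "https://blog-b.example/p1"],
    "phrase 2": ["https://www.blog-a.example/p2", "https://blog-c.example/p1"],
    "phrase 3": ["https://blog-a.example/p3", "https://blog-b.example/p2"],
}

counts = recurrence_by_blog(results_by_query)
# Blogs returned by every search phrase are candidates for ranking effects.
in_every_query = sorted(
    h for h, n in counts.items() if n == len(results_by_query)
)

for blog, n in counts.most_common():
    print(f"{blog}: appears in {n} of {len(results_by_query)} queries")
print("Appears in every query:", in_every_query)
```

A tabulation of this kind does not correct for ranking, SEO, or personalization, but it makes the degree of recurrence visible and reportable alongside the sample.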
Ethical Challenges in Blog-Based Narrative Data Analysis
As previously discussed, the ability of blog authors and other online content producers to manipulate their presentation of self, and the unclear boundaries between blogs as spaces for private reflection versus as content for public consumption, have ethical implications of interest for social scientists pursuing research in online environments. In an effort to respect at least the most immediately apparent of these boundaries, we chose to sample only blogs that were public access rather than part of a restricted-access forum. Public-access blogs are written for an unrestricted audience and are thus officially in the public domain (Sixsmith & Murray, 2001). Public-access blogs, however, may still contain information that would be deemed highly personal in other contexts (McCullagh, 2008; Siles, 2011), and academics hotly debate the implications of using personal information gathered online. Some feel that bloggers and their material must remain anonymous in academic work (even though many bloggers and commentators use fictionalized web names to begin with), while others argue that it is an issue of document copyright and intellectual property, and thus, bloggers and their material must be cited properly (Pitts, 2004; Wilkinson & Thelwall, 2011). We adhere to the latter perspective here, with important exceptions. We follow Snodgrass (2015) in his suggestion that researchers adhere to “local understandings of what constitutes public as opposed to private online exchange” (p. 473). In our context, the bloggers we studied overwhelmingly treated text reproduction and publication as a copyright issue; that is, their blogs indicated they wish to be cited correctly, and we prioritized their explicit wish for correct citation over the possible additional privacy of remaining anonymous. An important subset of blogs did contain warning messages that the blogger did
Additionally, while it may be tempting for researchers to anonymize authors when sharing results based on the highly personal stories laid out in blogs, attempts to protect bloggers through anonymization also carry an ethical risk. Prior research suggests that blogger disclosures are intentional and that people are aware of the trade-offs between privacy and developing recognizable online identities and followers through repeated sharing of personal experiences (McCullagh, 2008). In some cases, therefore, anonymization may actually erase the bloggers’ own judgments about what is too risky or intimate to share and work against individuals’ attempts to construct online identity and community. Ultimately, we deferred to the blog authors themselves, insofar as possible, by using the stated requests on their blogs for quotations to be correctly cited (or, in some cases, their requests for their content not to be quoted) as a guide for their level of inclusion in our reporting of results. In cases where researchers are pursuing a more embedded ethnography that involves contact with participants engaged in digital content creation, it may be appropriate to ask informants directly what level of privacy they prefer.
We found that the public–private duality became even more complex when considered across multiple contexts. Producers of online ethnographic texts—in our case, bloggers—easily move across content platforms, and, perhaps even more importantly, researchers can
Even the very concept of “online” as a separate space from in-person interactions was challenged during the course of our research. Certain bloggers, for instance, attracted not only followers but also online trolls (defined variously by Urban Dictionary as “One who purposely and deliberately starts an argument in a manner which attacks others on a forum without in any way listening to the arguments proposed by his or her peers” and “Being a prick on the Internet because you can”). Some of these trolls not only made negative comments in the online comments section but also made threats against the bloggers that affected them offline as well (e.g., disrupting speaking events and public appearances). In these instances, therefore, what is the role of the researcher? Does one passively observe (“lurk”)? Does one weigh in? Does one publish controversial blog content that seems certain to attract more negative attention toward a particular blogger, even if that blogger allows the reproduction of the text of her blog? Online identities can merge with offline experiences to create ethically complex hybrid situations, and it is quite probable that future researchers will need to consider similar interstitial spaces, as the use of highly charged insults and harassment, including rape and death threats, is well-documented online (Consalvo, 2012; Jane, 2015; Mantilla, 2013).
Conclusion
In our attempts to understand what techniques are available to perform systematic sampling for online qualitative research and what technological barriers and ethical dilemmas researchers might encounter, we articulate a number of key challenges for researchers. During our qualitative study of weight-loss bloggers, we found that the translation of research methods between standard and online texts is not a simple one. Although online text analysis solves some problems faced in traditional qualitative research (such as gaining access to text), it also creates other, different challenges. Technological barriers, like search engine bias, could be managed but not eliminated and will likely require collaborations with computer scientists to deal with effectively. Ethical considerations, like the complicated ethics of working with bloggers and others who have developed unique and potentially highly identifiable identities across multiple online platforms, are yet to be fully appreciated and may require ongoing dialogue between researchers, ethical review boards, and key ethnographic informants.
A goal here is to open a wider discussion about how we can harvest blog and other online texts in ways that are systematic and replicable, while also respecting and protecting those who produce them. We suggest that this is a major emerging set of issues that will require broad consideration among qualitative researchers, complicated by rapidly changing technologies and a multitude of complex and sometimes conflicting strategies for creating and navigating identity in online spaces. Despite these methodological and ethical challenges, the use of blogs as ethnographic texts, and ethnography in online spaces more generally, offers promising new possibilities for studying text and those who produce it.
