Introduction
Be more white. Be more male. Be wealthier. Those are the biggest correlations with success. It’s terrible, but it’s the truth.
—Excerpt from interview with Don,1 a university administrator
“The truth” was taken for granted among data scientists, administrators, and programmers in their work on predictive modeling using institutional data to anticipate whether or not a student would graduate within a four-year period. This observation came from Don, a long-time administrator at the university who is engaged in the application of nudges2 derived from the predictive model’s outputs. Sitting across a table from me in the university’s student union, alert for eavesdroppers, he explained how he thought of such inequalities as common but mostly unspoken knowledge among faculty and administrators at the university. Over the course of my fieldwork, he explained to me and university stakeholders that the predictive model only drew from “behavior” data—what he described as “things students can change”—and not demographic markers like race, gender, and socioeconomic status. Don, in telling me “the truth,” suggested these demographic markers were accepted by stakeholders as not only immutable but also indicative of a student’s prospects for success.3
But despite the collectively understood disparities in success, a focus only on the “things students can change” serves to make demographic differences less obvious and integral to the university’s model of success. Through the sociotechnical obscuring of demographic data in predictive modeling and illumination of data that instead highlight behaviors that correlate with success, the likelihood of a student graduating in four years appears more contingent on those behaviors than demographic markers.
In this article, I explore an institutional shift from reliance on demographic data to what administrators, data scientists, and programmers at the university have constructed as behaviors. Universities have long been involved in reshaping demographic categories (Hanel and Vione, 2016), not only through admissions processes that shape student bodies but also through research conducted in association with institutions that influence culturally held ideals about meritocracy (Warikoo, 2016). As universities take a data-driven approach to such ideals, they increasingly seek predictive power, more context for enrollees and applicants, inventive ways to produce knowledge about students, and methods of sidestepping contentious issues around inequality (Selwyn, 2015). In a departure from demographic data, which are solicited through self-reporting and traditional markers of a student’s category membership, personnel tout behaviors as a better, more neutral alternative. Automated data collection promises more consistency, standard instruments for gathering, an expansive sample, and direct proxies. However, I argue that in practice the making of new data—behaviors—is no less fraught.
Making data
How do data become behaviors, and how do those behaviors come to take the place of demographic data in how the institution manages its student body? In this essay, I build on critical data studies scholarship (see Iliadis and Russo, 2016) by empirically demonstrating not solely that data are made but also how they are made.
Drawing from qualitative research I conducted at a large public university in the United States, I argue that by revising predictive modeling and nudging to focus even more intensively on what they deem behaviors, data personnel—data scientists, administrators, and programmers—position students as subjects of nudges responsible for their own success. The institutional reframing of success in terms of “what students can change” enables the institution to transfer the burden of success away from itself and keep the tacitly held knowledge of inequality out of the university’s visions for predictive modeling.
To demonstrate how data personnel produce behavior data in a way that enables the institution to minimize the impact of demographic markers on success, I first provide a brief overview of my research site and predictive analytics in higher education in relation to a larger landscape, the literature that underpins my analysis, and my methodological approach. I then address the typing of data into behaviors and attributes, the sorting of data into behaviors, maintenance of behaviors as accurate proxies, and nudging of students on behaviors as processes that assist personnel in solidifying behavior data.
Research site
This qualitative case study draws from ethnographic fieldwork I conducted at a large public university in the United States across a 12-month period. Data personnel characterize the institution, which hosts roughly 30,000 undergraduate students, as at the forefront of applying institutional data to success initiatives. As an administrator told me, “no one’s ever done the data mining like we’re doing the data mining,” alluding to the university’s computational approach to institutional research (IR), expansive data infrastructures, and vision for how data could revolutionize higher education. At the time of its foray into predictive modeling, the university was indeed unique for its repurposing of wireless network usage data as a proxy for students’ whereabouts. Moreover, it has digitized much of its institutional data. When I interviewed Henry, a data scientist, he got up to go to his bookshelf and returned with a massive binder, at least four inches thick. The binder held a printout of deidentified, aggregated student data, the precursor to what is now an online, interactive visualization.

M: Oh my god. Volume one of how many?

H: Uh, I want to say less than ten but more than two.

M: That’s a lot. That’s a lot of paper.

H: And they would do this every year. It would take them from the time the data was available until, like, December, just to be able to produce a book of this. Now, when the data’s available, that [interactive visualization is] updated the next day. So, like, that was the world that they were living in, so, like, when you’re living in that world, you don’t have time to do the advanced stuff.
Predictive analytics: From higher education to a broader landscape
The deployment of student data in higher education in the U.S. is widespread and under varying degrees of scrutiny, from the College Board’s short-lived “Adversity Score”4 to growing concerns about student privacy and dataveillance on campuses (Selwyn, 2015). The datafication of universities and surge of predictive analytics projects prompt questions about the future of higher education and what roles universities play in society.
Higher education is one area in which predictive uses of Big Data are gaining momentum. While personnel who worked most closely with student data thought their work was exceptional among what they regarded as controversial applications of data, such as predictive policing, predictive analytics in higher education is nonetheless situated within a broader landscape. For example, during my fieldwork I was chatting with the director of IT at the university in an elevator. After hearing more about my research, he immediately asked me about predictive analytics in medicine.
While medicine was his immediate reference, the university’s development of nudges was at its height during revelations about the analytics firm Cambridge Analytica using data problematically mined from Facebook users to target prospective voters. Beyond the manipulation of politics, John Cheney-Lippold (2011, 2017) has identified the reordering of people according to their social media data as a “new algorithmic identity,” in which data enable new ways to organize people, based on an affinity of clicks as opposed to traditional social categories.
Data that correspond with people are ample and are mobilized by institutions through predictive mechanisms. “Data doubles,” or data that stand in for people in systems and analytics, frequently outperform them (Haggerty and Ericson, 2000; Raley, 2013). Gavin JD Smith (2016), David Lyon (2003), and Cathy O’Neil (2016) have all pointed out the potential destructiveness of proxies and data doubles when treated as the people they represent: denied loans, extended sentences, increased insurance rates, and lost job opportunities, to name a few. Wendy Hui Kyong Chun (2018) writes that these data “absolve one of responsibility … by creating new dependencies and relations” in standing in for what is unknown or inaccessible. It matters which data become doubles, and how those data become data in the first place. This problem is evident in research and investigative reporting on predictive policing and risk assessment, in which police departments take past events as predictive of future activity (Brayne, 2017; Selbst, 2017), or where algorithmically calculated scores are meant to indicate the likelihood of recidivism (Angwin et al., 2016; Benjamin, 2019). The translation of data into predictions is not solely algorithmic; it is also wrapped up in structural inequalities and notions about what society is and ought to be (Eubanks, 2018). And so while the work data personnel in universities do is not the same as predictive policing, it occupies a similar imaginary6 in which what people will do is both possible to anticipate and open to intervention.
Relevant literature
States and institutions have long sought to account for and predict people and activities within them via data collection (Scott, 1998). Data enable auditing practices (Power, 1997; Strathern, 2000), the quantification of people (Bouk, 2015; Desrosières, 1998), and the calculation of risk (Hacking, 1990; Harcourt, 2007). Quantification, especially commensuration, is constitutive of people: quantification, as a social act, makes what it purports to represent (Espeland and Stevens, 1998, 2008). As scholars in science and technology studies (STS) have demonstrated, measuring and predicting are fraught social processes requiring investment and validation through institutional politics and infrastructures (Porter, 1995; Star, 1999; Star and Ruhleder, 1996). Processes of making data and their infrastructures are frequently subject to what Susan Leigh Star (1991) has called “deletion,” or the invisibilizing of labor in scientific work. Such deletions have been at the fore of research on scientific and technological practice and related institutions in STS, though are less present in inquiries into higher education.
Scholars addressing Big Data in educational contexts have largely explored its possibilities, testing out in-classroom technologies, courses scaled up to enroll unlimited students (such as massive open online courses, or MOOCs) (see Jones et al., 2014), and predictive uses of data collected from learning management systems (e.g. Blackboard, Canvas). As George Siemens reports in mapping the field, learning analytics is the use of data to improve learning (2013: 1382). While learning can refer to any variety of educational settings, learning analytics has expanded rapidly in higher education, where such technologies are used by universities to manage risk and understand student bodies (Wagner and Longanecker, 2016).
Such projects, which typically draw from third-party consultants or are developed in-house, have received scrutiny as critics express concerns about universities surveilling students (Harwell, 2019; see also Hope, 2016). While much of the learning analytics literature explores the significance and effectiveness of specific learning analytics initiatives, more recently scholars such as Neil Selwyn have argued that “learning analytics needs to be critiqued as much as possible,” given the potential to disparately impact students (2019: 11).
And learning analytics is critiqued. Scholars are addressing effects on student data privacy (see Ifenthaler and Schumacher, 2016; Rubel and Jones, 2016; Slade and Prinsloo, 2014; Sun et al., 2019). Juliane Jarke and Andreas Breiter (2019) discuss how education is changing with datafication, and Ben Williamson (2017, 2018, 2019) has written extensively about the implications of large-scale data collection on students, both in and outside of higher education. Other education scholars have interrogated the ethics of learning analytics (Johnson, 2014; Slade and Prinsloo, 2013) and prospects for just approaches (Shahar and Harel, 2017).
Data analytics projects like the one I discuss deploy nudges in tandem with predictive outputs to suggest to students how they can improve their graduation outcomes and grade point averages (GPA). The nudging that personnel use is aligned with Richard H Thaler and Cass R Sunstein’s outline of nudges and “choice architecture,” in which architects structure the “context in which people make decisions” to “nudge” them toward particular choices (2009: 3). Thaler and Sunstein frame nudging as “libertarian paternalism,” where people are ultimately capable of making their own choices—a nudge is a mild intervention. However, Karen Yeung argues that in contexts of Big Data, the array of data and analytics dynamically available to choice architects means that nudging is “subtle, unobtrusive, yet extraordinarily powerful” thanks to the magnitude and networks of data (2017: 119).
Some of the literature about learning analytics offers strategies for how to more productively nudge students, in which students are framed not just as consumers but also as active partners at universities who should be accountable for their own success (Fritz, 2017; see also Pascarella and Terenzini, 2005). The notion of choice architecture in learning analytics rests on conceptualizations of agency where students have unrestricted access to a full range of choices. This take on agency is in contrast to social theorizing on agency, in which actors work within and against constraints (see Bourdieu, 1980; Ortner, 2006).
Some education scholars have commented on the contradictions of deploying nudges in relation to more liberal views of the purpose of education (see Clayton and Halliday, 2017; Hartman-Caverly, 2019). Jeremy Knox et al. explore a growing trend of educational institutions integrating datafied behavioral economics approaches. They remark on the implications of “[shaping] students’ choices and decisions based on constant tracking and predicting of their behaviors, emotions and actions,” noting the potential for disparate impacts (2020: 39).
Some of the appeal of Big Data, and why, perhaps, it links up so well with the surge of behavioral economics in education, is rooted in pervasive and influential “mythologies” of data as truthful and omniscient, which critical data studies scholars have challenged, recognizing data as partial and always already political (Boyd and Crawford, 2012; Dalton and Thatcher, 2014). The promise of data is evident in institutional data mining projects that endeavor to take the place of self-reporting: data personnel understand data as more direct proxies, comprehensive and accurate, or as Rob Kitchin and Tracey P Lauriault put it, “exhaustive in scope” and “fine-grained in resolution” (2014: 2). The presumed neutrality of data enables them to seem prior to interpretation, an incredible, “raw” resource that can reveal insights about humanity (Boellstorff, 2013).
But data must be made. They do not exist as prior to processing. Lisa Gitelman and Virginia Jackson (2013: 3) write that “data need to be imagined as data to exist and function as such.” As I discuss herein, the discursive work involved in creating data is ongoing and layered; it relies on a great deal of labor and transformation. Nonetheless, data are treated by personnel as a stable, bounded entity, not unlike how the engineers in Diana Forsythe’s (2001) work regarded knowledge in programming expert systems. The ways that personnel imagine behaviors and attributes materialize as data, and in turn those data shape how personnel produce and use those categories. Technologies, as materialized discourses that reflect broader social epistemologies, naturalize and crystalize concepts (Suchman, 2007). In the case of data collection, technologies create the categories of people and activities they purport to measure, making them manageable (Foucault, 1972, 1977).
Societal discourses of data draw upon mythologies of data and so seem like a neutral means of revealing order intrinsic to society, although social theorists have demonstrated that ordering processes are a means through which actors make society (Bowker and Star, 1999; Jasanoff, 2004; Latour, 1990). In data technologies, ordering processes make the subjects of ordering ready to be taken up in a system, scaled, standardized, predicted, and nudged (Cheney-Lippold, 2011; Raley, 2013; Stark, 2018). I take the conditions of ordering in the form of discourse as a fruitful focal point to look at how data personnel as actors give shape to data: how they make sense of the institution, their social contexts, and their ideas about data are part of the data technologies they design and implement.
Methods
In this qualitative case study, I used a combination of interviewing and participant observation in university IT and IR offices in which personnel render students into data. I interviewed 30 data personnel using semi-structured techniques in interviews lasting 60–90 minutes, and I conducted follow-up interviews with five key interlocutors who worked most closely with deploying the model and constructing nudges (Bernard, 2011). These personnel primarily included data scientists, developers, and IT administrators, but also network architects and stakeholders involved in developing predictive outputs for students.
Much of the participant observation of my fieldwork took place in meetings. Meetings covered a range of topics, from monthly development updates to explanations of technical details of the predictive model to workshopping nudges to debates about what data mean. Meetings were places where multiple teams came together, data scientists painstakingly explained the mechanics of modeling or qualified results, programmers explained why they arrived at a particular form of nudging, and administrators nixed nudges and passed along institutional memories of data sources. In these spaces, personnel discursively challenge and solidify not only the technical dimensions of modeling but also the data that inform it (see Brown et al., 2017; Sandler and Thedvall, 2017). The constraints and limits personnel face become evident in such spaces, where their ideas are curbed by the top-down vision of the current university administration or where they must execute a stage of development over which they are not in total agreement owing to rapidly approaching deadlines and desire to receive the approval of stakeholders. Their institutional entanglements operate as a check on what they understand as choices available to them.
Although this article is informed by participant observation, I mostly utilize interviews in my analysis because they function as a central space for actors to map out a sociotechnical imaginary of data technologies at the university. In interviews, actors articulate their work and their visions for predictive projects so that modeling is integrated into such an imaginary, in what Sheila Jasanoff has described as “collectively held, institutionally stabilized, and publicly performed visions of desirable futures” (2015: 4; see also Jasanoff and Kim, 2009). The top-down discursive organizing of data that occurs before, during, and after modeling, especially in the context of interviewing in which personnel are asked to provide an account of modeling and nudging, is critical to the formation of an imaginary of predictive technologies. As Nick Seaver (2017: 8) has observed in his ethnographic approach to algorithms, “interviews do not extract people from the flow of everyday life, but are rather part of it.” Interviews enable personnel to imagine the concepts on which their projects hinge.
I transcribed and coded interviews and field notes in NVivo, a qualitative analysis environment. As Foucault (1972, 1977, 1978) has elucidated, discursive practices make the categories they describe, rendering them measurable, governable, and here, nudgeable. I identified where personnel defined data, explained to me or to each other what a data type meant to them, or decided which data could function as proxies for students. I focused on moments in interviews in which personnel speak, define, and sort behavior data into a fixed category (see Wood and Kroger, 2000).
This analysis illuminates personnel’s implicit and explicit delineations about what data are and what they represent, along with how personnel thought data ought to be classified (Strauss, 2005). Interviewing, transcribing, and coding all helped me to make sense of the conceptual work involved in handling institutional data. Because I began my fieldwork well after modeling began, interviews helped me to reconstruct narratives of decision-making about data sources, modeling, and nudging.
I have structured my findings to reflect a chronology of data processing. However, because some of this work occurs simultaneously, I also conceptually order findings, layering them on top of a foundational concern with demographic data and an imperative to nudge.
I begin with the problem of removing demographic data from modeling, which prompted personnel to think about data in terms of types (i.e. attributes and behaviors). I then explore the work of sorting the available data at the institution into a category of behaviors and assigning proxies. By maintaining data as accurate proxies, personnel help behavior data begin to hold together. Finally, the solidification of a category of behaviors enables personnel to nudge students. I conclude by discussing the implications of making institutional data.
Typing data into “attributes” and “behaviors”
The typing of data was the result of conscious attempts from data personnel to nudge students not only effectively but also fairly. In one of my first interviews with Don, I sat in his office, across from him again over his cluttered desk, and asked him to recount some of the early decision-making in model development. He had been involved in the original design of the model and determining which data should be incorporated into it. Don summed up one of the key decisions regarding data:

And what we found initially was that all the standard things that you would guess correlate with student success that students can’t change were the big drivers: race, gender, ethnicity, socioeconomic status, what high school they came from, certain kinds of grades, whatever. Well, students can’t do anything about any of that. So, the idea was to take a look and see, well, is there other stuff that seems to correlate.
When data personnel explain what goes into the predictive model, they divide the data neatly into two major categories: “attributes” and “behaviors.” They define attributes as fixed categories, the “standard things” that students “can’t change.” These categories are made up of demographic data, where data on parental income and high school ZIP code are indicators of socioeconomic status. Universities collect data on race, ethnicity, and gender in standardized forms, whether through college applications or through reporting mechanisms in university systems. Data personnel treat attributes as outside of the model’s purview because while they correlate with graduating in four years, they are not actionable. For example, personnel noted that a student cannot retroactively attend a different high school. Moreover, while a student could transition while in college and change gender markers in university systems or might experience socioeconomic mobility, data personnel would not construct nudges to instruct them to do so. Therefore, data personnel regard attributes as off limits in making recommendations, and they were quick to assure me that they would never do such a thing.
By drawing boundaries around attributes, data personnel attempt to seal them off and open up other types of data for usage. The discursive and computational effects of relegating some data as attributes are that those data, and the students who provide them, become stable entities. That is, by treating demographic data as attributes that are frozen—everything a student “can’t change”—personnel remove those data from an ongoing conversation about what they can use in the model. The differing experiences students have on campus that interlock with their race and socioeconomic status, for example, are no longer part of data projects because personnel define them as fixed. Computationally, when some data become attributes, data scientists no longer include them in the predictive model: demographic data, characterized as attributes, do not factor into calculating the likelihood of graduation in four years.
The effect of framing some data as attributes that are off limits for nudging is not that they are permanent, but instead that personnel cannot nudge students to change them. Will, an administrator who helped to develop the model, explained to me that the sidelining of attributes prompted data personnel to look for other factors involved in success at the university:

So, since we’re pulling in all this data at the same time and dropping it into the algorithm, obviously there are a number of things that are highly predictive of student success on campus. Their academic preparation before they come into campus. Their GPA while they’re at [the university] obviously is highly predictive. Socioeconomic status things. Demographic markers. But they’re all things that either because it’s too late in the game, we can’t tell a student, “Boy, it would have been great if you would have studied harder in high school.” And we certainly can’t tell a student on a demographic or socioeconomic thing, we can’t say, “Hey, it’d be good if you weren’t so poor.” There’s nothing a student can do with that. Even though it does put ‘em in a higher risk category. So we took those things that were malleable by the students. Things like, how much time they were spending on campus. Whether they were a proxy for whether we believed they were paying attention in class by how much data they were downloading in a class.
Malleability, however, is not a given: it has to be translated into a behavior. Will refers to data downloaded in class as a potential proxy for paying attention, where the data on downloading are available for data personnel to match up with a behavior. The question is not whether downloading data is a proxy but rather whether it is a reasonable proxy for paying attention. The data available for modeling predate the model itself. Data on students’ downloading habits in class were originally collected for the maintenance of network infrastructures, but personnel have repurposed them as a proxy for attention. Data on downloading were not always behavior data.
Massive amounts of data are available to data personnel, and it is not self-evident what is a proxy for what, nor was it apparent to me if a hard line between attributes and behaviors existed for personnel. I asked Don how data personnel went about distinguishing between attributes and behaviors. He first depicted behaviors as what was left after attributes were removed:

We tried to be blind to all of those, and only look at behaviors. Only look at different numbers we had that were indicators of behavior. Behavior could be grades you made in your prior classes, here at [the university]. It could be how many credit hours you’re taking, it could be where you’re living, it could be, any of these things that you have control over, we’ve just clumped them all into the behaviors bin. I guess that we assume that what [students] did in the course of the day, they had control over. Right, so they chose whether they were gonna eat or not … they chose the gym or not, being on campus or not … They chose living where they chose to live. I think they have some say in that … So it seemed to me that any time that they had an opportunity to make a decision about what they were going to be doing, we called that a behavior.
The focus on behaviors defines students both as radical agents and as nudgeable subjects who would benefit from behavioral recommendations based on their data, which include spending more time on campus, attending class, attending supplemental instruction sessions, and registering for courses earlier. Predictive outputs are meant to be engaged with, not just observed.
The implication that students can act on predictions and improve their prospects for success is contested among data personnel at the university. In general, personnel, particularly those who worked closely with students, wanted nudges to have an encouraging vibe that motivated students to act on nudges and incorporate behavioral changes into their daily lives. However, data scientists in team meetings cautioned against giving students false hope. They argued that the likelihoods they modeled are accurate enough that spending more time on campus would not impact the predicted outcome enough to make a substantial difference in the space of a semester. While personnel are not in agreement about the potential effects of nudging and some are torn about its utility, the notion remains that students are responsible for their outcomes.
Sorting data: Assigning data to categories
Data do not automatically fall into categories of attributes and behaviors; rather, they are assigned and are products of discursive moves. As I discovered, the types of data that comprise a student body are multiple, as are their uses and sources. They serve several institutional offices simultaneously and outlive the original intentions behind them.
As a way to demonstrate the array of data and the possibilities for sorting them, I arrange types of data in loose sets in Table 1. The data I include in the table are general categories that I have derived from interviews, documentation from an external review, and administrators’ conference presentations about the model. The table lists data that the model does not incorporate, such as demographic data, but I add such data to show how the kinds of data that data scientists have de-siloed are put in conversation with other data sources.
Table 1. Arrangements of data incorporated into initial data mining and predictive modeling.

ZIP: zone improvement plan; GPA: grade point average.
The assignment of meaning to data in the model, while not arbitrary, is not strictly linked to data sources. That is, the data could align with other interpretations and proxies, and personnel indeed mobilize them for purposes other than their initial use. In my table, I have created three columns and labels to reframe data in terms of how the institution makes and collects them, rather than in terms of what personnel offer up as attributes and behaviors or as large, unsorted lists of variables decontextualized from their sources. I use “self-reported” to describe data that students provide to the institution, typically through the college application process or in campus systems. The data I organize in “infrastructural” data are data created through the everyday operations of the university. Finally, I use “accumulated” as a group for data that students generate as they move through the university in terms of enrollment, grades, and coursework.7 The table is not exhaustive; rather, I aim to depict that data at the university are multiple and extensive.
The primary data that personnel position as indicative of behavior are network logs, which personnel use because they describe them as the best available proxy for behavior. For this, data personnel have repurposed data originally collected by IT to monitor the health and usage of campus WiFi networks. Network logs contain data about time, date, and duration of a student’s use of the WiFi network, along with which routers they connect to and some general information about browsing activity. Because students must log in to the WiFi network using unique accounts administered by the university, they are associated with their WiFi use.
In an office similar to Don’s but in the IR office, I asked Jenny, an administrator involved with data governance at the university, how she and other personnel decided to use network logs as a proxy for attendance. She explained how they came to use network data:

And how we ended up on network logs, you know, it’s just having the right people that are thinking, you know, probably someone picked up their phone and was like, “Hey, I just connected to the WiFi, right.” It’s like, oh, yeah! The WiFi, right. If we want to make the model better, what kind of data, when you think about behaviors, would you want to include, and then you just start thinking, how might you get that data.
Will formulated the leap from attendance to networks differently, recalling his early involvement in institutional data collection. He talked about moving from surveys to Big Data; to him, surveys were a problematic proxy for campus engagement:

And you give [a survey] to the students at the end of the year and…it would measure how integrated you were to the campus and what your commitment was to it. Or we’d use the NSSE survey, which is the National Survey of Student Engagement, where you’d say, like, “Over the last semester, on average, how many hours a week did you study? How many hours a week did you meet with professors outside of class? How many hours a week did you meet with your peers outside of class?” Those kinds of things. And these were relying on self-report surveys, often after the fact, to measure that level of engagement and integration. And what we provided, in the [model], was nope, here’s an actual behavioral marker where we can truly see how much time a student spent on campus.
The allure of Big Data as a replacement for surveys is that the interpretation is invisible, so smoothly deployed that the data appear to speak for themselves.
Maintaining data as accurate proxies
The assignment of behaviors to data, and vice versa, requires investment. While I conducted my fieldwork, I had access to the campus WiFi network and, through an interface with the predictive model, could see a visualization of my own network logs. I kept a personal account of my campus whereabouts and compared it against the network logs. I consistently found chunks of missing time, incorrect geolocations, and, overall, an inaccurate picture of my time on campus.
I brought the disparity in the data up with data personnel, who were either intrigued or unsurprised, depending on their proximity to work with the data. Some even joked with me about how their own network logs made it look like they were never at work. Personnel know that network logs are not a neat substitute for the time a student spends on campus. Network outages aside, competition for network access and the brevity of a connection might prevent a student’s device from registering. Moreover, connecting in the first place requires a student to have a WiFi-enabled device. Missing data abound, for many reasons.
Nonetheless, personnel at the university maintain network logs as an inventive and actual proxy for behavior. I asked Henry, a data scientist working on the predictive model, about the absent data, citing my own network logs, and he explained that personnel have to move forward without those data:

If stuff’s missing, I mean, there’s nothing you can do about it. You just have to hope, and in most cases this is the case, that there is a uniformity to it. So either the whole day is missing, that’ll sometimes happen. That’s fine. Because there’s enough of the data to pick up the slack there … Most of the variables that I’ve made deal with that elegantly … If I just don’t have a certain amount of the data, I don’t say that there was a class session at all. I just say there wasn’t one. So, for a percentage of absences, like, it’s not going to affect it at all. Otherwise, for missing data, the hope is that it’s sufficiently random that for any machine learning purpose, it will not matter that it is missing. Because anything that is sufficiently small and random won’t have an impact on the prediction. That may or may not be true, but it’s an assumption that we have to make because we don’t have a lot of choice.
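The rule Henry describes, excluding class sessions with too little underlying data rather than counting them as absences, can be sketched as follows. The coverage threshold and record format are my assumptions for illustration; the point is only the logic of dropping a session as if “there wasn’t one.”

```python
# Sketch of the missing-data rule Henry describes: sessions with too
# little underlying log data are excluded from the absence rate rather
# than counted as absences. Threshold and fields are hypothetical.
COVERAGE_THRESHOLD = 0.5  # assumed minimum fraction of log data present

def absence_rate(sessions):
    """Compute the fraction of absences over sessions with usable data.

    Each session is a dict like {"coverage": 0.9, "present": True}.
    Sessions below the coverage threshold are dropped entirely,
    as if no class session occurred.
    """
    usable = [s for s in sessions if s["coverage"] >= COVERAGE_THRESHOLD]
    if not usable:
        return None  # no usable sessions at all: no rate can be computed
    absences = sum(1 for s in usable if not s["present"])
    return absences / len(usable)
```

Note how the sketch embodies the assumption Henry names: dropping low-coverage sessions only leaves the rate unbiased if the missingness is sufficiently random, which, as he concedes, “may or may not be true.”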
Nudging on “behaviors” and promoting self-regulation among students
The discursive separation of attributes and behaviors allows for personnel to treat behaviors as actionable, and it enables a more data-driven form of institutional management of students. By making something called behavior and making it legible through data proxies that minimize gaps between data and what they purport to measure, the institution can track students and monitor them. Ultimately, the university uses these data to formulate a narrative of success that maps onto behavior. Following such moves, the institution can relocate where the possibilities for success reside.
Through the assignment of behaviors to data, the category of behaviors holds together, and its formation allows students to become subjects of modeling and nudging. By reconceptualizing network logs and activity as behavior, personnel effectively produce behavior that they can nudge. Before the deployment of the predictive model, the university could not nudge students on the basis of what were primarily demographic data; now, via behaviors, it can.
In this case, the institution mobilizes behaviors as a means to encourage students to self-regulate and make responsible choices that move them toward graduation. Owing to the discursive maneuvering enacted by personnel, data classified as behaviors become directly tied to students and their activities, as illustrated by Will’s description of data as “an actual behavioral marker.” Akin to how Wendy Nelson Espeland and Mitchell Stevens (1998) have discussed quantification as similar to a speech act, data as a metric for behavior become student behaviors.
Correlations, even as personnel insist they are just that, seem causal in nudges because of the implication that students could improve their likelihood of success by adhering to particular behaviors. The model of a student presented in nudging is one who attends all classes, spends time on campus outside of class, engages with student organizations, does not browse the internet in class, visits office hours, and so on. If students want to succeed, they should match the model and attune their choices to the behaviors that correlate with success. In an automated manner, the model and related nudging reflect student data back onto students, signaling that they are continuously documented and brought into a system of recommendations.
Because attributes are removed from the model and nudging, and demographic data are not incorporated into the predictive model at all, success becomes linked with behaviors; the reliance on behaviors suggests that students’ choices lie at the heart of their success at the institution. The purposeful presentation of data to students encourages them to internalize those data and act on them. As such, responsibility now rests on students to take hold of their own success. The visualization of certain kinds of data, namely the data students ought to use to inform their everyday decision-making, and the obscuring of demographic data place the burden of responsibility and success on students. By minimizing the role that race, class, and gender play in graduation outcomes, the institution, through the model, can present behaviors as the major factors in the likelihood of a student graduating within four years. If students do not attend class, a low GPA is a consequence of that decision.
Thus, the constraints around choices become invisible. The university and its existing inequalities start to vanish because success is placed in the hands of students. Social climate problems, structural barriers, issues of belongingness, and resource shortages disappear. A student cannot cite external factors in this model of success dominated by behaviors. The result is a shift in the locus of responsibility, wherein nudging is meant to give students tools to manage themselves and regulate their own behavior based on insights they ought to draw from their data.
Conclusion
One of the aims of this article is to demonstrate how data become behaviors, based on a collective decision to avoid attributes in modeling and nudging, those fixed demographic markers that “can’t change.” The remaining data are then sorted by personnel into behaviors based on what is available and fits within a mold of what personnel understand students to “have control over.” Personnel stabilize data as accurate proxies by explaining and accounting for inconsistencies. Data further solidify as personnel make them act through nudges that direct students to match behaviors correlated with success. The discursive separation of behaviors and attributes allows the university to pin success on behaviors, or everything that a student appears to have a choice in. At a time when the use of demographic data in higher education is ever fraught, behavior data seem full of promise to universities as a step toward the meritocratic ideals of higher education. However, as I have shown, behavior data are not a neutral alternative, and what students “have control over” is not self-evident.
As I sat in his noisy, shared office, Nick, one of the data scientists, responded to my question about where he thought predictive modeling in higher education was headed with an answer about what he felt was more pressing: he thought that nudging had a long way to go, and that the information in nudging needed to be more useful to students.

My theoretical framework is a very traditional economical perspective which could be way wrong … but it doesn’t mean that it’s not useful. The theory is that every person chooses optimum behavior for him or herself, based on the constraints, his ability, his resources, his information. He cannot do anything about the first and second, but he maybe can do something about the third.
The use of behavior data to nudge students and inspire a regime of self-regulation prompts questions about behavior data not only as a more accurate substitute for demographic data but also as a source of knowledge about students. As this article demonstrates, making behavior data is a contingent process. Data proxies and data doubles do not acquire form naturally. They must be made, and they are made by actors under institutional constraints and imaginaries alike.
The other aim of this article is to identify processes of making and sorting data as problem spaces. If data, students, and behaviors are imprecisely matched and subject to institutional pressures and sociotechnical limitations, how should those data be understood, especially in the context of nudging where actors put them into play? To refer back to the emergence of what Cheney-Lippold (2011, 2017) has called the “new algorithmic identity,” in which what people do is more representative of who they are than more traditional concepts of social categories, what happens when behavior data are just as fraught as demographic data?
As predictive analytics become more widespread in areas beyond education, such as policing and sentencing, finance, healthcare and medicine, and social media, it is necessary to illuminate the data that underpin predictions, not just technically but sociotechnically. The everyday, frequently mundane processes that make data, proxies, and data doubles hold together are subject to deletion, which allows for the linkages between data and people to appear seamless. They are not. Understanding how those data come together and how they stabilize is a continuing task for critical data studies, and one that I explore in the context of higher education.
