Introduction
Founded in 2004, Facebook has become the most popular social networking site in the world. As of today, it has on average 1.45 billion daily active users and over 2.20 billion monthly active users (Facebook, 2018a). It has been reported that around three-quarters of Internet users have a Facebook account, and 7 in 10 of them access the platform daily (Duggan, 2015). The platform allows users to post and share content such as status updates, photos, links, and videos. As a result of its sheer online ubiquity, Facebook has served as a new platform on which innumerable social interactions are performed every day. This provides researchers with an unprecedented opportunity to study society and social phenomena. Facebook research has proliferated across academic disciplines in recent years, ranging from sociology, politics, law, economics, and psychology to informatics, marketing, and communication studies (Wilson et al., 2012).
Like most social networking sites, Facebook provides an application programming interface (API) to third-party developers with the aim of fostering application development and integration with other services. Facebook's Graph API is the primary way to extract data from and upload data to Facebook (Facebook, 2018c). It is a HyperText Transfer Protocol (HTTP)-based API that programs can use to query data, post new content, manage ads, and perform a wide range of other tasks (Müller and Thiesing, 2011). Researchers can also use the API to obtain data, although limitations on the type, quantity, and frequency of data extraction are often in place. In contrast to Twitter, which has become well known for its open data policy, Facebook has been comparatively restrictive in terms of what data can be extracted. Furthermore, Facebook retains the right to change or terminate its data interfaces, which can lead to substantial problems for academic researchers (Rieder, 2013; Rieder et al., 2015). For instance, in reaction to the Cambridge Analytica controversy in March 2018, Facebook announced a plan to substantially tighten access restrictions on its API; this led Axel Bruns, president of the Association of Internet Researchers, and colleagues to issue a public response voicing their concerns about the collateral damage to academic research (Bruns et al., 2018; Schroepfer, 2018). After a series of changes, most user data, including user posts and friend lists, were no longer retrievable through the Graph API, although page data remained retrievable with the Page Public Content Access permission. However, in August 2019, Facebook removed the Page Public Content Access permission for Netvizz and other data extraction applications, thereby blocking all research access via the Graph API (Rieder, 2018).
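To illustrate what such an HTTP query looked like, the sketch below builds a request URL for a page's posts feed. This is a minimal reconstruction of the pre-2018 endpoint shape, not current practice; the page ID, token, and field list are placeholders, and such requests now require permissions that have since been revoked for most applications.

```python
from urllib.parse import urlencode

GRAPH = "https://graph.facebook.com/v3.0"

def page_posts_url(page_id, access_token, since, until, limit=100):
    """Build a Graph API request URL for a page's posts feed
    (endpoint shape as of the pre-2018 API)."""
    params = {
        "fields": "id,created_time,message,type,shares",
        "since": since,
        "until": until,
        "limit": limit,
        "access_token": access_token,
    }
    return f"{GRAPH}/{page_id}/posts?" + urlencode(params)

# Placeholder page ID and token, for illustration only
url = page_posts_url("SNP", "TOKEN", "2016-01-01", "2016-12-31")
print(url)
```

A GET request to such a URL returned a JSON page of posts plus a paging cursor for fetching the next batch, which is how tools such as Netvizz iterated over a page's history.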
In fact, the tightening of access began long before the Cambridge Analytica scandal: since v2.11 (released on 7 November 2017), Facebook has limited the number of posts retrievable through the Graph API to 600 per year for any page. While the documentation states that "the API will return a maximum of 600 ranked, published posts per year" (Facebook, 2017), how these posts are selected or ranked is currently unknown. To the best of the author's knowledge, owing to the novelty of the restriction, there is currently no study of the effect it may have on the data. However, drawing insight from similar problems, for example the rate limitation of the Twitter Streaming API (Morstatter et al., 2014, 2013), it is reasonable to suspect that the restriction could bias the data returned through the API for any page that produces more than 600 posts per year. This paper seeks to assess the influence of the new limitation on the measures and metrics performed on page data extracted through the new API.
This paper begins the analysis by employing statistical measures commonly used to compare multiple datasets. It first investigates the difference in the proportion of post types extracted by the old and the new API. It then compares key social media metrics, including Likes, Shares, and Comments, of the full and partial datasets. Drawing on the bootstrap resampling technique, this paper also assesses what those metrics would look like for randomly drawn datasets and compares them against those of the partial dataset. Adapting the method employed by Morstatter et al. (2013) to detect biases in Twitter's Streaming API, this paper compares the top terms obtained from the old and new API using the Kendall rank correlation coefficient. It then attempts to reverse engineer the ranking algorithm of the new API with logistic regression and identifies the predictors of the odds of a post being selected. After that, Sentiment Analysis is used to investigate the difference in sentiment between the selected and non-selected posts. Finally, this paper attempts to replicate the findings with data from another Facebook page.
Related work
Related prior work from three different perspectives is discussed in this section: work with Facebook’s Graph API, bias in black box systems, and bias in social media data.
Work with Facebook’s Graph API
Facebook’s Graph API has been used throughout the domains of social sciences and informatics to gain understanding of how users behave on the platform. Popular research domains that often use the Facebook API include social networks (Hogan, 2008; Spiliotopoulos et al., 2014), public opinion and election campaigns (Caton et al., 2015; Gulati and Williams, 2013; Larsson, 2015, 2016), political activism and participation (Chan et al., 2016; KhosraviNik and Zia, 2015; Langlois et al., 2009; Lee et al., 2015; Tang and Lee, 2013), diffusion of information and misinformation (Bakshy et al., 2015; Bessi et al., 2016; Chan and Fu, 2017; Del Vicario et al., 2016, 2017; Sun et al., 2009), online debate (Van Es et al., 2014), and so on. There is a vast number of studies conducted using Facebook as the data source; this review can only provide a cursory glance at the wealth of literature. A more comprehensive review can be found in Wilson et al. (2012).
Apart from studies that directly obtained data through the API, there are also studies that rely on third-party applications to obtain data from Facebook. Netvizz, for example, is a data collection and extraction application that uses Facebook’s Graph API to extract data (Rieder, 2013). Studies that use these applications will also be affected by the bias of Facebook’s Graph API and therefore relevant to this work.
Bias in black box systems
This paper is related to the topic of assessing the results from black box systems. In particular, studies that assess Twitter’s APIs are especially relevant. Choudhury et al. (2010) analyse the effect different sampling methods have on the way link propagation is perceived. Morstatter et al. (2013) focus on the sample bias from Twitter’s APIs. Their work compares four commonly studied aspects of the Streaming API and Firehose data. Using correlation, they first investigate the bias between the top hashtags in the two datasets and reveal bias in their occurrence. The authors also compare the aspects of topic models, networks, and geographical locations. They find various discrepancies between the datasets in the topics extracted through latent Dirichlet allocation (Blei et al., 2003) and in the authors extracted from the User × User retweet network. No bias is found in the number of geotagged tweets. Adopting the method proposed in their previous work, Morstatter et al. (2014) propose a method to detect sample bias without the need for the “gold standard” Firehose data, which is often considered too costly for most researchers. By comparing with the Sample API, the authors are able to identify time periods in the Streaming API data where the trend of a hashtag is biased. Lastly, they also study the stability of the data when the queries originate from different geographic areas and begin at different times.
Similarly, Driscoll and Walker (2014) compare Twitter’s Streaming API with the Firehose provided by Gnip PowerTrack, one of the partner firms authorised to resell Twitter data. They suggest that while the Streaming API offers a good choice for longitudinal data collection, it performs poorly for huge, short-term events. By contrast, PowerTrack can return very large collections of tweets posted within short periods but entails an extremely high cost in the long term. Tromble et al. (2017) also compare data from the Firehose, Streaming, and Search APIs. Using Kendall’s tau and logistic regression analyses, they identify the user and content features that make a tweet more or less likely to be returned by the APIs.
Bias in social media data
The potential bias caused by the data generation and extraction processes of social networking sites is well studied in the fields of social sciences and informatics. A significant number of studies warn about the limits and pitfalls of using social media data in general (Giglietto et al., 2012; Hargittai, 2018; Lomborg and Bechmann, 2014; Plantin et al., 2018; Tufekci, 2013, 2014). Hargittai (2018) discusses the representativeness of social media and points out the oversampling of people from privileged backgrounds. Situating the problem in a wider context, Plantin et al. (2018) warn about the profit-motivated and proprietary nature of Facebook. The authors suggest that the creation of APIs has turned the open web consisting of URIs and repeatable HTTP transactions into “walled gardens”, where Facebook has the final control and communications are filtered by a profit-extracting sieve.
Concerning work specifically about Facebook’s API, Müller and Thiesing (2011) provide a technical overview of the design and possible uses of the API, but there is little critical assessment. Lomborg and Bechmann (2014) suggest that APIs have an in-built bias toward the most active content contributors. Bodle (2011) provides a critical assessment of Facebook’s APIs, but the paper’s focus is on their threat to privacy, data security, transparency, and user autonomy, as well as how Facebook achieves market dominance and user dependency using APIs. Rieder et al. (2015) offer a critical examination of Facebook’s Graph API and identify the potential of, and concerns with, social studies using page data obtained through the API. The authors warn that social media data are not produced by methodological devices designed by researchers, but by technical interfaces developed by the platform provider. They identify the issues of data detail, completeness, consistency, and architectural complexity, and suggest that Facebook’s Graph API does not always provide complete access to all data and that the data should therefore be used with prudence and critical attentiveness.
While these works have rightly pointed out the potential bias in Facebook’s Graph API, they are not able to identify the specific direction and magnitude of the bias. Furthermore, they are unable to address the specific effects introduced by the recent changes to the API. To the best of the author’s knowledge, there is no existing study on the effect of the new 600-post-per-year limitation introduced by Facebook in November 2017. This work fills that gap by investigating the potential bias with statistical measures.
The data
The Scottish National Party’s official Facebook page was selected for the analysis. It was selected for practical reasons: as a result of the changes to Facebook’s Graph API, there is, at the time of writing, no publicly available means to obtain the full data of any page without violating Facebook’s Terms of Service. Only data collected before the introduction of the limitation can be used. The page data used in this work was collected previously for the author’s other research. The author recognises the limitation of using only one case, and a replication is therefore conducted using data from another page towards the end of the paper.
This work involves two rounds of data collection. First, on 7 May 2017, a request was sent to search all posts dated between 1 January 2016 and 31 December 2016 on the page using Netvizz v1.42 (Rieder, 2013), a Facebook application that extracts page or group data through Facebook’s Graph API. A total of 1031 posts were found and extracted. Second, after the introduction of the rate limitation, all posts from the same time period were searched again with exactly the same parameters and extracted with Netvizz v1.45 through the new API on 6 April 2018. The search returned a total of 598 posts, an amount very close to the limit of 600 per year suggested by Facebook. The discrepancy between the two searches suggests the limitation was in place. It is worth noting that posts could have been deleted by the page owner between the time the first dataset was collected and the second; this could also contribute to the discrepancy between the two datasets. However, we cannot know for sure without confirmation from Facebook or the page owner.
The data collection produced two datasets: a full dataset consisting of all the posts on the Facebook page (referred to as the “full dataset” hereafter) and a partial dataset consisting of approximately 600 posts selected by the new API (referred to as the “partial dataset” hereafter). One of the key questions asked in this work is how those 600 posts were selected. This will be discussed in the next section.
Statistical measures
This paper compares the statistical properties of the two datasets with the aim of assessing how well the characteristics of the partial dataset match those of the full dataset. This section begins by comparing the proportion of post type, then continues to compare the key social media metrics. Finally, this paper compares the top terms in the posts using rank correlation statistics.
Post type
Table 1 reports the post types and their counts in the datasets. With only one unbiased view from the full data, it is difficult to tell whether the discrepancy is significant or just a small deviation from a random sample. In order to assess the randomness of the partial dataset, this paper employs the bootstrap re-sampling technique to create 1000 sample datasets (Efron, 1982). First, posts were selected from the full dataset uniformly at random (without replacement) until the size of the partial dataset (598 posts) was reached. This process was repeated 1000 times, yielding a list of 1000 datasets, each consisting of 598 posts drawn randomly from the full dataset. The statistical measures of these bootstrapped datasets represent how a dataset would perform if it were selected at random. By taking the mean and standard deviation of the metrics across the bootstrapped datasets, confidence intervals can be obtained. In the following figures, confidence intervals of two and three standard deviations are plotted. Due to the nature of the bootstrapping method, observations falling beyond these confidence intervals are statistically significant at confidence levels of 95% and 99.7%, respectively.
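The bootstrap procedure described above can be sketched as follows. This is a minimal illustration rather than the study's actual code: the post-type distribution in the toy example is made up, and only the sample size (598) and number of replications (1000) follow the paper.

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_counts(full_types, sample_size, n_reps=1000):
    """Draw n_reps random samples (without replacement) of sample_size
    posts and return the mean and SD of each post type's count."""
    full_types = np.asarray(full_types)
    types = np.unique(full_types)
    counts = {t: [] for t in types}
    for _ in range(n_reps):
        sample = rng.choice(full_types, size=sample_size, replace=False)
        for t in types:
            counts[t].append(int((sample == t).sum()))
    return {t: (float(np.mean(c)), float(np.std(c))) for t, c in counts.items()}

# Toy stand-in for the full dataset: 1026 posts with a made-up type mix
full = ["Link"] * 500 + ["Photo"] * 330 + ["Video"] * 196
stats = bootstrap_counts(full, sample_size=598)
for t, (mean, sd) in stats.items():
    print(f"{t}: mean={mean:.1f}, 99.7% interval=({mean - 3*sd:.1f}, {mean + 3*sd:.1f})")
```

An observed count from the partial dataset that falls outside the mean ± 3 SD interval would, as in the paper, be treated as a statistically significant deviation from random selection.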
Table 1. Post count by type.
Due to the rarity of Event, Note, and Status posts, analysis of these post types is not possible. Therefore, five posts of these types were dropped from the full dataset and two from the partial dataset. This paper focuses on the post types Link, Photo, and Video. Figure 1 shows the density of the counts of Photo, Video, and Link posts. A mean value of 190.62 for Photo posts is observed. Taking confidence intervals of three standard deviations, there is a 99.7% chance that the count of Photo posts in a dataset drawn at random would lie between 167.70 and 213.55. However, the partial dataset contains 301 Photo posts. This could be indicative of a filtering process in Facebook’s new API that causes an over-representation of Photo posts in the data.

Figure 1. Count of different post types.
A similar pattern can be observed for Video posts. A mean value of 113.43 is observed. Confidence intervals of three standard deviations suggest that the count of Video posts would lie between 95.01 and 131.84 if the posts were selected at random. A count of 149 is observed in the partial dataset, which is higher than the upper bound of the confidence interval. The analysis thus also indicates an over-representation of Video posts.
Figures for Link posts show the opposite trend. A mean of 291.03 posts is observed. The upper bound of the confidence interval is 314.98, while the lower bound is 267.09. However, the count of Link posts in the partial dataset is 146, far below the lower bound; the result suggests that Link posts are under-represented in the partial dataset.
Bootstrap re-sampling suggests that the distribution of post types in the partial dataset deviates in a statistically significant way from that of randomly drawn datasets. The new Graph API appears to over-represent Photo and Video posts, while Link posts are under-represented.
User engagement metrics
Likes, Shares, and Comments are important metrics of user engagement on Facebook and important means for users to interact with one another on the platform. Even though Facebook’s News Feed is often described as a black box system, these metrics are known to influence whether a post is fed to another user (Facebook, 2018b). Therefore, it is reasonable to expect that they may also affect the chance of a post being selected by the new API.
Table 2 shows the mean values of the metrics observed in the full and partial datasets. Using the same bootstrap re-sampling method introduced in the previous section, Figure 2 shows the mean counts of Likes, Comments, and Shares across the bootstrap samples. For Likes, we can observe a mean value of 866.77. The confidence intervals of three standard deviations lie between 735.38 and 998.55, and those of two standard deviations between 779.24 and 954.69. However, a mean value of 983.18 is observed in the partial dataset. In other words, there is a statistically significant difference between the mean Likes of the bootstrapped samples and the observed value at the 95% confidence level, but not at 99.7%.
Table 2. Mean value of user engagement metrics.

Figure 2. User engagement metrics.
The mean count of Comments in the bootstrap samples is 401.60, and the upper bound of the two-standard-deviation confidence interval is 426.24, while the mean Comments count of the partial dataset is 426.24. Again, a statistically significant difference can be observed at the 95% confidence level.
For Shares, the mean count of the random samples is 332.02. The confidence intervals of three standard deviations lie between 213.48 and 450.56. A mean value of 470.91 is observed in the partial dataset. In other words, there is a statistically significant difference between the mean Shares of the bootstrap samples and the observed value at the 99.7% confidence level.
The above analysis provides evidence of biased representation of Likes, Comments, and Shares, with Shares showing the strongest evidence of the three. The result is in line with the speculation that Facebook’s new Graph API is biased towards posts with high user engagement. However, as later sections will show, the over-representation of high-engagement posts might be an indirect result of other features.
Top-term analysis
This section focuses on the terms of the Facebook posts, including the text of status updates and the captions of photos, videos, and link shares. To provide a glimpse of the top terms in both datasets, Figure 3 includes word clouds of the top terms from each dataset. Following the method used by Morstatter et al. (2013), this work compares the top terms in the full dataset, the partial dataset, and the bootstrap sample datasets using the Kendall rank correlation coefficient.

Figure 3. Word cloud of top terms from each dataset: (a) full data and (b) partial data.
The Kendall rank correlation coefficient (τ) measures the ordinal association between two rankings, ranging from −1 (complete disagreement) to 1 (complete agreement).
The bootstrap datasets were first compared with the full dataset. This process yielded 1000 estimates at each step. The mean and standard deviation of these estimates were calculated, from which confidence intervals of three standard deviations were approximated. After that, the partial dataset was compared with the full dataset and its τ values were computed.
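A comparison of this kind can be sketched with `scipy.stats.kendalltau`. The tokenisation and the handling of terms missing from one list (placing them at the bottom rank) are the author's illustrative assumptions, not necessarily the paper's exact preprocessing.

```python
from collections import Counter
from scipy.stats import kendalltau

def top_terms(texts, k):
    """Return the k most frequent whitespace-delimited terms."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return [w for w, _ in counts.most_common(k)]

def tau_at_depth(full_texts, partial_texts, k):
    """Kendall's tau between the top-k term rankings of two corpora.
    Terms missing from one list are assigned the bottom rank k."""
    full_top = top_terms(full_texts, k)
    part_top = top_terms(partial_texts, k)
    vocab = list(dict.fromkeys(full_top + part_top))
    rank_f = {w: i for i, w in enumerate(full_top)}
    rank_p = {w: i for i, w in enumerate(part_top)}
    xs = [rank_f.get(w, k) for w in vocab]
    ys = [rank_p.get(w, k) for w in vocab]
    tau, _ = kendalltau(xs, ys)
    return tau

# Illustrative corpora standing in for the post texts
full_texts = ["snp wins the vote", "vote snp for scotland", "snp snp"]
partial_texts = ["snp wins the vote", "snp for scotland"]
print(tau_at_depth(full_texts, partial_texts, 5))
```

Repeating `tau_at_depth` for each bootstrap sample at increasing depths k yields the distribution against which the partial dataset's τ values are compared.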
Figure 4 shows the values of τ at each step, together with the confidence intervals derived from the bootstrap samples.

Figure 4. Kendall’s tau of top terms.
It is observed that most
Reverse engineering the ranking algorithm
To identify what features determine the ranking of the posts, this work seeks to reverse engineer the ranking algorithm with logistic regression. Logistic regression is well suited to describing relationships between a binary outcome variable and one or more predictor variables (Peng et al., 2002). One key advantage of logistic regression is that the coefficients, or weights, of the model are interpretable (Molnar, 2019). In order to decipher the ranking algorithm, the interpretability of a model’s decisions is important; a well-performing but uninterpretable predictive model would have only limited use for our purpose.
The full dataset is compared with the partial dataset to check whether each post exists in both. A new dummy variable is created to represent the ground truth of whether a post was selected by Facebook’s Graph API (selected = 1, non-selected = 0). On top of the features discussed above, the full dataset also includes data on Facebook’s new reactions (Love, Haha, Wow, Sad, Angry); to deal with the fact that the new reactions were only introduced in February 2016, a total of 57 posts published before the introduction date were removed (Facebook, 2016). Table 3 summarises the descriptive statistics of the reduced dataset.
Table 3. Descriptive statistics.
To address the skewed distributions and the non-linear relationship between the dependent and independent variables, a log2 transformation was performed on all the continuous variables, namely Comments, Comment.Likes, Shares, Likes, Love, Wow, Haha, Sad, and Angry. Due to the log2 transformation, each estimated coefficient should be interpreted as the change in log odds associated with every two-fold increase in the predictor.
To predict the log odds of being selected by the API, a baseline model including all the features analysed in the “Statistical measures” section (post type, Likes, Comments, and Shares) was estimated.
The full model includes all the variables of the baseline model and also incorporates the new reactions and Likes on Comments.
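A model of the baseline kind can be sketched in a self-contained way as follows. This is not the paper's estimation code: the data are simulated with made-up coefficients, only Likes and Shares are included for brevity, and Newton-Raphson stands in for whatever optimiser the author's software used. It illustrates the dummy coding (Link as reference category) and the log2 transformation of the count variables.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for the page data
n = 400
post_type = rng.choice(["Link", "Photo", "Video"], n)
likes = rng.poisson(800, n).astype(float)
shares = rng.poisson(300, n).astype(float)

# Simulate selection with made-up effects favouring Photo/Video and Shares
true_logit = (-1.5 + 2.0 * (post_type == "Photo") + 1.8 * (post_type == "Video")
              + 0.4 * np.log2(shares + 1) - 0.5 * np.log2(likes + 1))
selected = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(float)

# Design matrix: intercept, Photo/Video dummies (Link = reference),
# and log2-transformed engagement counts (+1 to handle zeros)
X = np.column_stack([
    np.ones(n),
    (post_type == "Photo").astype(float),
    (post_type == "Video").astype(float),
    np.log2(likes + 1),
    np.log2(shares + 1),
])

def fit_logistic(X, y, steps=25):
    """Fit logistic regression coefficients by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ beta))
        w = p * (1 - p)
        grad = X.T @ (y - p)
        hess = (X * w[:, None]).T @ X + 1e-8 * np.eye(X.shape[1])
        beta += np.linalg.solve(hess, grad)
    return beta

beta = fit_logistic(X, selected)
print(dict(zip(["Intercept", "Photo", "Video", "log2_Likes", "log2_Shares"], beta)))
```

The fitted coefficients approximately recover the simulated effects, mirroring how the paper reads off the direction and size of each predictor's association with selection.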
To select a preferred model, Table 4 shows the coefficients and goodness-of-fit statistics for both models. The baseline model posits that Photo, Video, Likes, Comments, and Shares significantly affect the odds of being selected by the API. The full model, in addition, posits that Angry and Likes on Comments have a significant association with the odds of being selected, although the effect of Comments becomes insignificant after the inclusion of the latter.
Table 4. Estimated logistic regression coefficients.
The full model (Log Likelihood: −411.73; AIC: 847.47; BIC: 905.98) also demonstrates an improvement over the baseline model (Log Likelihood: −430.27; AIC: 872.55; BIC: 901.81) and the null model (Log Likelihood: −669.06; AIC: 1340.1; BIC: 1344.99) in Log Likelihood and AIC, but not in BIC. To further compare goodness-of-fit, Likelihood Ratio Tests were also conducted. The baseline model shows a statistically significant improvement in model fit over the null model.
In order to assess the predictive power of the model, Table 5 reports the performance of the full model. The prediction accuracy was 81%, sensitivity 90%, and specificity 73%. A negative predictive value of 90% was achieved, compared with a positive predictive value of 74%. Figure 5 depicts the receiver operating characteristic (ROC) curve for the classification of the selected and non-selected posts. The area under the curve (AUC) was 0.86. The model is thus able to reverse engineer the ranking algorithm to a large extent, replicating the result of Facebook’s ranking algorithm in about 80% of cases.
Table 5. Performance of the full model.
TP: true-positive; FP: false-positive; TN: true-negative; FN: false-negative; Accuracy: 0.81; Sensitivity: 0.9; Specificity: 0.73; Positive predictive value: 0.74; Negative predictive value: 0.9.
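The metrics in Table 5 all derive from the four confusion-matrix cells. The helper below shows the standard definitions; the cell counts in the example call are illustrative, since the table's raw counts are not reproduced here.

```python
def classification_report(tp, fp, tn, fn):
    """Standard diagnostic metrics from confusion-matrix cell counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Illustrative cell counts only, not the paper's actual confusion matrix
metrics = classification_report(tp=90, fp=26, tn=73, fn=10)
print(metrics)
```

Note the asymmetry the paper reports (higher NPV than PPV) arises whenever false positives outnumber false negatives relative to the class sizes.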

Figure 5. The ROC curve for the full model in the classification of selected and non-selected posts.
To understand the ranking algorithm, we can look at the predictors of the model. Photo is the strongest predictor with a coefficient of 3.35, while Video is the second strongest with a coefficient of 3.09, suggesting that the log odds of Photo and Video posts being selected by the API are expected to be 3.35 and 3.09 higher, respectively, than those of Link posts (the reference category), holding all other variables constant. The result is in line with the observation that Photo and Video posts are over-represented in the partial dataset, while Link posts are under-represented.
It is often expected that Facebook would rank posts by user engagement, and it is therefore reasonable to expect that user engagement metrics would affect whether a post is selected by the API. Among the original user engagement metrics of Likes, Comments, and Shares, the full model posits that only Shares has a significant positive effect on the odds of being selected by the API at the 95% confidence level. Holding all other variables constant, the log odds are expected to increase by 0.4 per two-fold increase in Shares. Surprisingly, Likes has a negative coefficient, suggesting that the more Likes a post receives, the lower the odds of the post being selected: holding all other variables constant, the log odds are expected to decrease by 0.5 per two-fold increase in Likes. Comments becomes insignificant after including Likes on Comments in the model, and its coefficient greatly decreases, suggesting that much of Comments’ relationship with the odds of being selected can be explained by the Likes on Comments; the log odds are expected to increase by 0.56 per two-fold increase in the total number of Likes a post’s Comments have received. In contrast to the common belief that posts are ranked by user engagement, the result is notable for positing that Comments cannot predict the odds of a post being selected but the Likes on Comments can, while Likes has a negative effect. Only Shares is in line with the common belief.
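Because the coefficients are on the log-odds scale, exponentiating them gives the more intuitive odds ratios. The short sketch below does this for the coefficients reported in the text (the variable names are the author's labels for those coefficients, not field names from the data).

```python
import math

# Log-odds coefficients as reported in the text; the log2_* entries
# describe the change per two-fold increase in the predictor
coefs = {
    "Photo": 3.35,
    "Video": 3.09,
    "log2_Shares": 0.4,
    "log2_Likes": -0.5,
    "log2_Comment_Likes": 0.56,
}

# exp(beta) is the multiplicative change in the odds themselves
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}
for name, o in odds_ratios.items():
    print(f"{name}: odds ratio = {o:.2f}")
```

For example, a doubling of Shares multiplies the odds of selection by exp(0.4) ≈ 1.49, while a doubling of Likes multiplies them by exp(−0.5) ≈ 0.61.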
Among the new reactions, Angry is the only significant predictor. Holding all other variables constant, the log odds of a post being selected are expected to decrease by 0.22 per two-fold increase in Angry reactions received. The negative coefficient of Angry could be suggestive of a ranking algorithm that takes the sentiment of users’ reactions into account. To explore this claim, Sentiment Analysis is used to investigate the difference in sentiment between the selected and non-selected posts.
Sentiment analysis
In order to compare sentiment, the texts of the selected and non-selected posts were split into two corpora. All the comments on each post were aggregated by post ID and added to the corpora. Bing Liu’s Opinion Lexicon (Hu and Liu, 2004) was used to count the positive and negative words within each text. A sentiment score was calculated by subtracting the count of negative words from the count of positive words; an emotion score was calculated by adding the two counts together. The sentiment score denotes the polarity of a text, while the emotion score denotes the total number of positive and negative words used in it.
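The scoring and the subsequent test can be sketched as follows. The tiny word lists are stand-ins for Bing Liu's Opinion Lexicon, and the example texts are invented; `mannwhitneyu` implements the Wilcoxon rank-sum test used in the paper.

```python
from scipy.stats import mannwhitneyu

# Tiny stand-in lexicons; the paper used Bing Liu's Opinion Lexicon
POSITIVE = {"good", "great", "proud", "win", "support"}
NEGATIVE = {"bad", "angry", "wrong", "poor"}

def scores(text):
    """Return (sentiment score, emotion score) for a text."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos - neg, pos + neg

# Invented example corpora standing in for the two groups of posts
selected = ["great win for scotland", "proud day good support", "great great good"]
non_selected = ["bad result today", "angry and wrong", "poor poor show"]

sel_sent = [scores(t)[0] for t in selected]
non_sent = [scores(t)[0] for t in non_selected]

# Wilcoxon rank-sum (Mann-Whitney U) test on the sentiment scores
stat, p = mannwhitneyu(sel_sent, non_sent, alternative="two-sided")
print(stat, p)
```

The same test applied to the emotion scores is what the paper uses to compare the volume of sentiment words between the two groups.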
Table 6 summarises the sentiment and emotion scores of the post texts and comments. Since the scores are not normally distributed, a Wilcoxon Rank Sum Test was performed on each score to test whether the medians of the selected and non-selected texts and comments differ (Sawilowsky, 2005). The test finds no significant difference in the sentiment scores of post texts or comments, nor in the emotion scores of post texts, but it does find a significant difference in the emotion scores of the comments.
Table 6. Sentiment and emotion scores.
The findings suggest that the selected posts tend to have more sentiment words in their comments. However, there is no evidence of bias based on the polarity of post texts and comments, or on sentiment word usage in post texts.
Replicating results
The above analysis was conducted with data from a single page. With only one case, it is hard to establish generalisability: there is no way to tell whether the biases are the result of a systematic application of a ranking algorithm by Facebook or merely a one-off occurrence. This necessitates replication using a different dataset.
Hong Kong Indigenous’s official Facebook page was selected for the replication. The full page data had been collected previously for the author’s other research. The datasets were generated by two rounds of data collection. First, on 14 August 2016, a request was sent to search all posts dated between 11 January 2015, the creation date of the page, and 31 December 2015 using Netvizz v1.3; a total of 1037 posts were collected. The partial data was collected on 5 July 2019 using Netvizz v1.6; a total of 542 posts were collected. This amount is slightly smaller than the limit of 600 per year suggested by Facebook. Although we cannot be certain about the exact reasons for the discrepancy without confirmation from Facebook, it can be explained in various ways: some page posts could have been removed by the page owner or Facebook, and Facebook might have limited the retrievable amount in proportion to the 600-per-year limit to account for the fact that the page was only created on 11 January 2015.
Following the same methods as the previous analysis, Figure 6 shows the mean counts of Photo, Video, and Link posts (286.58, 65.37, and 146.92, respectively) in the bootstrap samples drawn from the replication full dataset. The counts of Photo, Video, and Link posts in the partial replication dataset are 319, 66, and 121, respectively. Both the Photo and Link counts lie beyond the confidence intervals of three standard deviations. This work is thus able to replicate the over-representation of Photo posts and the under-representation of Link posts. However, there is no evidence to support the over-representation of Video posts.

Figure 6. Replication of post type analysis.
Figure 7 shows the mean values of average Likes, Comments, and Shares across the bootstrap samples (178.15, 15.02, and 19.23, respectively). The average Likes, Comments, and Shares of the partial dataset are 188.26, 16.31, and 24.25, respectively. The value for Likes lies beyond the upper bound of the two-standard-deviation confidence interval, while that of Shares lies above the three-standard-deviation confidence interval. The analysis thus demonstrates evidence for the biased representation of Likes and Shares, but there is no evidence to support that of Comments.

Figure 7. Replication of user engagement metrics analysis.
Figure 8 shows the replication of the Kendall’s tau analysis of the top terms.

Figure 8. Replication of top-terms analysis.
It is important to note that, although the Scottish National Party and Hong Kong Indigenous are both political parties, their posting behaviours and follower bases are vastly different. Among the former’s posts, Link was the most common format, followed by Photo and Video, while the most common post type for the latter was Photo. The latter also published more Status posts, which are extremely rare in the former. It is also important to take into account the discrepancy in follower bases when interpreting the results for user engagement: the Scottish National Party has over 293,000 page likes, while Hong Kong Indigenous has only around 89,560 at the time of writing. Furthermore, the main language used on the Scottish National Party’s page was English, while most content on Hong Kong Indigenous’s page was in Chinese. The disparity in language could lead to disparate results in language-dependent analyses such as top-term analysis and sentiment analysis. The mixed results support the hypothesis that the bias generated by the ranking algorithm could exhibit differently on pages of different types and characteristics.
To replicate the logistic regression analysis, both the baseline model and the full model were fitted again with the replication dataset. Table 7 shows the performance of the full model. The model is able to reproduce the selection outcome of Facebook's ranking algorithm in 64% of the cases (Table 8).
Performance of the replicated model.
TP: true-positive; FP: false-positive; TN: true-negative; FN: false-negative; Accuracy: 0.64; Sensitivity: 0.73; Specificity: 0.56; Positive predictive value: 0.60; Negative predictive value: 0.69.
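The performance figures above all derive from the model's 2x2 confusion matrix. The sketch below shows the standard formulas; the counts are hypothetical, chosen only so that the derived metrics approximate the reported values.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics derived from a 2x2 confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Hypothetical counts (illustrative only, not the paper's confusion matrix).
m = classification_metrics(tp=73, fp=44, tn=56, fn=27)
```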
Replicated logistic regression coefficients.
Concerning the predictors, the replication obtains significant results for Photo, Video, and Shares, whereas Likes, Comments, and Comment.Likes are not statistically significant. The consistent results for Photo, Video, and Shares across different datasets provide evidence to support the case that their over-representation is the outcome of the systematic application of the ranking algorithm.
Moving on to the replication of the sentiment analysis: since the post texts of the replication dataset are in Chinese, the National Taiwan University Semantic Dictionary was used instead (Ku and Chen, 2007). Table 9 summarises the replicated sentiment and emotion scores of the post texts and comments. Once again, the Wilcoxon rank-sum test found no significant difference between the sentiment scores of the post texts of the selected and non-selected posts.
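The Wilcoxon rank-sum test used here can be sketched in pure Python via the normal approximation. The sentiment scores in the usage example are hypothetical; a production analysis would use a statistics library, and this sketch applies mid-ranks for ties without a tie correction.

```python
import math

def rank_sum_test(x, y):
    """Wilcoxon rank-sum test using the normal approximation.
    Ties receive mid-ranks; no tie correction is applied.
    Returns (z statistic, two-sided p-value)."""
    n1, n2 = len(x), len(y)
    combined = sorted((v, i) for i, v in enumerate(x + y))
    ranks = [0.0] * (n1 + n2)
    i = 0
    while i < n1 + n2:
        j = i
        while j + 1 < n1 + n2 and combined[j + 1][0] == combined[i][0]:
            j += 1
        mid_rank = (i + j) / 2 + 1  # average of the tied positions (1-based)
        for k in range(i, j + 1):
            ranks[combined[k][1]] = mid_rank
        i = j + 1
    w = sum(ranks[:n1])                  # rank sum of the first sample
    mu = n1 * (n1 + n2 + 1) / 2          # expected rank sum under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Hypothetical sentiment scores: selected vs non-selected posts.
z, p = rank_sum_test([0.2, 0.5, 0.1, 0.4, 0.3], [0.3, 0.6, 0.2, 0.5, 0.4])
```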
Replicated sentiment and emotion scores.
The result of the replication provides further evidence to support the findings of the sentiment analysis. On top of replicating the result for post comments, the replication has also found evidence to support the differences in sentiment word usage between the texts of the selected and non-selected posts.
Conclusion
This work examines whether the data obtained through Facebook's Graph API after the introduction of the 600-post-per-year limitation is biased. To answer this question, data covering the same period were collected with the same parameters both before and after the change. The analysis began by comparing the post types of the full and partial datasets: Photo and Video posts are over-represented in the data returned through the new API, while Link posts are under-represented. This paper then turned to user engagement metrics and demonstrated the over-representation of posts with high user engagement in terms of Likes, Shares, and Comments.
To identify the features that would determine the selection of the posts, this work attempts to reverse engineer the ranking algorithm with logistic regression. The estimated model posits that post types, Likes, Angry, Shares, and Likes on Comments (Comment.Likes) are significant predictors of the odds of being selected by the new API. In contrast with the common expectation that posts with higher user engagement would have higher ranks and therefore a higher chance of being selected by the API, the model posits that Likes instead has a negative effect on the odds of being selected, while Comments have no statistically significant association. The finding suggests that the over-representation of Likes and Comments in the partial dataset could be an indirect effect of other features. This is in line with the replicated model, in which Likes and Comments are not significant.
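The logistic regression interpretation above can be made concrete with a small sketch. The coefficients here are hypothetical, with signs loosely mirroring the discussion (Shares positive, Likes negative); they are not the paper's estimates.

```python
import math

def predict_selection_probability(features, coefficients, intercept):
    """Logistic model: the log-odds of selection are a linear combination
    of the features; probability is the logistic transform of the log-odds."""
    log_odds = intercept + sum(coefficients[k] * v for k, v in features.items())
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical coefficients (illustrative values only).
coefs = {"Photo": 0.8, "Video": 0.6, "Likes": -0.001, "Shares": 0.02}

# A Photo post with 150 Likes and 30 Shares under this toy model.
p = predict_selection_probability(
    {"Photo": 1, "Video": 0, "Likes": 150, "Shares": 30}, coefs, intercept=-0.5
)

# Each coefficient corresponds to a multiplicative odds ratio of exp(beta);
# a negative coefficient (here, Likes) yields an odds ratio below 1.
odds_ratios = {k: math.exp(b) for k, b in coefs.items()}
```

This illustrates why a post with many Likes can still be passed over: the other features enter the same log-odds sum and can outweigh, or even reverse, the engagement signal.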
These findings have two implications for data-driven research. First, they have profound implications for studying social processes such as information diffusion, as the under-representation of Link posts means that a significant amount of link-sharing activity would become invisible through the API. Second, the findings suggest that there is no evidence to support the common expectation that the API would rank posts based on the number of Likes and Comments. While the selected posts seem to have more Likes and Comments, other features also affect the odds of being selected. Similarly, it is questionable to assume that the new API would return all the posts with the highest user engagement. Even though the selected posts on average have higher user engagement, some highly commented and liked posts might not be selected due to the effect of other features.
In order to explore whether the ranking algorithm takes the sentiment of the post text and users' responses into account, sentiment analysis was performed. This work finds no evidence of differences in overall sentiment, but the findings suggest that the selected posts tend to have more sentiment words in their comments. For post text, while the original analysis finds no statistically significant difference, a significant association is found in the replication. In short, it is reasonable to believe that, in addition to user engagement metrics and post type, textual content could also affect the odds of being selected. This could result in posts of certain linguistic styles being filtered out. The implication is especially far-reaching for studies of public opinion, for instance, as the new API tends to return posts with more emotional texts.
Furthermore, this paper finds notable differences between the top terms of the partial dataset and those of the random samples, suggesting that non-random factors might be influencing the representation of the most prominent terms in the partial dataset. These differences could also introduce bias into text models.
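A top-terms comparison of this kind reduces to ranking terms by frequency in each dataset and measuring the overlap of the resulting lists. A minimal sketch, using hypothetical post texts and Jaccard similarity as the overlap measure:

```python
from collections import Counter

def top_terms(texts, n=20, stopwords=frozenset()):
    """Rank terms by raw frequency across a collection of post texts."""
    counts = Counter(
        tok for text in texts
        for tok in text.lower().split()
        if tok not in stopwords
    )
    return [term for term, _ in counts.most_common(n)]

def jaccard(a, b):
    """Overlap between two term lists: |A intersect B| / |A union B|."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

# Hypothetical post texts standing in for the partial dataset
# and one random sample of the full dataset.
partial = top_terms(["independence referendum vote", "vote yes vote"])
sample = top_terms(["referendum poll result", "poll turnout vote"])
overlap = jaccard(partial, sample)
```

A markedly lower overlap between the partial dataset and the random samples than among the random samples themselves is what would signal non-random term selection.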
This work also attempts to replicate the findings with another dataset. The mixed results in the analysis of post type, user engagement metrics, and top terms suggest that the bias in any particular feature is likely to exhibit differently on different pages. However, the replication is also able to obtain consistent results for certain features, thereby shedding light on the underlying algorithm that governs the selection of the posts. It is important to note the different time periods between the replication and original datasets; changes in the ranking algorithm could have occurred between them. The author recognises this limitation and encourages further replications if additional datasets become available.
Despite these issues, the data retrieved from the Graph API is still a useful resource that enables a wide range of research methods. The bias created by the ranking algorithm should not stop us from using the data altogether. Instead, it calls for caution, prudence, and critical attention when using and interpreting the data (Rieder et al., 2015). Uncovering the bias of the ranking algorithm will help researchers better qualify and contextualise their research results.
There are two potential problems around the data used in this work that need to be addressed. First, when requesting data from Facebook's Graph API, the request is signed by two tokens: one for the application and another for the user. Therefore, two levels of personalisation may exist. Different applications may return different posts, while different users may also receive different posts as a result of variations in user settings. In order to minimise the effect of personalisation, the data for this work was collected using a "research account", a new Facebook account registered solely for accessing Netvizz. However, this work was not able to compare the personalisation effect across different applications, and such a comparison is no longer possible with the end of research access via the Graph API. Second, limitations in data access may have existed prior to the platform changes in 2017. Villegas (2016), for example, found that posts seemed to disappear from pages occasionally well before 2017. It is possible that this happened in the pages selected by this work, thereby affecting the completeness of the "full datasets". If possible, future studies should consider using data obtained from the page administrators for the comparison.
As for future work, given access to additional data, it is worthwhile to conduct further analysis on data collected from pages of different types and characteristics so as to determine whether the methodology presented here yields similar results. In addition, further work analysing the textual and graphical content of the selected posts could provide further understanding of the effect of the ranking algorithm.
In terms of the current landscape of Facebook research, Facebook has closed all research access via the Graph API since August 2019. As a result, data extraction applications like Netvizz and Netlytic no longer work, with the exception of Facepager, which was granted Page Public Content Access in December 2019 (Jünger and Keyling, 2019). That said, there are still many published, ongoing, and forthcoming research projects that rely on data already retrieved through the API, and this work remains relevant for them. Looking forward, there are currently two major initiatives for the future of Facebook research. First, Social Science One seeks to forge a new type of partnership between academic researchers and the private sector by brokering data access through a secured environment (King and Persily, 2019). Second, in January 2019, Facebook announced its plan to open up access to CrowdTangle to academics and the research community (Silverman, 2019). CrowdTangle is a social analytics tool acquired by Facebook in 2016; it provides access to content from public accounts and is mainly used by newsrooms and media publishers. These initiatives are still at an early stage of development at the time of writing, but the author encourages future research on the quality of the data accessed through them when they become available.
