Introduction
Founded in 2004, Facebook has become the most popular social networking site in the world. As of today, it has on average 1.45 billion daily active users and over 2.20 billion monthly active users (Facebook, 2018a). It has been reported that around three-quarters of Internet users have a Facebook account, and 7 in 10 of them access the platform daily (Duggan, 2015). The platform allows users to post and share content such as status updates, photos, links, and videos. As a result of its sheer online ubiquity, Facebook has served as a new platform on which innumerable social interactions are performed every day. This provides researchers with an unprecedented opportunity to study society and social phenomena. Facebook research has proliferated across academic disciplines in recent years, ranging from sociology, politics, law, economics, and psychology to informatics, marketing, and communication studies (Wilson et al., 2012).
Like most social networking sites, Facebook provides an application programming interface (API) to third-party developers with the aim of fostering application development and integration with other services. Facebook's Graph API is the primary way to extract data from and upload data to Facebook (Facebook, 2018c). It is a HyperText Transfer Protocol (HTTP)-based API that programs can use to query data, post new content, manage ads, and perform a wide range of other tasks (Müller and Thiesing, 2011). Researchers can also use the API to obtain data, although limitations on the type, quantity, and frequency of data extraction are often in place. In contrast to Twitter, which has become well known for its open data policy, Facebook has been comparatively restrictive in terms of what data can be extracted. Furthermore, Facebook retains the right to change or terminate its data interfaces, which can lead to substantial problems for academic researchers (Rieder, 2013; Rieder et al., 2015). For instance, in reaction to the Cambridge Analytica controversy in March 2018, Facebook announced a plan to substantially tighten access restrictions on its API; this led Axel Bruns, president of the Association of Internet Researchers, and colleagues to issue a public response voicing their concerns about the collateral damage to academic research (Bruns et al., 2018; Schroepfer, 2018). After a series of changes, most user data, including user posts and friend lists, were no longer retrievable through the Graph API, although page data remained retrievable with the Page Public Content Access permission. However, in August 2019, Facebook removed the Page Public Content Access permission for Netvizz and other data extraction applications, thereby blocking all research access via the Graph API (Rieder, 2018).
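To illustrate what such an HTTP query looked like, the sketch below builds a request URL for a page's posts feed. This is a minimal reconstruction of the pre-2018 endpoint shape, not current practice; the page ID, token, and field list are placeholders, and such requests now require permissions that have since been revoked for most applications.

```python
from urllib.parse import urlencode

GRAPH = "https://graph.facebook.com/v3.0"

def page_posts_url(page_id, access_token, since, until, limit=100):
    """Build a Graph API request URL for a page's posts feed
    (endpoint shape as of the pre-2018 API)."""
    params = {
        "fields": "id,created_time,message,type,shares",
        "since": since,
        "until": until,
        "limit": limit,
        "access_token": access_token,
    }
    return f"{GRAPH}/{page_id}/posts?" + urlencode(params)

# Placeholder page ID and token, for illustration only
url = page_posts_url("SNP", "TOKEN", "2016-01-01", "2016-12-31")
print(url)
```

A GET request to such a URL returned a JSON page of posts plus a paging cursor for fetching the next batch, which is how tools such as Netvizz iterated over a page's history.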
In fact, the tightening of access began long before the Cambridge Analytica scandal: since v2.11 (released on 7 November 2017), Facebook has limited the number of posts retrievable through the Graph API to 600 per year for any page. While the documentation states that "the API will return a maximum of 600 ranked, published posts per year" (Facebook, 2017), how these posts are selected or ranked is currently unknown. To the best of the author's knowledge, owing to the novelty of the restriction, there is currently no study of the effect it may have on the data. However, drawing insight from similar problems, for example the rate limitation of the Twitter Streaming API (Morstatter et al., 2014, 2013), it is reasonable to suspect that the restriction could bias the data returned through the API for any page that produces more than 600 posts per year. This paper seeks to assess the influence of the new limitation on the measures and metrics performed on page data extracted through the new API.
This paper begins the analysis by employing statistical measures commonly used to compare multiple datasets. It first investigates the difference in the proportion of post types extracted by the old and the new API. It then compares key social media metrics, including Likes, Shares, and Comments, of the full and partial datasets. Drawing on the bootstrap resampling technique, this paper also assesses what those metrics would look like for randomly drawn datasets and compares them against those of the partial dataset. Adapting the method employed by Morstatter et al. (2013) to detect biases in Twitter's Streaming API, this paper compares the top terms obtained from the old and new API using the Kendall rank correlation coefficient. It then attempts to reverse engineer the ranking algorithm of the new API with logistic regression and identifies the predictors of the odds of a post being selected. After that, Sentiment Analysis is used to investigate the difference in sentiment between the selected and non-selected posts. Finally, this paper attempts to replicate the findings with data from another Facebook page.
Related work
Related prior work from three different perspectives is discussed in this section: work with Facebook’s Graph API, bias in black box systems, and bias in social media data.
Work with Facebook’s Graph API
Facebook’s Graph API has been used throughout the domains of social sciences and informatics to gain understanding of how users behave on the platform. Popular research domains that often use the Facebook API include social networks (Hogan, 2008; Spiliotopoulos et al., 2014), public opinion and election campaigns (Caton et al., 2015; Gulati and Williams, 2013; Larsson, 2015, 2016), political activism and participation (Chan et al., 2016; KhosraviNik and Zia, 2015; Langlois et al., 2009; Lee et al., 2015; Tang and Lee, 2013), diffusion of information and misinformation (Bakshy et al., 2015; Bessi et al., 2016; Chan and Fu, 2017; Del Vicario et al., 2016, 2017; Sun et al., 2009), online debate (Van Es et al., 2014), and so on. There is a vast number of studies conducted using Facebook as the data source; this review can only provide a cursory glance at the wealth of literature. A more comprehensive review can be found in Wilson et al. (2012).
Apart from studies that directly obtained data through the API, there are also studies that rely on third-party applications to obtain data from Facebook. Netvizz, for example, is a data collection and extraction application that uses Facebook’s Graph API to extract data (Rieder, 2013). Studies that use these applications will also be affected by the bias of Facebook’s Graph API and therefore relevant to this work.
Bias in black box systems
This paper is related to the topic of assessing the results from black box systems. In particular, studies that assess Twitter’s APIs are especially relevant. Choudhury et al. (2010) analyse the effect different sampling methods have on the way link propagation is perceived. Morstatter et al. (2013) focus on the sample bias from Twitter’s APIs. Their work compares four commonly studied aspects of the Streaming API and Firehose data. Using correlation, they first investigate the bias between the top hashtags in the two datasets and reveal bias in their occurrence. The authors also compare the aspects of topic models, networks, and geographical locations. They find various discrepancies between the datasets in the topics extracted through latent Dirichlet allocation (Blei et al., 2003) and in the authors extracted from the User × User retweet network. No bias is found in the number of geotagged tweets. Adopting the method proposed in their previous work, Morstatter et al. (2014) propose a method to detect sample bias without the need for the “gold standard” Firehose data, which is often considered too costly for most researchers. By comparing with the Sample API, the authors are able to identify time periods in the Streaming API data where the trend of a hashtag is biased. Lastly, they also study the stability of the data when the queries originate from different geographic areas and begin at different times.
Similarly, Driscoll and Walker (2014) compare Twitter’s Streaming API with the Firehose provided by Gnip PowerTrack, one of the partner firms authorised to resell Twitter data. They suggest that while the Streaming API offers a good choice for longitudinal data collection, it performs poorly for huge, short-term events. By contrast, PowerTrack can return very large collections of tweets posted within short periods but entails an extremely high cost in the long term. Tromble et al. (2017) also compare data from the Firehose, Streaming, and Search APIs. Using Kendall’s tau and logistic regression analyses, they identify the user and content features that make a tweet more or less likely to be returned by the APIs.
Bias in social media data
The potential bias caused by the data generation and extraction processes of social networking sites is well studied in the fields of social sciences and informatics. A significant number of studies warn about the limits and pitfalls of using social media data in general (Giglietto et al., 2012; Hargittai, 2018; Lomborg and Bechmann, 2014; Plantin et al., 2018; Tufekci, 2013, 2014). Hargittai (2018) discusses the representativeness of social media and points out the oversampling of people from privileged backgrounds. Situating the problem in a wider context, Plantin et al. (2018) warn about the profit-motivated and proprietary nature of Facebook. The authors suggest that the creation of APIs has turned the open web consisting of URIs and repeatable HTTP transactions into “walled gardens”, where Facebook has the final control and communications are filtered by a profit-extracting sieve.
Concerning work specifically about Facebook’s API, Müller and Thiesing (2011) provide a technical overview of the design and possible uses of the API, but there is little critical assessment. Lomborg and Bechmann (2014) suggest that APIs have an in-built bias toward the most active content contributors. Bodle (2011) provides a critical assessment of Facebook’s APIs, but the paper’s focus is on their threat to privacy, data security, transparency, and user autonomy, as well as how Facebook achieves market dominance and user dependency using APIs. Rieder et al. (2015) offer a critical examination of Facebook’s Graph API and identify the potential of, and concerns with, social studies using page data obtained through the API. The authors warn that social media data are not produced by methodological devices designed by researchers, but by technical interfaces developed by the platform provider. They identify the issues of data detail, completeness, consistency, and architectural complexity, and suggest that Facebook’s Graph API does not always provide complete access to all data and that the data should therefore be used with prudence and critical attentiveness.
While these works have rightly pointed out the potential bias in Facebook’s Graph API, they are not able to identify the specific direction and magnitude of the bias. Furthermore, they are unable to address the specific effects introduced by the recent changes to the API. To the best of the author’s knowledge, there is no existing study on the effect of the new 600-post-per-year limitation introduced by Facebook in November 2017. This work fills that gap by investigating the potential bias with statistical measures.
The data
The Scottish National Party’s official Facebook page was selected for the analysis. It was selected for practical reasons: as a result of the changes to Facebook’s Graph API, there is, at the time of writing, no publicly available means to obtain the full data of any page without violating Facebook’s Terms of Service. Only data collected before the introduction of the limitation can be used. The page data used in this work was collected previously for the author’s other research. The author recognises the limitation of using only one case, and a replication is therefore conducted using data from another page towards the end of the paper.
This work involves two rounds of data collection. First, on 7 May 2017, a request was sent to search all posts dated between 1 January 2016 and 31 December 2016 on the page using Netvizz v1.42 (Rieder, 2013), a Facebook application that extracts page or group data through Facebook’s Graph API. A total of 1031 posts were found and extracted. Second, after the introduction of the rate limitation, all posts from the same time period were searched again with exactly the same parameters and extracted with Netvizz v1.45 through the new API on 6 April 2018. The search returned a total of 598 posts, an amount very close to the limit of 600 per year suggested by Facebook. The discrepancy between the two searches suggests the limitation was in place. It is worth noting that posts could have been deleted by the page owner between the time the first dataset was collected and the second; this could also contribute to the discrepancy between the two datasets. However, we cannot know for sure without confirmation from Facebook or the page owner.
The data collection produced two datasets: a full dataset consisting of all the posts on the Facebook page (referred to as the “full dataset” hereafter) and a partial dataset consisting of approximately 600 posts selected by the new API (referred to as the “partial dataset” hereafter). One of the key questions asked in this work is how those 600 posts were selected. This will be discussed in the next section.
Statistical measures
This paper compares the statistical properties of the two datasets with the aim of assessing how well the characteristics of the partial dataset match those of the full dataset. This section begins by comparing the proportion of post type, then continues to compare the key social media metrics. Finally, this paper compares the top terms in the posts using rank correlation statistics.
Post type
Table 1 reports the post types and their counts in the datasets. With only one unbiased view from the full data, it is difficult to tell whether the discrepancy is significant or just a small deviation from a random sample. In order to assess the randomness of the partial dataset, this paper employs the bootstrap re-sampling technique to create 1000 sample datasets (Efron, 1982). First, posts were selected from the full dataset uniformly at random (without replacement) until the size of the partial dataset (598 posts) was reached. This process was repeated 1000 times, yielding a list of 1000 datasets, each consisting of 598 posts drawn randomly from the full dataset. The statistical measures of these bootstrapped datasets represent how a dataset would perform if it were selected at random. By taking the mean and standard deviation of the metrics across the bootstrapped datasets, confidence intervals can be obtained. In the following figures, confidence intervals of two and three standard deviations are plotted. Due to the nature of the bootstrapping method, observations falling beyond these confidence intervals are statistically significant at confidence levels of 95% and 99.7%, respectively.
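The bootstrap procedure described above can be sketched as follows. This is a minimal illustration rather than the study's actual code: the post-type distribution in the toy example is made up, and only the sample size (598) and number of replications (1000) follow the paper.

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_counts(full_types, sample_size, n_reps=1000):
    """Draw n_reps random samples (without replacement) of sample_size
    posts and return the mean and SD of each post type's count."""
    full_types = np.asarray(full_types)
    types = np.unique(full_types)
    counts = {t: [] for t in types}
    for _ in range(n_reps):
        sample = rng.choice(full_types, size=sample_size, replace=False)
        for t in types:
            counts[t].append(int((sample == t).sum()))
    return {t: (float(np.mean(c)), float(np.std(c))) for t, c in counts.items()}

# Toy stand-in for the full dataset: 1026 posts with a made-up type mix
full = ["Link"] * 500 + ["Photo"] * 330 + ["Video"] * 196
stats = bootstrap_counts(full, sample_size=598)
for t, (mean, sd) in stats.items():
    print(f"{t}: mean={mean:.1f}, 99.7% interval=({mean - 3*sd:.1f}, {mean + 3*sd:.1f})")
```

An observed count from the partial dataset that falls outside the mean ± 3 SD interval would, as in the paper, be treated as a statistically significant deviation from random selection.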
Table 1. Post count by type.
Due to the rarity of Event, Note, and Status posts, analysis of these post types is not possible. Therefore, five posts of these types were dropped from the full dataset and two from the partial dataset. This paper focuses on the post types Link, Photo, and Video. Figure 1 shows the density of the counts of Photo, Video, and Link posts. A mean value of 190.62 for Photo posts is observed. Taking confidence intervals of three standard deviations, there is a 99.7% chance that the count of Photo posts in a dataset drawn at random would lie between 167.70 and 213.55. However, the partial dataset contains 301 Photo posts. This could be indicative of a filtering process in Facebook’s new API that causes an over-representation of Photo posts in the data.

Figure 1. Count of different post types.
A similar pattern can be observed for Video posts. A mean value of 113.43 is observed. Confidence intervals of three standard deviations suggest that the count of Video posts would lie between 95.01 and 131.84 if the posts were selected at random. A count of 149 is observed in the partial dataset, which is higher than the upper bound of the confidence interval. The analysis thus also indicates an over-representation of Video posts.
Figures for Link posts show the opposite trend. A mean of 291.03 posts is observed. The upper bound of the confidence interval is 314.98, while the lower bound is 267.09. However, the count of Link posts in the partial dataset is 146, far below the lower bound; the result suggests that Link posts are under-represented in the partial dataset.
Bootstrap re-sampling suggests that the distribution of post types in the partial dataset deviates in a statistically significant way from that of randomly drawn datasets. The new Graph API appears to over-represent Photo and Video posts, while Link posts are under-represented.
User engagement metrics
Likes, Shares, and Comments are important metrics of user engagement on Facebook and important means for users to interact with one another on the platform. Even though Facebook’s News Feed is often described as a black box system, these metrics are known to influence whether a post is fed to another user (Facebook, 2018b). Therefore, it is reasonable to expect that they may also affect the chance of a post being selected by the new API.
Table 2 shows the mean values of the metrics observed in the full and partial datasets. Using the same bootstrap re-sampling method introduced in the previous section, Figure 2 shows the mean counts of Likes, Comments, and Shares across the bootstrap samples. For Likes, we can observe a mean value of 866.77. The confidence intervals of three standard deviations lie between 735.38 and 998.55, and those of two standard deviations between 779.24 and 954.69. However, a mean value of 983.18 is observed in the partial dataset. In other words, there is a statistically significant difference between the mean Likes of the bootstrapped samples and the observed value at the 95% confidence level, but not at 99.7%.
Table 2. Mean value of user engagement metrics.

Figure 2. User engagement metrics.
The mean count of Comments in the bootstrap samples is 401.60, and the upper bound of the two-standard-deviation confidence interval is 426.24, while the mean Comments count of the partial dataset is 426.24. Again, a statistically significant difference can be observed at the 95% confidence level.
For Shares, the mean count of the random samples is 332.02. The confidence intervals of three standard deviations lie between 213.48 and 450.56. A mean value of 470.91 is observed in the partial dataset. In other words, there is a statistically significant difference between the mean Shares of the bootstrap samples and the observed value at the 99.7% confidence level.
The above analysis provides evidence of biased representation of Likes, Comments, and Shares, with Shares showing the strongest evidence of the three. The result is in line with the speculation that Facebook’s new Graph API is biased towards posts with high user engagement. However, as later sections will show, the over-representation of high-engagement posts might be an indirect result of other features.
Top-term analysis
This section focuses on the terms of the Facebook posts, including the text of status updates and the captions of photos, videos, and link shares. To provide a glimpse of the top terms in both datasets, Figure 3 includes word clouds of the top terms from each dataset. Following the method used by Morstatter et al. (2013), this work compares the top terms in the full dataset, the partial dataset, and the bootstrap sample datasets using the Kendall rank correlation coefficient.

Figure 3. Word cloud of top terms from each dataset: (a) full data and (b) partial data.
The Kendall rank correlation coefficient (τ) measures the ordinal association between two rankings, ranging from −1 (complete disagreement) to 1 (complete agreement).
The bootstrap datasets were first compared with the full dataset. This process yielded 1000 estimates at each step. The mean and standard deviation of these estimates were calculated, from which confidence intervals of three standard deviations were approximated. After that, the partial dataset was compared with the full dataset and its τ values were computed.
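A comparison of this kind can be sketched with `scipy.stats.kendalltau`. The tokenisation and the handling of terms missing from one list (placing them at the bottom rank) are the author's illustrative assumptions, not necessarily the paper's exact preprocessing.

```python
from collections import Counter
from scipy.stats import kendalltau

def top_terms(texts, k):
    """Return the k most frequent whitespace-delimited terms."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return [w for w, _ in counts.most_common(k)]

def tau_at_depth(full_texts, partial_texts, k):
    """Kendall's tau between the top-k term rankings of two corpora.
    Terms missing from one list are assigned the bottom rank k."""
    full_top = top_terms(full_texts, k)
    part_top = top_terms(partial_texts, k)
    vocab = list(dict.fromkeys(full_top + part_top))
    rank_f = {w: i for i, w in enumerate(full_top)}
    rank_p = {w: i for i, w in enumerate(part_top)}
    xs = [rank_f.get(w, k) for w in vocab]
    ys = [rank_p.get(w, k) for w in vocab]
    tau, _ = kendalltau(xs, ys)
    return tau

# Illustrative corpora standing in for the post texts
full_texts = ["snp wins the vote", "vote snp for scotland", "snp snp"]
partial_texts = ["snp wins the vote", "snp for scotland"]
print(tau_at_depth(full_texts, partial_texts, 5))
```

Repeating `tau_at_depth` for each bootstrap sample at increasing depths k yields the distribution against which the partial dataset's τ values are compared.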
Figure 4 shows the values of τ at each step, together with the confidence intervals derived from the bootstrap samples.

Figure 4. Kendall’s tau of top terms.
It is observed that most
Reverse engineering the ranking algorithm
To identify what features determine the ranking of the posts, this work seeks to reverse engineer the ranking algorithm with logistic regression. Logistic regression is well suited to describing relationships between a binary outcome variable and one or more predictor variables (Peng et al., 2002). One key advantage of logistic regression is that the coefficients, or weights, of the model are interpretable (Molnar, 2019). In order to decipher the ranking algorithm, the interpretability of a model’s decisions is important; a well-performing but uninterpretable predictive model would have only limited use for our purpose.
The full dataset is compared with the partial dataset to check whether each post exists in both. A new dummy variable is created to represent the ground truth of whether a post was selected by Facebook’s Graph API (selected = 1, non-selected = 0). On top of the features discussed above, the full dataset also includes data on Facebook’s new reactions (Love, Haha, Wow, Sad, Angry); to deal with the fact that the new reactions were only introduced in February 2016, a total of 57 posts published before the introduction date were removed (Facebook, 2016). Table 3 summarises the descriptive statistics of the reduced dataset.
Table 3. Descriptive statistics.
To address the skewed distributions and the non-linear relationship between the dependent and independent variables, a log2 transformation was performed on all the continuous variables, namely Comments, Comment.Likes, Shares, Likes, Love, Wow, Haha, Sad, and Angry. Due to the log2 transformation, each estimated coefficient should be interpreted as the change in log odds associated with every two-fold increase in the predictor.
To predict the log odds of being selected by the API, a baseline model including all the features analysed in the “Statistical measures” section (post type, Likes, Comments, and Shares) was estimated.
The full model includes all the variables of the baseline model and also incorporates the new reactions and Likes on Comments.
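A model of the baseline kind can be sketched in a self-contained way as follows. This is not the paper's estimation code: the data are simulated with made-up coefficients, only Likes and Shares are included for brevity, and Newton-Raphson stands in for whatever optimiser the author's software used. It illustrates the dummy coding (Link as reference category) and the log2 transformation of the count variables.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for the page data
n = 400
post_type = rng.choice(["Link", "Photo", "Video"], n)
likes = rng.poisson(800, n).astype(float)
shares = rng.poisson(300, n).astype(float)

# Simulate selection with made-up effects favouring Photo/Video and Shares
true_logit = (-1.5 + 2.0 * (post_type == "Photo") + 1.8 * (post_type == "Video")
              + 0.4 * np.log2(shares + 1) - 0.5 * np.log2(likes + 1))
selected = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(float)

# Design matrix: intercept, Photo/Video dummies (Link = reference),
# and log2-transformed engagement counts (+1 to handle zeros)
X = np.column_stack([
    np.ones(n),
    (post_type == "Photo").astype(float),
    (post_type == "Video").astype(float),
    np.log2(likes + 1),
    np.log2(shares + 1),
])

def fit_logistic(X, y, steps=25):
    """Fit logistic regression coefficients by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ beta))
        w = p * (1 - p)
        grad = X.T @ (y - p)
        hess = (X * w[:, None]).T @ X + 1e-8 * np.eye(X.shape[1])
        beta += np.linalg.solve(hess, grad)
    return beta

beta = fit_logistic(X, selected)
print(dict(zip(["Intercept", "Photo", "Video", "log2_Likes", "log2_Shares"], beta)))
```

The fitted coefficients approximately recover the simulated effects, mirroring how the paper reads off the direction and size of each predictor's association with selection.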
To select a preferred model, Table 4 shows the coefficients and goodness-of-fit statistics for both models. The baseline model posits that Photo, Video, Likes, Comments, and Shares significantly affect the odds of being selected by the API. The full model, in addition, posits that Angry and Likes on Comments have a significant association with the odds of being selected, although the effect of Comments becomes insignificant after the inclusion of the latter.
Table 4. Estimated logistic regression coefficients.
The full model (Log Likelihood: −411.73; AIC: 847.47; BIC: 905.98) also demonstrates an improvement over the baseline model (Log Likelihood: −430.27; AIC: 872.55; BIC: 901.81) and the null model (Log Likelihood: −669.06; AIC: 1340.1; BIC: 1344.99) in Log Likelihood and AIC, but not in BIC. To further compare goodness-of-fit, Likelihood Ratio Tests were also conducted. The baseline model shows a statistically significant improvement in model fit over the null model.
In order to assess the predictive power of the model, Table 5 reports the performance of the full model. The prediction accuracy was 81%, sensitivity 90%, and specificity 73%. A negative predictive value of 90% was achieved, compared with a positive predictive value of 74%. Figure 5 depicts the receiver operating characteristic (ROC) curve for the classification of the selected and non-selected posts. The area under the curve (AUC) was 0.86. The model is thus able to reverse engineer the ranking algorithm to a large extent, replicating the result of Facebook’s ranking algorithm in about 80% of cases.
Table 5. Performance of the full model.
TP: true-positive; FP: false-positive; TN: true-negative; FN: false-negative; Accuracy: 0.81; Sensitivity: 0.9; Specificity: 0.73; Positive predictive value: 0.74; Negative predictive value: 0.9.
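The metrics in Table 5 all derive from the four confusion-matrix cells. The helper below shows the standard definitions; the cell counts in the example call are illustrative, since the table's raw counts are not reproduced here.

```python
def classification_report(tp, fp, tn, fn):
    """Standard diagnostic metrics from confusion-matrix cell counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Illustrative cell counts only, not the paper's actual confusion matrix
metrics = classification_report(tp=90, fp=26, tn=73, fn=10)
print(metrics)
```

Note the asymmetry the paper reports (higher NPV than PPV) arises whenever false positives outnumber false negatives relative to the class sizes.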

Figure 5. The ROC curve for the full model in the classification of selected and non-selected posts.
To understand the ranking algorithm, we can look at the predictors of the model. Photo is the strongest predictor with a coefficient of 3.35, while Video is the second strongest with a coefficient of 3.09, suggesting that the log odds of Photo and Video posts being selected by the API are expected to be 3.35 and 3.09 higher, respectively, than those of Link posts (the reference category), holding all other variables constant. The result is in line with the observation that Photo and Video posts are over-represented in the partial dataset, while Link posts are under-represented.
It is often expected that Facebook would rank posts by user engagement, and it is therefore reasonable to expect that user engagement metrics would affect whether a post is selected by the API. Among the original user engagement metrics of Likes, Comments, and Shares, the full model posits that only Shares has a significant positive effect on the odds of being selected by the API at the 95% confidence level. Holding all other variables constant, the log odds are expected to increase by 0.4 per two-fold increase in Shares. Surprisingly, Likes has a negative coefficient, suggesting that the more Likes a post receives, the lower the odds of the post being selected: holding all other variables constant, the log odds are expected to decrease by 0.5 per two-fold increase in Likes. Comments becomes insignificant after including Likes on Comments in the model, and its coefficient greatly decreases, suggesting that much of Comments’ relationship with the odds of being selected can be explained by the Likes on Comments; the log odds are expected to increase by 0.56 per two-fold increase in the total number of Likes a post’s Comments have received. In contrast to the common belief that posts are ranked by user engagement, the result is notable for positing that Comments cannot predict the odds of a post being selected but the Likes on Comments can, while Likes has a negative effect. Only Shares is in line with the common belief.
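Because the coefficients are on the log-odds scale, exponentiating them gives the more intuitive odds ratios. The short sketch below does this for the coefficients reported in the text (the variable names are the author's labels for those coefficients, not field names from the data).

```python
import math

# Log-odds coefficients as reported in the text; the log2_* entries
# describe the change per two-fold increase in the predictor
coefs = {
    "Photo": 3.35,
    "Video": 3.09,
    "log2_Shares": 0.4,
    "log2_Likes": -0.5,
    "log2_Comment_Likes": 0.56,
}

# exp(beta) is the multiplicative change in the odds themselves
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}
for name, o in odds_ratios.items():
    print(f"{name}: odds ratio = {o:.2f}")
```

For example, a doubling of Shares multiplies the odds of selection by exp(0.4) ≈ 1.49, while a doubling of Likes multiplies them by exp(−0.5) ≈ 0.61.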
Among the new reactions, Angry is the only significant predictor. Holding all other variables constant, the log odds of a post being selected are expected to decrease by 0.22 per two-fold increase in Angry reactions received. The negative coefficient of Angry could be suggestive of a ranking algorithm that takes the sentiment of users’ reactions into account. To explore this claim, Sentiment Analysis is used to investigate the difference in sentiment between the selected and non-selected posts.
Sentiment analysis
In order to compare sentiment, the texts of the selected and non-selected posts were split into two corpora. All the comments on each post were aggregated by post ID and added to the corpora. Bing Liu’s Opinion Lexicon (Hu and Liu, 2004) was used to count the positive and negative words within each text. A sentiment score was calculated by subtracting the count of negative words from the count of positive words; an emotion score was calculated by adding the two counts together. The sentiment score denotes the polarity of a text, while the emotion score denotes the total number of positive and negative words used in it.
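The scoring and the subsequent test can be sketched as follows. The tiny word lists are stand-ins for Bing Liu's Opinion Lexicon, and the example texts are invented; `mannwhitneyu` implements the Wilcoxon rank-sum test used in the paper.

```python
from scipy.stats import mannwhitneyu

# Tiny stand-in lexicons; the paper used Bing Liu's Opinion Lexicon
POSITIVE = {"good", "great", "proud", "win", "support"}
NEGATIVE = {"bad", "angry", "wrong", "poor"}

def scores(text):
    """Return (sentiment score, emotion score) for a text."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos - neg, pos + neg

# Invented example corpora standing in for the two groups of posts
selected = ["great win for scotland", "proud day good support", "great great good"]
non_selected = ["bad result today", "angry and wrong", "poor poor show"]

sel_sent = [scores(t)[0] for t in selected]
non_sent = [scores(t)[0] for t in non_selected]

# Wilcoxon rank-sum (Mann-Whitney U) test on the sentiment scores
stat, p = mannwhitneyu(sel_sent, non_sent, alternative="two-sided")
print(stat, p)
```

The same test applied to the emotion scores is what the paper uses to compare the volume of sentiment words between the two groups.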
Table 6 summarises the sentiment and emotion scores of the post texts and comments. Since the scores are not normally distributed, a Wilcoxon Rank Sum Test was performed on each score to test whether the medians of the selected and non-selected texts and comments differ (Sawilowsky, 2005). The test finds no significant difference in the sentiment scores of post texts or comments, nor in the emotion scores of post texts, but it does find a significant difference in the emotion scores of the comments.
Table 6. Sentiment and emotion scores.
The findings suggest that the selected posts tend to have more sentiment words in their comments. However, there is no evidence of bias based on the polarity of post texts and comments, or on sentiment word usage in post texts.
Replicating results
The above analysis was conducted with data from a single page. With only one case, it is hard to establish generalisability: there is no way to tell whether the biases are the result of a systematic application of a ranking algorithm by Facebook or merely a one-off occurrence. This necessitates replication using a different dataset.
Hong Kong Indigenous’s official Facebook page was selected for the replication. The full page data had been collected previously for the author’s other research. The datasets were generated by two rounds of data collection. First, on 14 August 2016, a request was sent to search all posts dated between 11 January 2015, the creation date of the page, and 31 December 2015 using Netvizz v1.3; a total of 1037 posts were collected. The partial data was collected on 5 July 2019 using Netvizz v1.6; a total of 542 posts were collected. This amount is slightly smaller than the limit of 600 per year suggested by Facebook. Although we cannot be certain about the exact reasons for the discrepancy without confirmation from Facebook, it can be explained in various ways: some page posts could have been removed by the page owner or Facebook, and Facebook might have limited the retrievable amount in proportion to the 600-per-year limit to account for the fact that the page was only created on 11 January 2015.
Following the same methods as the previous analysis, Figure 6 shows the mean counts of Photo, Video, and Link posts (286.58, 65.37, and 146.92, respectively) in the bootstrap samples drawn from the replication full dataset. The counts of Photo, Video, and Link posts in the partial replication dataset are 319, 66, and 121, respectively. Both the Photo and Link counts lie beyond the confidence intervals of three standard deviations. This work is thus able to replicate the over-representation of Photo posts and the under-representation of Link posts. However, there is no evidence to support the over-representation of Video posts.

Figure 6. Replication of post type analysis.
Figure 7 shows the mean values of average Likes, Comments, and Shares across the bootstrap samples (178.15, 15.02, and 19.23, respectively). The average Likes, Comments, and Shares of the partial dataset are 188.26, 16.31, and 24.25, respectively. The value for Likes lies beyond the upper bound of the two-standard-deviation confidence interval, while that of Shares lies above the three-standard-deviation confidence interval. The analysis thus demonstrates evidence for the biased representation of Likes and Shares, but there is no evidence to support that of Comments.

Figure 7. Replication of user engagement metrics analysis.
Figure 8 shows the replication of the Kendall’s tau analysis of the top terms.

Figure 8. Replication of top-terms analysis.
It is important to note that, although the Scottish National Party and Hong Kong Indigenous are both political parties, their posting behaviours and follower bases are vastly different. Among the former’s posts, Link was the most common format, followed by Photo and Video, while the most common post type for the latter was Photo. The latter also published more Status posts, which are extremely rare in the former. It is also important to take into account the discrepancy in follower bases when interpreting the results for user engagement: the Scottish National Party has over 293,000 page likes, while Hong Kong Indigenous has only around 89,560 at the time of writing. Furthermore, the main language used on the Scottish National Party’s page was English, while most content on Hong Kong Indigenous’s page was in Chinese. The disparity in language could lead to disparate results in language-dependent analyses such as top-term analysis and sentiment analysis. The mixed results support the hypothesis that the bias generated by the ranking algorithm could exhibit differently on pages of different types and characteristics.
To replicate the logistic regression analysis, both the baseline model and the full model were fitted again with the replication dataset. Table 7 shows the performance of the full model. The model is able to reproduce the selection outcome of Facebook's ranking algorithm in 64% of the cases (Table 8).
Performance of the replicated model.
TP: true-positive; FP: false-positive; TN: true-negative; FN: false-negative; Accuracy: 0.64; Sensitivity: 0.73; Specificity: 0.56; Positive predictive value: 0.60; Negative predictive value: 0.69.
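The performance figures above all derive from the model's 2x2 confusion matrix. The sketch below shows the standard formulas; the counts are hypothetical, chosen only so that the derived metrics approximate the reported values.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics derived from a 2x2 confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Hypothetical counts (illustrative only, not the paper's confusion matrix).
m = classification_metrics(tp=73, fp=44, tn=56, fn=27)
```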
Replicated logistic regression coefficients.
Concerning the predictors, the replication obtains significant results for Photo, Video, and Shares, whereas Likes, Comments, and Comment.Likes are not statistically significant. The consistent results for Photo, Video, and Shares across different datasets provide evidence to support the case that their over-representation is the outcome of the systematic application of the ranking algorithm.
Moving on to the replication of the sentiment analysis: since the post texts of the replication dataset are in Chinese, the National Taiwan University Semantic Dictionary was used instead (Ku and Chen, 2007). Table 9 summarises the replicated sentiment and emotion scores of the post texts and comments. Once again, the Wilcoxon rank-sum test found no significant difference between the sentiment scores of the post texts of the selected and non-selected posts.
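The Wilcoxon rank-sum test used here can be sketched in pure Python via the normal approximation. The sentiment scores in the usage example are hypothetical; a production analysis would use a statistics library, and this sketch applies mid-ranks for ties without a tie correction.

```python
import math

def rank_sum_test(x, y):
    """Wilcoxon rank-sum test using the normal approximation.
    Ties receive mid-ranks; no tie correction is applied.
    Returns (z statistic, two-sided p-value)."""
    n1, n2 = len(x), len(y)
    combined = sorted((v, i) for i, v in enumerate(x + y))
    ranks = [0.0] * (n1 + n2)
    i = 0
    while i < n1 + n2:
        j = i
        while j + 1 < n1 + n2 and combined[j + 1][0] == combined[i][0]:
            j += 1
        mid_rank = (i + j) / 2 + 1  # average of the tied positions (1-based)
        for k in range(i, j + 1):
            ranks[combined[k][1]] = mid_rank
        i = j + 1
    w = sum(ranks[:n1])                  # rank sum of the first sample
    mu = n1 * (n1 + n2 + 1) / 2          # expected rank sum under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Hypothetical sentiment scores: selected vs non-selected posts.
z, p = rank_sum_test([0.2, 0.5, 0.1, 0.4, 0.3], [0.3, 0.6, 0.2, 0.5, 0.4])
```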
Replicated sentiment and emotion scores.
The result of the replication provides further evidence to support the findings of the sentiment analysis. On top of replicating the result for post comments, the replication has also found evidence to support the differences in sentiment word usage between the texts of the selected and non-selected posts.
Conclusion
This work examines whether the data obtained through Facebook's Graph API after the introduction of the 600-post-per-year limitation is biased. To answer this question, data covering the same period were collected with the same parameters both before and after the change. The analysis began by comparing the post types of the full and partial datasets: Photo and Video posts are over-represented in the data returned through the new API, while Link posts are under-represented. This paper then turned to user engagement metrics and demonstrated the over-representation of posts with high user engagement in terms of Likes, Shares, and Comments.
To identify the features that would determine the selection of the posts, this work attempts to reverse engineer the ranking algorithm with logistic regression. The estimated model posits that post types, Likes, Angry, Shares, and Likes on Comments (Comment.Likes) are significant predictors of the odds of being selected by the new API. In contrast with the common expectation that posts with higher user engagement would have higher ranks and therefore a higher chance of being selected by the API, the model posits that Likes instead has a negative effect on the odds of being selected, while Comments have no statistically significant association. The finding suggests that the over-representation of Likes and Comments in the partial dataset could be an indirect effect of other features. This is in line with the replicated model, in which Likes and Comments are not significant.
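The logistic regression interpretation above can be made concrete with a small sketch. The coefficients here are hypothetical, with signs loosely mirroring the discussion (Shares positive, Likes negative); they are not the paper's estimates.

```python
import math

def predict_selection_probability(features, coefficients, intercept):
    """Logistic model: the log-odds of selection are a linear combination
    of the features; probability is the logistic transform of the log-odds."""
    log_odds = intercept + sum(coefficients[k] * v for k, v in features.items())
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical coefficients (illustrative values only).
coefs = {"Photo": 0.8, "Video": 0.6, "Likes": -0.001, "Shares": 0.02}

# A Photo post with 150 Likes and 30 Shares under this toy model.
p = predict_selection_probability(
    {"Photo": 1, "Video": 0, "Likes": 150, "Shares": 30}, coefs, intercept=-0.5
)

# Each coefficient corresponds to a multiplicative odds ratio of exp(beta);
# a negative coefficient (here, Likes) yields an odds ratio below 1.
odds_ratios = {k: math.exp(b) for k, b in coefs.items()}
```

This illustrates why a post with many Likes can still be passed over: the other features enter the same log-odds sum and can outweigh, or even reverse, the engagement signal.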
These findings have two implications for data-driven research. First, they have profound implications for studying social processes such as information diffusion, as the under-representation of Link posts means that a significant amount of link-sharing activity would become invisible through the API. Second, the findings suggest that there is no evidence to support the common expectation that the API would rank posts based on the number of Likes and Comments. While the selected posts seem to have more Likes and Comments, other features also affect the odds of being selected. Similarly, it is questionable to assume that the new API would return all the posts with the highest user engagement. Even though the selected posts on average have higher user engagement, some highly commented and liked posts might not be selected due to the effect of other features.
In order to explore whether the ranking algorithm takes the sentiment of the post text and users' responses into account, sentiment analysis was performed. This work finds no evidence of differences in overall sentiment, but the findings suggest that the selected posts tend to have more sentiment words in their comments. For post text, while the original analysis finds no statistically significant difference, a significant association is found in the replication. In short, it is reasonable to believe that, in addition to user engagement metrics and post type, textual content could also affect the odds of being selected. This could result in posts of certain linguistic styles being filtered out. The implication is especially far-reaching for studies of public opinion, for instance, as the new API tends to return posts with more emotional texts.
Furthermore, this paper finds notable differences between the top terms of the partial dataset and those of the random samples, suggesting that non-random factors might be influencing the representation of the most prominent terms in the partial dataset. These differences could also introduce bias into text models.
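A top-terms comparison of this kind reduces to ranking terms by frequency in each dataset and measuring the overlap of the resulting lists. A minimal sketch, using hypothetical post texts and Jaccard similarity as the overlap measure:

```python
from collections import Counter

def top_terms(texts, n=20, stopwords=frozenset()):
    """Rank terms by raw frequency across a collection of post texts."""
    counts = Counter(
        tok for text in texts
        for tok in text.lower().split()
        if tok not in stopwords
    )
    return [term for term, _ in counts.most_common(n)]

def jaccard(a, b):
    """Overlap between two term lists: |A intersect B| / |A union B|."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

# Hypothetical post texts standing in for the partial dataset
# and one random sample of the full dataset.
partial = top_terms(["independence referendum vote", "vote yes vote"])
sample = top_terms(["referendum poll result", "poll turnout vote"])
overlap = jaccard(partial, sample)
```

A markedly lower overlap between the partial dataset and the random samples than among the random samples themselves is what would signal non-random term selection.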
This work also attempts to replicate the findings with another dataset. The mixed results in the analysis of post type, user engagement metrics, and top terms suggest that the bias in any particular feature is likely to exhibit differently on different pages. However, the replication is also able to obtain consistent results for certain features, thereby shedding light on the underlying algorithm that governs the selection of the posts. It is important to note the different time periods between the replication and original datasets; changes in the ranking algorithm could have occurred between them. The author recognises this limitation and encourages further replications if additional datasets become available.
Despite these issues, the data retrieved from the Graph API is still a useful resource that enables a wide range of research methods. The bias created by the ranking algorithm should not stop us from using the data altogether. Instead, it calls for caution, prudence, and critical attention when using and interpreting the data (Rieder et al., 2015). Uncovering the bias of the ranking algorithm will help researchers better qualify and contextualise their research results.
There are two potential problems around the data used in this work that need to be addressed. First, when requesting data from Facebook's Graph API, the request is signed by two tokens: one for the application and another for the user. Therefore, two levels of personalisation may exist. Different applications may return different posts, while different users may also receive different posts as a result of variations in user settings. In order to minimise the effect of personalisation, the data for this work was collected using a "research account", a new Facebook account registered solely for accessing Netvizz. However, this work was not able to compare the personalisation effect across different applications, and such a comparison is no longer possible with the end of research access via the Graph API. Second, limitations in data access may have existed prior to the platform changes in 2017. Villegas (2016), for example, found that posts seemed to disappear from pages occasionally well before 2017. It is possible that this happened in the pages selected by this work, thereby affecting the completeness of the "full datasets". If possible, future studies should consider using data obtained from the page administrators for the comparison.
As for future work, given access to additional data, it is worthwhile to conduct further analysis on data collected from pages of different types and characteristics so as to determine whether the methodology presented here yields similar results. In addition, further work analysing the textual and graphical content of the selected posts could provide further understanding of the effect of the ranking algorithm.
In terms of the current landscape of Facebook research, Facebook has closed all research access via the Graph API since August 2019. As a result, data extraction applications like Netvizz and Netlytic no longer work, with the exception of Facepager, which was granted Page Public Content Access in December 2019 (Jünger and Keyling, 2019). That said, there are still many published, ongoing, and forthcoming research projects that rely on data already retrieved through the API, and this work remains relevant for them. Looking forward, there are currently two major initiatives for the future of Facebook research. First, Social Science One seeks to forge a new type of partnership between academic researchers and the private sector by brokering data access through a secured environment (King and Persily, 2019). Second, in January 2019, Facebook announced its plan to open up access to CrowdTangle to academics and the research community (Silverman, 2019). CrowdTangle is a social analytics tool acquired by Facebook in 2016; it provides access to content from public accounts and is mainly used by newsrooms and media publishers. These initiatives are still at an early stage of development at the time of writing, but the author encourages future research on the quality of the data accessed through them when they become available.
