
Abstract

Twitter continuously tightens the access to its data via the publicly accessible, cost-free standard APIs. This especially applies to the follow network. In light of this, we successfully modified a network sampling method to work efficiently with the Twitter standard API in order to retrieve the most central and influential accounts of a language-based Twitter follow network: the German Twittersphere. We provide evidence that the method is able to approximate a set of the top 1% to 10% of influential accounts in the German Twittersphere in terms of activity, follower numbers, coverage, and reach. Furthermore, we demonstrate the usefulness of these data by presenting the first overview of topical communities within the German Twittersphere and their network structure. The presented data mining method opens up further avenues of enquiry, such as the collection and comparison of language-based Twitterspheres other than the German one, its further development for the collection of follow networks around certain topics or accounts of interest, and its application to other online social networks and platforms in conjunction with concepts such as agenda setting and opinion leadership.

Keywords

public sphere, social media, subgraph sampling, data mining, network analysis, influence, German Twittersphere, follow network, Twitter

Introduction

Twitter is used by individuals, grassroots movements, and political and social elites to directly communicate to the public and influence opinion (see, e.g., Rogers, 2013). The platform appears relatively accessible to researchers because the majority of accounts post publicly and its application programming interface (API) remains rather open in comparison with other online social networks (OSNs). However, after Twitter started introducing increasingly restrictive rate limits and enforcing stricter terms of service regarding the sharing of data in 2012 (see, for example, Puschmann & Burgess, 2014), the global follow network stopped being accessible to the majority of researchers (for exceptions, see Myers et al., 2014).

This development resulted in a lack of independent research on a key mechanism for information diffusion and a global infrastructure for influence. After all, despite sponsored content and algorithmic sampling, follow or subscription networks are for many OSNs still a main predictor for content exposure. While it is a widespread research practice to address this lack by using proxies for networks of attention on Twitter, such as mention, co-hashtag, or retweet networks (e.g., Himelboim et al., 2017), all of these rely on active communication. Therefore, widespread practices such as silent listening may be underrepresented.

This situation is aggravated by a recent change in how Twitter assigns account identification numbers (IDs), which allow the easy retrieval of information from its API. Its consecutive numbering scheme allowed a few independent, technically and monetarily costly projects such as the Australian Tracking Infrastructure for Social Media Analysis (TrISMA) (Bruns et al., 2016) to collect details of public accounts globally. Based on this, national follow networks could be captured and analyzed (Bruns & Enli, 2018; Bruns et al., 2017). However, Twitter closed this possibility by assigning account IDs at random, undermining further data mining efforts following this collection strategy.

These technical restrictions place considerable limitations on research that aims to assess Twitter’s relevance for issues such as news framing, opinion leadership, and intermedia agenda setting processes (Iyengar et al., 2010; Scheufele & Tewksbury, 2007; Watts & Dodds, 2007). For example, Barberá (2015) derives a measure of individual ideological scaling based on the network position of Twitter users, demonstrating that the latter reliably predicts the former. Conway et al. (2015) are able to show that intermedia agenda setting takes place in US politics, with the Twitter messages of political candidates in 2012 both predicting and echoing mainstream media messages, while Colleoni et al. (2014) investigate political homophily in US politics in networks of reciprocated and non-reciprocated ties among the followers of the two main parties. In contradiction to widespread beliefs regarding online echo chambers as largely self-contained and insular environments, Valenzuela et al. (2017) find a greater effect of Twitter on television news than vice versa. However, all such studies have in common that in those cases where they describe processes within the platform, they should be informed by a realistic view of the overall network.

Therefore, we present a large-scale test of a new Twitter follow network mining technique, building on the so-called rank degree method (Salamanos et al., 2017a, 2017b, 2017c; Voudigari et al., 2016), which we describe in the “Methods and Analysis” section below. As this walk-based technique only requires local information to sample a graph, we were able to adapt it as a data mining method for the follow networks of influential Twitter users, using the cost-free Twitter standard APIs. This approach has been tested by using the method to draw a sample of the German-speaking Twittersphere.

Our analysis shows that the sample exhibits influence and activity measures orders of magnitude higher than a random sample of the same size from a near-complete collection of German-using Twitter accounts from the 2016 TrISMA dataset (Bruns et al., 2016). This is evidence that our sample represents an influential backbone, an approximation of the proverbial, highly influential “top 10 percent” of this language-based Twittersphere. A test study employing community detection and keyword extraction shows that this network sample is suitable for investigating large-scale topical communication structures, corresponding with issue publics or communities in a language-based Twittersphere.

In light of the continuously tightening restrictions on the Twitter standard APIs, our adaptation of the rank degree method for gathering network samples is a valuable alternative to more brute-force approaches, as it can be implemented by small teams at low costs in terms of time and budget. Even though Twitter has made it almost impossible for independent researchers to capture comprehensive, large-scale follow networks, our method works around these restrictions to produce an overview of the overall structure of such networks, in this case, for the German Twittersphere.

Opportunities Opened Up by Follow Network Samples

Nation-level data about follow networks enable a multitude of avenues of enquiry. Among others, they make further data collection possible, such as constant monitoring of the most influential accounts, without being restricted to single topics. For example, this allows for a platform-independent assessment of trending topics and public opinion on a platform, of behavioral changes that are the result of adjustments made to recommendation algorithms, or of the positions, roles, and influence of automated accounts.

Combined, these data support the further development of media and communication theory as well as social theory regarding networked public spheres on an empirical basis. In the case of Australia, this has already been shown by Bruns and Highfield (2016): they can base their reappraisal of public sphere theory by Habermas (1962)—in the form of a more up-to-date concept of a networked public sphere—on a complete collection of the Australian Twitter follow network, detectable topical communities, and the localized spread of hashtags within it.

Also, research methods and paradigms that have their roots in more qualitative practices can benefit from this kind of large-scale data. For example, Dehghan (2018) grounds his discourse analysis of polarizing discussions on Twitter about the Australian Racial Discrimination Act on the same dataset.

Beyond academic research, large-scale social media data of this kind are clearly useful in political and commercial contexts. In public relations and (influencer) marketing, the benefit of being able to get an overview of the communication landscape of an OSN is obvious. Furthermore, finding answers to questions of (media) policy, for example, regarding the fragmentation of the media audience as described by McQuail (2010, pp. 444–445), can be supported with these data.

The hypothesized fragmentation of online news audiences represents another area of research in which a holistic view of national Twitterspheres can yield decisive results, in some cases allowing inferences about news consumption that extend beyond social media. For example, there is evidence for shared patterns of public attention that reach from social media to mass media on a transnational level and exhibit fragmentation that is accompanied by a high degree of audience duplication in select contexts (Fletcher & Nielsen, 2017; Tewksbury, 2005; Webster & Ksiazek, 2012). Such research can be augmented by relating the news sharing behavior on Twitter to the network structure to determine whether online news audiences are fragmented along network community structures or, conversely, no visible relationship between news preferences and network structure exists.

Problem: Restricted API Access for Researchers to Gather Follow Network Samples

This kind of data, however, has become less and less accessible in recent years. In part, this has resulted from increasingly strict privacy legislation (such as the European GDPR) taking effect and the subsequent arrangements that Twitter has made to comply with such regulation. However, the primary purpose of Twitter’s APIs has never been to support academic research, but to enable developers to build products that make use of Twitter data. Therefore, information that is not necessarily needed for such products may be withheld or not even stored, rather than being made accessible to the research community.

In fact, Twitter offers three different kinds of APIs, namely the standard, the premium, and the enterprise APIs (Tornes, 2017), and the free standard APIs are the most restrictive. The standard APIs allow their users to retrieve account information and also to query tweets. For both functions, API calls are limited and have a cooldown time. For example, when looking up the friends¹ of a user, a single API user may execute a maximum of 15 calls every 15 min, retrieving up to 5,000 friends per call, so a maximum of 75,000 friends of 1 to 15 accounts can be retrieved within 15 min.
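The arithmetic behind these limits can be made explicit with a short sketch. It only encodes the figures stated above (15 calls per 15-minute window, up to 5,000 friends per call); the function names and the `api_keys` parameter are illustrative assumptions and not part of any Twitter library.

```python
import math

# Limits of the friends lookup as stated in the text.
CALLS_PER_WINDOW = 15      # calls allowed per window and API key
WINDOW_MINUTES = 15        # length of one rate-limit window
FRIENDS_PER_CALL = 5_000   # maximum number of friends returned by one call


def max_friends_per_window(api_keys=1):
    """Upper bound on friend IDs retrievable in one window (hypothetical helper)."""
    return api_keys * CALLS_PER_WINDOW * FRIENDS_PER_CALL


def calls_for_full_friend_list(friend_count):
    """Calls needed to page through all friends of a single account."""
    # Even an account without friends costs one call to find that out.
    return max(1, math.ceil(friend_count / FRIENDS_PER_CALL))


def windows_needed(friend_counts, api_keys=1):
    """Number of rate-limit windows consumed for a list of accounts."""
    total_calls = sum(calls_for_full_friend_list(c) for c in friend_counts)
    return math.ceil(total_calls / (api_keys * CALLS_PER_WINDOW))
```

For instance, `max_friends_per_window()` reproduces the 75,000 figure, while a single account with 50,000 friends already consumes 10 of the 15 calls available in a window.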

The main advantages of the premium and enterprise APIs are higher rate limits for receiving tweets as well as access to a longer history of tweets and account activity-related features.² However, researchers face affordability and accessibility issues with both of these services, as access to the enterprise API is only sold through direct contact with a salesperson (prices are not mentioned online), and access to the premium API is also subject to case-by-case approval. This approval process seems to be more restrictive than the one for the Twitter standard API.

In the past, researchers were able to make use of Twitter’s cost-free standard API to gather large collections of Twitter accounts, culminating in a collection by Bruns et al. (2016). Their collection method exploited the fact that Twitter assigned consecutive account IDs. It queried the free API for every consecutive possible account ID, gathering data for almost every global account in 2016. Twitter has since changed its policy of assigning account IDs to a random system. This involves much higher numbers than there are Twitter accounts, and thus renders more recent collections using this method impossible.

The APIs are also undergoing continuous changes, as certain user account properties are becoming protected and are thus no longer accessible to researchers via the API. These properties include geotags, user time zone, and the interface language used on Twitter, all of which were accessible in the past but are inaccessible by now.

Objectives: Test of a Sampling Method and Data Mining of the German³ Twittersphere

Given those restrictions, this project’s purpose was two-fold. First, we tested an adaptation of a sampling method for influential nodes in a network that has shown promising “lab” results (see below) as a data mining method “in the wild,” using the cost-free Twitter standard APIs only. Our focus here was especially to test the practical feasibility of the method for a small research team with limited resources and to explore which adaptations have to be made for it to work under this objective. Furthermore, we also probed the usefulness of the sample gained in this way for some of the research opportunities mentioned above, by identifying topical communities among the most influential accounts in the German-speaking Twittersphere.

Second, this project’s objective was to open up further avenues of enquiry, either by providing data on the German Twittersphere for other projects (especially when this is not possible while staying compliant with Twitter’s terms of service) or by providing other researchers with the method and its implementation, that is, the code for a prototype, for sampling either the same data or other language-based Twitterspheres. The code can be found under an open-source license in the Supplementary Materials of this article as well as in an online git repository.⁴ We invite other interested parties to support us with its further development there.

Background

While Twitter does not represent the general population of a country or language domain, it is an interesting population in itself. This has motivated a number of successful attempts to sample or completely collect other national Twitterspheres. However, all these attempts were either not based on the Twitter follow network or relied on properties of the Twitter standard APIs that are no longer available.

Representativeness of Twitter Data and Representativity of Social Network Samples in General

A common criticism of Twitter research, and social media research in general, is that its users are not representative of the general population of a country or language domain at large (e.g., Blank, 2017; Mellon & Prosser, 2017). We do not claim this kind of representativity. However, Twitter users by themselves represent a population of interest in many countries worldwide. Even in Germany, where Twitter plays a comparatively smaller role (it is ranked 16th in Germany according to Alexa,⁵ compared to rank 8 in the USA⁶ and rank 11 worldwide⁷), 4% of the German-speaking population over 14 are using Twitter at least once per week and this value appears to be stable over the last few years according to representative surveys (Frees & Koch, 2018). This makes Twitter a niche network, compared to competitors like Facebook and Instagram. However, its reach is comparable with the 3% of weekly users of audio podcasts in Germany in 2018 (Frees & Koch, 2018), or 4.5% who had a subscription to a national newspaper in Germany in 2017 (Pasquay, 2018). Moreover, the results of a study by von Nordheim et al. (2018) of broadsheet newspapers in the United Kingdom (The Guardian), the United States (The New York Times), and Germany (Süddeutsche Zeitung) indicate not only that social media “resurged massively” (von Nordheim et al., 2018) as a news source between 2009 and 2016, but also that Twitter is more commonly used as a news source by journalists than Facebook and represents an “elite channel” (von Nordheim et al., 2018). In accordance with this assessment, Wojcik and Hughes (2019) found that in the United States, highly educated and wealthy people below 49 years are overrepresented on Twitter, while Blank (2017) makes similar observations regarding the United Kingdom. 
Although Twitter users in the United States were more Democratic-leaning than the average US adult, race and gender distributions showed only minor disparities with the population average. Women and users who tweet about politics, however, were overrepresented within the most active 10% of Twitter users.

These top 10% being responsible for 80% of the tweets within the dataset analyzed by Wojcik and Hughes (2019) points to another issue with representativity in OSNs. It lies in the fact that OSNs usually exhibit heavily skewed distributions regarding activity and connectivity. This means that a traditional understanding of representativity based on common statistical approaches, which often assume normal distributions, will not be useful. Consequently, in this article, we move our focus in terms of sample quality from typicality (i.e., traditional representativeness) to another goal: getting the most influential accounts and a backbone structure of the network at the smallest possible cost.

Related Research: Location- or Language-Based Twittersphere Collections

This project is preceded by and builds on previous successful endeavors to collect other language- or nation-based Twitterspheres. To our knowledge, the first project mapping a national Twittersphere was conducted by Ausserhofer and Maireder (2013). However, while innovative at the time in their methods and with impressive results given the exploratory nature of their study, they analyzed @-mention networks of only a few hundred accounts whose collection was based on keywords related to Austria. Geenen et al. (2016) followed a similar approach for the Netherlands, also analyzing @-mention networks based on a keyword search for specifically Dutch terms. Ausserhofer and Maireder (2013) argue that @-mentions would be a “better” measure of influence than follower numbers. However, this claim may be an overinterpretation of a study by Cha et al. (2010), which showed that for the top 10 percent of Twitter accounts in terms of follower numbers the number of retweets and replies was only weakly rank-correlated with the follower numbers. Closer inspection of this study also shows that retweet- and mention-based influence fluctuates over time and by topic. In any case, while @-mention networks are easier to collect, because their collection is less restricted by the Twitter standard APIs, their explanatory power, if used alone, depends on the definition of influence and neglects the influence of highly followed accounts on silent listeners on Twitter. Especially because a reaction to a tweet with an @-mention or retweet depends on having seen this tweet first, @-mention networks should be analyzed in combination with follow networks.

Bruns et al. (2017) presented an analysis of a follow network based on a comprehensive collection of Australian Twitter accounts in 2016 by TrISMA (Bruns et al., 2016). As explained above, the global dataset it was based on comprises all Twitter users of 2016, and its collection was possible because Twitter assigned user IDs consecutively. This dataset could then be filtered down to accounts that are likely to be located in Australia, and all their connections could be collected, albeit with great effort. While the same method allowed the analysis of the Norwegian Twittersphere (Bruns & Enli, 2018), this has been rendered impossible by now.

Methods and Analysis

In this section, we will first describe our sampling method, which we based on the so-called rank degree method (Voudigari et al., 2016), adapted for directed networks, and optimized for practicality and efficiency with the cost-free Twitter standard APIs. After this, we describe our assessment of the sample’s quality in accordance with our sampling goals: the collected accounts’ influence as measured by coverage, reach, activity, and follower numbers. Finally, we present methods and results of a test study using our sample, which yields promising results: by means of community detection, keyword extraction from tweets, and a manual inspection of account descriptions, we are able to detect topical communities within this highly influential sample of the German Twittersphere. Our results showcase an intuitive overview of the structure of this part of the German networked public sphere that resembles preceding analyses of the Australian Twittersphere.

Sampling

Our sample was collected from mid-December 2018 until the end of May 2019 (Figure 1). While we almost reached our initial goal of collecting 1 million accounts (937,809 accounts were collected; the goal was mostly determined by the time and resources available for the overall project and represents about 5% of the number of German-using accounts as determined by TrISMA), our collection was stopped because the Twitter standard API was changed and the interface language account property, on which our method relied, was made private. However, our collection tool, as it is available today, has been updated so that it infers an account’s language by identifying the tweet language(s).

Figure 1.

Sample size as measured by the total number of edges over time (December 17, 2018, to May 28, 2019).

The Rank Degree Method

To draw a sample of the most influential accounts within the German Twittersphere, we chose to develop and test a modification of the rank degree method.

The rank degree method (Voudigari et al., 2016) is a mostly deterministic, walk-based algorithm that only requires local information to sample a graph. Assume an undirected network (or a directed network in which all edges are reciprocal), given initial nodes as seeds and a desired sample size x. Then the rank degree process works as follows:

1. For each seed w, find the connected node v with the highest degree;

2. Update the sample with the selected edges (w, v) and the symmetric one (v, w);

3. Update the source graph by removing the edges selected in Step 2;

4. Update seeds so that only the new node v for every w is kept;

5. If all vs in the seeds are leaves (degree = 1), select new initial seeds;

6. Repeat Steps 1–5 until sample is of size x.
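The six steps above can be sketched in a few lines of Python. This is a minimal illustration and not the reference implementation of Voudigari et al. (2016): it assumes that the graph is given as a dictionary mapping each node to the set of its neighbours, that node labels are sortable, that the sample size x is counted in nodes, and that ties between equally ranked neighbours are broken by node order. The function name `rank_degree_sample` is hypothetical.

```python
import random


def rank_degree_sample(adjacency, num_seeds, sample_size, rng=None):
    """Sketch of the rank degree walk on an undirected graph (k = 1)."""
    rng = rng or random.Random()
    # Work on a copy, because sampled edges are removed from the source graph.
    graph = {node: set(neighbours) for node, neighbours in adjacency.items()}
    sample_nodes, sample_edges = set(), set()
    seeds = set(rng.sample(sorted(graph), num_seeds))

    # Stop when the sample is large enough or no edges are left to walk.
    while len(sample_nodes) < sample_size and any(graph.values()):
        # Step 1: for each seed, pick the neighbour with the highest degree.
        selected = []
        for w in sorted(seeds):
            if graph[w]:
                v = max(sorted(graph[w]), key=lambda n: len(graph[n]))
                selected.append((w, v))
        for w, v in selected:
            # Step 2: add the edge and its symmetric counterpart to the sample.
            sample_edges.update({(w, v), (v, w)})
            sample_nodes.update({w, v})
            # Step 3: remove the selected edges from the source graph.
            graph[w].discard(v)
            graph[v].discard(w)
        # Step 4: the selected nodes become the new seeds (duplicates collapse).
        seeds = {v for _, v in selected}
        # Step 5: if no seed can move on, draw new seeds that still have edges.
        if not any(graph[v] for v in seeds):
            candidates = sorted(n for n in graph if graph[n])
            if not candidates:
                break
            seeds = set(rng.sample(candidates, min(num_seeds, len(candidates))))
    return sample_nodes, sample_edges
```

Storing the seeds in a set means that two walkers arriving at the same node automatically proceed as one. Because edges are removed after every iteration, node degrees and hence the ranking change while the walk proceeds.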

From the steps outlined above, it becomes evident that the process is deterministic in that it completely depends on the initial seeds chosen (and Step 5 is executed on the new initial seeds). Moreover, any selected node can be visited multiple times, but each time from a different node. If any two walkers visit the same node at the same time, those two walkers will collapse and further proceed as one. By updating the graph every time, the process ultimately alters the nodes’ degree and ranking, resulting in a dynamic sampling process.

Note that there exist two versions of this algorithm. The first version works as outlined above, while with the second version, instead of only the top connected node in Step 1, the top k connected nodes would be selected, where k is calculated by ρ * #connections(w), 0 < ρ ⩽ 1. The first version has been shown to produce better samples in general, and Salamanos et al. (2017b, 2017c) discuss the algorithm in this form. Comparing the algorithm to other graph sampling methods such as Forest Fire, Frontier Sampling, Metropolis Hastings, and random sampling methods, they find that samples generated by their algorithm preserve several graph properties to a large extent and that influential spreaders can be identified at almost the same accuracy as in the full information case when having explored only 20% of the network.

Our Adaptation and Implementation of the Rank Degree Method

With any k > 1, the algorithm would generate seeds in an exponential manner. This would render its execution with the cost-free Twitter standard APIs practically impossible. While the use of k = 1 is still within the original definition of the rank degree method, further modifications to the algorithm for the purpose of mining Twitter were necessary since the rank degree method was not defined for directed networks by Voudigari et al. (2016) or Salamanos et al. (2017a, 2017b, 2017c). Moreover, as Coscia and Rossi (2018) point out, in real-world data collections, the assessment of the quality and performance of an algorithm has to balance efficiency regarding the API restrictions with sample quality. Accordingly, we made a number of adjustments, which we explain in this section, that mainly serve a more efficient collection and easier parallelization in possible future developments.

In contrast to Voudigari et al. (2016) and Salamanos et al. (2017a, 2017b, 2017c), the full network is not available to us; therefore, we had to access the required information directly from the Twitter standard API. While it was possible to increase the collection speed by using 12 API keys provided by personal accounts of the authors and other project contributors at our institutions, all API endpoints impose a quota of calls per time window and per API key.

Before starting the sampling algorithm, we had to define a seed pool, a collection of Twitter IDs from which initial seeds can be drawn at random (cf. The Rank Degree Method). In our case, we used a list of all approximately 15 million Twitter accounts that had their interface language set to German according to the dataset of all Twitter accounts in 2016 as collected by TrISMA (Bruns et al., 2016). The mining process was then set in place as follows:

1. From the seed pool, draw a random account w;

2. Look up the last 5,000 friends of w and rank them by their follower count;

3. Choose the friend v with the highest follower count to whom there is an unburned edge and whose interface language is set to German;

4. If w has no friends or if all outgoing follow connections were already burned, jump to a randomly drawn seed from the seed pool;

5. Update the sample with the selected follow connection (w, v) and the symmetric follow connection (v, w) if it exists;

6. “Burn”/save the follow connection so that it cannot be walked again;

7. Repeat Steps 2–7 with v as the new starting point.
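A single walker following the mining steps above might look like the sketch below. It is an illustration under stated assumptions, not the published prototype: `fetch_friends` is a hypothetical callback standing in for the cached API lookup and is assumed to return the most recent friends of an account as dictionaries with the keys `id`, `followers`, and `lang`; the symmetric connection is detected by checking whether w appears among the friends of v; and a walker also jumps to a new seed when none of the remaining friends uses the target language.

```python
import random


def walk(seed_pool, fetch_friends, max_steps, language="de", rng=None):
    """Sketch of one walker of the adapted rank degree process."""
    rng = rng or random.Random()
    burned = set()   # directed edges that may not be walked again
    sample = set()   # collected follow connections as (follower, friend)
    w = rng.choice(seed_pool)                      # Step 1
    for _ in range(max_steps):
        # Step 2: look up at most 5,000 friends and rank them by followers.
        friends = fetch_friends(w)[:5000]
        ranked = sorted(friends, key=lambda f: f["followers"], reverse=True)
        # Step 3: best-ranked friend with the target language and unburned edge.
        v = next(
            (f["id"] for f in ranked
             if f["lang"] == language and (w, f["id"]) not in burned),
            None,
        )
        if v is None:
            # Step 4: nothing left to walk from w, so jump to a random seed.
            w = rng.choice(seed_pool)
            continue
        # Step 5: keep the edge and, if it exists, the symmetric one.
        sample.add((w, v))
        if any(f["id"] == w for f in fetch_friends(v)):
            sample.add((v, w))
        burned.add((w, v))                         # Step 6
        w = v                                      # Step 7
    return sample
```

Only the directed edge (w, v) is burned, so the walker may later travel back along (v, w) even though that edge is already part of the sample.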

To use the full capacity of the API calls available to us, we used 200 parallel walkers. In contrast to the original rank degree method, and for reasons of time efficiency, we do not let the walkers collapse when they land on the same node but let them execute consecutively. While close to the feasible maximum with the API restrictions in place, 200 is a massively lower number of walkers than what has been used in the tests of the original rank degree method, which was 1% of the total number of nodes. In our case this would equal over 25,000 walkers.⁸

This is not the only adaptation that we decided to make for practicality reasons. Instead of choosing the friend with the highest degree, that is, the one with the most connections, our process (1) only looks up the last 5,000 friends and then (2) chooses the one with the most followers if (3) it has its interface language set to German.

Adaptation (1) was made for three reasons. First, it takes exactly one API call to access up to 5,000 of the most recent friends of an account. By default, only 15 API calls every 15 min are allowed for this endpoint per API key. Since there exist Twitter accounts with 50,000 friends and more, the time needed would have been increased unjustifiably for those accounts. Second, as every account can pay only limited attention to its friends, the more friends an account has, the less they count individually. Therefore, including all friends of an account that follows several thousand accounts would give those connections an importance which they most likely do not possess. This renders spending precious API calls beyond a 5,000 friends limit unjustifiable from our perspective. Third, while the original rank degree method for undirected networks uses the degree, we use the follower number only as the (approximate) in-degree, because the number of followed accounts, or out-degree, is no indicator of an account’s influence (Cha et al., 2010).

Adaptation (2) is a simplicity trade-off, acknowledging the fact that our sampling method can only be an approximation of the rank degree method. We do not possess knowledge about the whole network, and the usage of the follower number is only a heuristic for assessing influence, an approximation of the in-degree in our network of interest, the German Twittersphere. The actual in-degree within the German network is, in some cases, lower, due to non-German followers, and subject to change during the months of our collection. Therefore, dynamically adapting the degree was neglected in favor of an easier parallelization of the walkers and a cache database of fixed account details, such as the friend connections and follower numbers, that led to a significant speedup of our collection.

Finally, since our aim was to collect a sample of the German Twittersphere, we used the accounts’ interface language as a filter criterion, hence (3).

Another significant change to the original algorithm is made in Step 5. Voudigari et al. (2016) and Salamanos et al. (2017a, 2017b, 2017c) only apply their method to undirected networks, so the symmetric connection is always added to the sample in their case. In the directed case, there is not necessarily a symmetric connection, but, as in the original rank degree method, we add it if there is one. We do this because we expect it to ensure that accounts with only a few but high in-degree followers will receive more representative scores regarding centrality measures such as PageRank, Betweenness, or Eigenvector centrality. Note, however, that, as in the original rank degree method, even though both edges are added to the sample, one of them, (v, w), can still be walked, as only the directed edge is burned. Figure 2 illustrates the sampling procedure.

Figure 2.

Our adaptation of the rank degree algorithm. The top panel represents the sample after every iteration, and the bottom panel represents the underlying network without the removed edges. The example network is based on a student interaction network (Heidler et al., 2014), filtered for in-degree > 3, as available from https://github.com/gephi/gephi/wiki/Datasets.

Evaluating the Sample Quality

As we do not possess knowledge about the whole network, assessing the sample quality, as done for the original rank degree method, was not possible for the German follow network. Nor did we expect or aim for any kind of typicality of the sampled accounts’ properties in comparison with a population, as would be necessary for a representative sample in a traditional sense. Our aim was to approximate the proverbial 1% to 10% at the top of the German Twitter population, which are characterized by orders of magnitude higher activity, coverage, reach, and follower numbers than the remaining 99% to 90%.

Activity and Centrality

As the seeds are just randomly selected in order to find influential accounts, we excluded those seeds from the sample that do not have an incoming edge from another node for assessing the activity and follower numbers of our sample. This filtering left us with about 197,000 unprotected accounts, whose activities and follower numbers we could access. As can be seen in Figure 3, even in this subsample, there is a large number of accounts who have not been active for years. Whether this is due to actual inactivity or simply silent usage of the platform cannot be determined here. However, over 42% of our sample have posted at least one tweet from the beginning of 2019 until the end of our network collection in May.

Figure 3.

Distribution of the date of the last status by accounts in our sample at the end of the network collection timeframe (May 2019).

Nevertheless, as depicted in Figure 4, where we compare the distributions of the number of tweets per day since the account creation day by accounts in this sample with the same data in the subset of German-using accounts from the TrISMA collection, we can see that there is a pronounced qualitative difference in activity. The accounts in our sample are orders of magnitude more active in terms of tweets per day than this benchmark.
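The activity measure compared here, tweets per day since account creation, is a simple ratio. A minimal sketch, assuming the total status count and the creation date are available per account; the function name and the guard against accounts created on the observation day are illustrative choices, not taken from the study:

```python
from datetime import date


def statuses_per_day(statuses_count, created_on, observed_on):
    """Average number of statuses per day since the account was created."""
    # Count at least one day so that brand-new accounts do not divide by zero.
    days_active = max((observed_on - created_on).days, 1)
    return statuses_count / days_active
```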

Figure 4.

Comparison of our sample (“Sample”) with all accounts collected in 2016 by TrISMA (“Benchmark 2016”) regarding the distribution of the statuses per day since account creation.

A similar picture is drawn if we inspect follower numbers: Again, Figure 5 shows a comparison between our sample (without seeds that remained leaves) and the entirety of Twitter accounts in 2016 that had set their interface language to German. Here too, the distributions of follower numbers show a substantial qualitative difference, with the typical follower numbers in our sample being multiple times higher than in the benchmark.

Figure 5.

Comparison of our sample with all accounts collected in 2016 by TrISMA regarding the distribution of the follower count at the time of the sample collection. The spike between 100 and 1000 accounts is caused by a fully connected bot-net.

In summary, this sample indeed exhibits a high-influence profile in terms of activity and in-degree centrality (as measured by the follower numbers reported by the Twitter API). We will therefore refer to it as the influencer sample from now on.

Coverage and Reach

However, activity alone does not translate to influence in terms of content exposure. Therefore, we tested what we call the “coverage” of our influencer sample: the typical percentage of a German-using Twitter account’s friends that are in our sample. Again and for the same reasons as above, we filtered out seeds that have no incoming edges in our sample. However, as we did not require protected information this time, the influencer sample size remained slightly higher at about 199,000 accounts for the following tests.

For this purpose, we drew a random sample of 1,000 accounts from the German TrISMA collection and retrieved their actual friends (including those with an interface language other than German) from the Twitter API. From here on, this sample will be called the test sample. Of course, the final size of this test sample was reduced due to deleted and protected accounts. Furthermore, we excluded accounts with fewer than two friends to avoid misleading coverage values of 100% and 0%.⁹

As a baseline, we drew a random sample from the German-using accounts in the TrISMA collection with the size of our influencer sample (ca. 199,000 accounts). Then, we evaluated the coverage of the influencer and the baseline sample for accounts in the test sample.
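The coverage measure defined above can be made concrete with a short Python sketch. It assumes that the friends of each test account are available as a mapping from account ID to a list of friend IDs. The function names and this data layout are illustrative assumptions, not the study's implementation.

```python
from statistics import mean, median


def coverage(friends, sample):
    """Percentage of one account's friends that are contained in `sample`."""
    friends = set(friends)
    return 100.0 * len(friends & sample) / len(friends)


def coverage_summary(test_friends, sample, min_friends=2):
    """Summarize coverage over all test accounts with enough friends.

    test_friends: dict mapping a test account to its friend IDs (assumed layout).
    sample: set of account IDs, e.g. the influencer or the baseline sample.
    min_friends: accounts with fewer friends are skipped, as in the text.
    """
    values = [coverage(friends, sample)
              for friends in test_friends.values()
              if len(set(friends)) >= min_friends]
    return {"n": len(values), "mean": mean(values), "median": median(values)}
```

Running `coverage_summary` once with the influencer sample and once with an equally sized random baseline yields the kind of comparison reported in Table 1.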

As can be seen in Table 1, the mean and median coverage of our influencer sample are both at about 40%, compared to a mean of 0.5% and a median of almost 0% for the baseline sample. However, as distributions are often heavily skewed in networks, mean and median do not tell the whole story. As can be seen in the distribution plots in Figures 6 to 9, our influencer sample differs drastically from the baseline sample in terms of coverage distribution, so it is evident that we observe a different class of accounts here. When ignoring accounts with 0% coverage, for the influencer sample, Figure 6 shows a distribution of coverage resembling a normal distribution around the mean/median of 40%. In other words, on average, 4 out of 10 friends of a German-using Twitter account are in our sample.

Table 1.

Count, mean, standard deviation, minimum, quartiles, and maximum of the number of friends and the percentages of friends in the influencer and baseline sample for public accounts in the test sample with at least two friends.

N = 597	Number of friends	Percentage of friends in influencer sample	Percentage of friends in baseline sample
M	57.1	40.2	0.5
SD	160.4	30.3	2.7
Min.	2	0	0
25%	7	11.4	0
50%	18	40	0
75%	42	64.7	0
Max.	1988	100	50

Figure 6.

Distribution of accounts in the test sample over the percentage of their friends that can be found in the influencer sample (filtered for in-degree ⩾1, leaving 199,180 accounts).

Figure 7.

Distribution of accounts in the test sample over the percentage of their friends that can be found in the baseline sample (199,180 accounts drawn randomly from German-using accounts in TrISMA collection).

Figure 8.

Rank-coverage distribution of accounts in the test sample with at least two friends for the influencer sample (filtered for in-degree ⩾1, leaving 199,180 accounts).

Figure 9.

Rank-coverage distribution of accounts in the test sample with at least two friends for the baseline sample (199,180 accounts drawn randomly from German-using accounts in TrISMA collection).

The categorical difference between our influencer sample and the baseline sample becomes even clearer when examining the rank distributions of coverage: while Figure 8 shows a linear decline of coverage with rank for the influencer sample, Figure 9 illustrates that in the random baseline Twitter sample, coverage follows a seemingly exponential decline. The same holds true for the more intuitive concept of reach, that is, the percentage of accounts in the test sample reached by accounts in the influencer and baseline sample, respectively. Here, Figures 10 and 11 show that while the top 10 accounts in the influencer sample each reach 8% to 10% of the test sample, not even the top account in the baseline sample reaches 2% of the test sample. In total, the influencer sample reaches 85% of the test sample accounts with more than one friend.
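Reach, as described above, can be computed from the same friend lists as coverage. The following Python sketch is only an illustration. The function names and the dictionary layout of `test_friends` are assumptions, and a test account counts as reached by a sample account if it follows that account.

```python
def reach_per_account(test_friends, sample):
    """Percentage of test accounts that follow each account in `sample`.

    test_friends: dict mapping a test account to its friend IDs (assumed layout).
    sample: set of account IDs (influencer or baseline sample).
    Sample accounts that nobody in the test sample follows are omitted.
    """
    n_test = len(test_friends)
    followers = {}
    for friends in test_friends.values():
        for friend in set(friends) & sample:
            followers[friend] = followers.get(friend, 0) + 1
    return {account: 100.0 * count / n_test
            for account, count in followers.items()}


def total_reach(test_friends, sample):
    """Percentage of test accounts that follow at least one sample account."""
    reached = sum(1 for friends in test_friends.values()
                  if set(friends) & sample)
    return 100.0 * reached / len(test_friends)
```

Sorting the values of `reach_per_account` in descending order gives a rank-reach distribution of the kind shown in Figures 10 and 11, and `total_reach` corresponds to the overall share of reached test accounts.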

Figure 10.

Rank-reach distribution of accounts in the influencer sample (filtered for in-degree ⩾1, leaving 199,180 accounts).

Figure 11.

Rank-reach distribution of accounts in the baseline sample (199,180 accounts drawn randomly from German-using accounts in TrISMA collection).

In summary, our influencer sample shows not only a class difference in activity and follower numbers compared with the average, but it also contains on average 40% of the friends of a German-using Twitter account and reaches 85% of accounts in the test sample with more than one friend. If we use 2.5 million weekly active accounts in Germany (Frees & Koch, 2018) as a conservative population estimate (instead of 15 million based on the TrISMA collection), our sample still represents less than 10% of this population. Taking this and everything above into account, we conclude that the influencer sample is a good approximation of the most influential core of the German-using Twittersphere.
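The population shares mentioned above can be checked with the approximate figures given in the text: about 199,000 accounts in the influencer sample, and 2.5 million or 15 million accounts as the two population estimates. The symbols below are introduced for illustration only.

```latex
\frac{n_{\text{sample}}}{N_{\text{weekly active}}}
  \approx \frac{199{,}000}{2{,}500{,}000} \approx 8.0\%,
\qquad
\frac{n_{\text{sample}}}{N_{\text{TrISMA}}}
  \approx \frac{199{,}000}{15{,}000{,}000} \approx 1.3\%
```

Both values lie within the targeted range of 1% to 10% of the population.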

Test Case: Topical Communities in the German Twittersphere

To test the suitability of our adaptation of the rank degree method to investigate the overall structure of a language-based Twittersphere, we replicated an analysis of the full Australian Twitter follow network by Münch (2019, Chapter 6) with the 3-core of the full sample.¹⁰ In the Australian case, this analysis combined community detection within the follow network and keyword extraction from the tweets of the respective communities to detect Twitter accounts with common topical interests and reveal the overall structure of the Australian Twittersphere. The filtering for the 3-core was done in order to avoid trivial star-shaped follow-back communities, which seem to be an artifact of the sampling method and affected the detection of useful communities. This filtering left us with a network of about 66,000 nodes and ca. 655,000 edges, that is, less than 10% of the full network’s accounts but over 40% of its edges. Consequently, it has to be noted again that this analysis focuses on the central core of influential accounts in the German Twittersphere and not on average German Twitter accounts.
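The k-core filtering can be illustrated with a minimal Python sketch that repeatedly removes accounts with fewer than k connections. The text does not state how degree was counted in the directed follow network, so treating edges as undirected (with a mutual follow counted once) is an assumption here. The function name and edge representation are hypothetical as well.

```python
from collections import defaultdict


def k_core(edges, k=3):
    """Return the set of nodes in the k-core of an undirected view of `edges`.

    edges: iterable of (follower, followed) pairs; direction is ignored
           and self-loops are dropped (assumption of this sketch).
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)

    # Start with every node whose degree is already below k.
    queue = [node for node, nbrs in neighbors.items() if len(nbrs) < k]
    removed = set()
    while queue:
        node = queue.pop()
        if node in removed:
            continue
        removed.add(node)
        # Removing a node lowers the degree of its remaining neighbors,
        # which may push them below k as well.
        for nbr in neighbors[node]:
            neighbors[nbr].discard(node)
            if nbr not in removed and len(neighbors[nbr]) < k:
                queue.append(nbr)
    return set(neighbors) - removed
```

Star-shaped structures, in which peripheral accounts are connected to a single hub only, disappear under this filter because their leaves never reach degree 3.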

Community Detection

Instead of the Parallel Louvain Method (PLM; Staudt et al., 2016) that was used by Münch (2019) and is based on modularity maximization, ergo on a density-based understanding of community (Coscia et al., 2011), we used the non-hierarchical, non-overlapping version of the Infomap algorithm (Rosvall & Bergstrom, 2008; Rosvall et al., 2009). This entropy-based algorithm groups nodes together so as to shorten the theoretical description length of the path of a random walker through the network. As a result, areas where a random walker would likely spend more time in a row are grouped together. In our case, if a tweet were shared randomly along the network, it would, on average, stay within those communities for a longer time before leaving them. This intuitive interpretation and the fact that it allows for a directed interpretation of the network, as well as other statistical advantages of the Infomap method, led to the decision to present its results instead of the results of the modularity maximization–based algorithm.¹¹
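For reference, the description length that Infomap minimizes for a two-level partition M into m communities is commonly written as the map equation (Rosvall et al., 2009). The notation below follows common usage and is not taken from this article.

```latex
L(\mathsf{M}) = q_{\curvearrowright}\, H(\mathcal{Q})
  + \sum_{i=1}^{m} p_{\circlearrowright}^{i}\, H(\mathcal{P}^{i})
```

Here, q_↷ is the per-step probability that the random walker switches communities, H(Q) is the entropy of the community-level codebook, p_↻^i is the rate at which the codebook of community i is used (movements within community i plus exits from it), and H(P^i) is the entropy of that codebook. A partition in which the walker rarely leaves its current community keeps the first term small, which matches the intuition given above.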

Keyword Detection

In order to determine topical keywords for the detected communities, we retrieved the last 200 tweets for every unprotected account in the 3-core of our sample. This dataset was filtered for tweets by accounts in the 93 communities with more than 100 accounts. Then, we filtered those for tweets posted in the last 7 days. Within this dataset, only 4.4% of the active accounts had posted more than 200 tweets. The collection took about 2 days. Therefore, to avoid having more tweets from accounts that were collected later in the collection period, we cut off the last 2 days of these tweets. This left us with about 455,000 tweets by ca. 20,000 accounts over a period of 5 days (June 9–14, 2019).

The keyword detection process followed the same procedure as described by Münch (2019, p. 227), except for the use of German stop-words from the python-stop-words project¹² and the use of the unfiltered communities instead of their k-cores, due to the already filtered nature of the sample’s 3-core. The keyword detection is based on the chi-square statistic and is common in corpus linguistics (Rayson et al., 2004). The process returns a list of keywords ranked by how significantly being assigned to a group is correlated with the use of these keywords. We keep the top 50 keywords¹³ and filter out keywords that have been used by less than 5%¹⁴ of the respective community.
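A chi-square keyword ranking of this kind can be sketched as follows. This is a simplified illustration and not the procedure of Münch (2019): the tokenization is assumed to have happened already, all function and parameter names are hypothetical, and restricting the ranking to words that are over-represented in the target community is a choice made for this sketch.

```python
from collections import Counter


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def community_keywords(tokens_by_account, community_of, target,
                       stop_words=frozenset(), top_n=50, min_share=0.05):
    """Rank words by how strongly their use is associated with `target`.

    tokens_by_account: dict mapping account -> list of tokens from its tweets.
    community_of: dict mapping account -> community label.
    """
    in_counts, out_counts, users = Counter(), Counter(), Counter()
    members = 0
    for account, tokens in tokens_by_account.items():
        tokens = [t for t in tokens if t not in stop_words]
        if community_of[account] == target:
            members += 1
            in_counts.update(tokens)
            users.update(set(tokens))  # number of accounts using each word
        else:
            out_counts.update(tokens)

    in_total, out_total = sum(in_counts.values()), sum(out_counts.values())
    scored = []
    for word, a in in_counts.items():
        c = out_counts[word]
        # Skip words that are not over-represented in the target community.
        if a * out_total <= c * in_total:
            continue
        scored.append((chi_square_2x2(a, in_total - a, c, out_total - c), word))
    scored.sort(reverse=True)

    top = [word for _, word in scored[:top_n]]
    # Drop keywords used by too small a share of the community's accounts.
    return [w for w in top if users[w] / members >= min_share]
```

The defaults `top_n=50` and `min_share=0.05` mirror the thresholds named in the text. Everything else, including how ties are broken, is an assumption.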

Results

Figure 12 shows the result of the community detection. At first glance, it becomes clear that the Infomap algorithm in most cases still finds communities that align with the force-directed layout (done with Force Atlas 2; Bastian et al., 2009), as would a modularity-maximizing algorithm. Even a simple inspection of the account names (the top 10 accounts by degree of communities with more than 100 active accounts within the 3-core are available in Supplemental Appendix A) revealed that most of the largest communities have a topical focus.

Figure 12.

Central communities in the 3-core of our sample network; colored by largest communities detected with the Infomap community detection algorithm (Rosvall & Bergstrom, 2008; Rosvall et al., 2009); node size represents Page Rank (Brin & Page, 1998); layout done with Force Atlas 2 in Gephi (Bastian et al., 2009); (colored version available online).

This was confirmed by a close reading of the Twitter profiles of the top 10 accounts by degree in the 3-core and triangulated with an interpretation of the keyword analysis. The account names, the keywords, and the tags reflecting our interpretation of the communities with more than 100 active accounts in the analyzed time period can be found in Supplemental Appendix A. A selection in Table 2 demonstrates the topical clarity that enabled us to summarize the keywords and accounts in a topical tag. However, as the “Hard Right” community demonstrates, it is important to stress that belonging to a community in this analysis does not necessarily mean endorsement of its majority’s activities: While most of the top 10 accounts in this community can be identified as members of the German right-wing party AfD or accounts obviously supporting this party, we also find “krone_at,” the account of an Austrian tabloid, and “MSF_austria,” the account of Doctors Without Borders in Austria. For the latter, we could determine that this is likely due to the fact that prominent accounts in the “Hard Right” cluster follow “MSF_austria,” whereas “MSF_austria” does not follow them back.

Table 2.

Selected communities’ keywords (translated), top accounts by in-degree in the 3-core of our sample and our summarizing tags.

Active accounts	Keywords	Top accounts	Tag
2015	e3, stream, xd, e32019, nintendo, twitch, game, crossing, pc, zelda, animal, gameplay, cyberpunk2077, games, switch, xbox, trailer, cyberpunk, gaming, xboxe3, uff, nice, awesome, keanu, pk, live, nen, lol, mega	unge, dagibee, Gronkh, MelinaSophie, LeFloid, iBlali, Taddl, rewinside, HandIOfIBlood, PietSmiet	YouTubers & Gaming
1855	berlin, innen (female suffix), spd (German party), berliner, study, companies, discuss, cdu (party), demand, topics, important, federal government, german, digitisation, topic, has been, annefrank, climate protection, more, june, shows, germany, interview, politics, brandenburg	tazgezwitscher, Die_Gruenen, Tagesspiegel, c_lindner, gutjahr, dunjahayali, sigmargabriel, sixtus, HeikoMaas, spdde	German politics
1414	women’s strike, switzerland, swiss people, glarner, svp (Swiss party), bern, women’s strike2019, zurich, canton, grand, basel, national council, women	NZZ, 20 min, viktorgiacobbo, Blickch, tagesanzeiger, srfnews, MikeMuellerLate, watson_news, migros, srf3	Swiss politics / women’s strike
1044	season, trainer, bundesliga, new arrival, player, dfb, fc, gerest, em, exchanges, estonia, exchange, transfer, goal, victory, team, wm, cup, liga	DFB_Team, FCBayern, ToniKroos, MarioGoetze, esmuellert_, Podolski10, Manuel_Neuer, Bundesliga_DE, ZDFsport, JB17Official	German football
767	övp (Austrian party), spö (Austrian party), fpö (Austrian party), vienna, austria, bierlein, oenr, austria‘s, ibiza, viennese, strache, kickl, turquoise, hofer, parliament, national council, zib2, abdullah, mandate, election campaign, centre, chancellor (female form), heinz, austrian, glyphosate, proposal, blue	florianklenk, sebastiankurz, IngridThurnher, kesslermichael, HannoSettele, vanderbellen, Gawhary, HBrandstaetter, HHumorlos, michelreimon	Austrian politics
376	dessau, roßlau, afd (German (far-)right-wing party), görlitz, islam, raped, migrants, asylum seeker, dangerous person, rejected person, sed, wippel, patriots, greta, niger, hosni, strongest, green, fridayforfuture, african, rosslau, left, gretathunberg, crime, old parties, merkel, radical left, greens, habeck, keep silent, tear apart, vote, girl, refugee, saxony, rape, maas, citizen, islamistic, sexual	DonJoschi, AfD, MSF_austria, Alice_Weidel, SteinbachErika, Joerg_Meuthen, Beatrix_vStorch, GrumpyMerkel, krone_at	hard right / xenophobia / migration / refugees

Finally, the summary of this test study is depicted in the community graph in Figure 13 which contains the tags summarizing our interpretation of the keywords and top accounts. If we could not find a clear interpretation, the community is tagged as “Group of” the account with the highest degree in the 3-core. While many communities and their connections are filtered out for clarity and only the largest, most active communities and strongest connections between them remain, it gives a useful bird’s-eye view on the structure of the analyzed network, which, according to our results above, represents the influential core of the German Twittersphere. As such, it exhibits intuitively sensible patterns at first sight: Swiss Politics is strongly connected with Swiss Sports and vice versa; Hard Right, and Digital Rights Culture appear as satellites of the dominant German Politics community; Porn is remote from most communities and follows more than it is followed; and YouTubers & Gamers are connected to the rest of the network mostly through Entertainment. In short, this result resembles the results for the Australian Twittersphere by Münch (2019, Chapter 6) and provides a good overview of the influential core of the German-speaking Twittersphere.

Figure 13.

Community graph of communities in the 3-core of our sample with over 300 accounts, at least 80 active accounts during the examined timeframe, and edges with a weight of at least 150; edge width represents weight; edge direction follows clockwise curvature; edges colored by source node; node size represents the number of accounts in each community; node colors correspond with Figure 12; node labels based on interpretation of keywords and top accounts (see Supplemental Material); (colored version available online).
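A community graph of this kind can be derived from the follow network by aggregating edges between communities. The Python sketch below assumes that the weight of an edge from one community to another is the number of follow edges pointing from members of the first to members of the second. The article does not spell out this definition, so this reading, along with the function name and data layout, is an assumption.

```python
from collections import Counter


def community_graph(edges, community_of, min_weight=150):
    """Aggregate follow edges into weighted, directed inter-community edges.

    edges: iterable of (follower, followed) account pairs.
    community_of: dict mapping account -> community label; accounts
                  without a label are ignored.
    min_weight: inter-community edges below this weight are dropped.
    """
    weights = Counter()
    for follower, followed in edges:
        src = community_of.get(follower)
        dst = community_of.get(followed)
        # Skip unlabeled accounts and edges within a single community.
        if src is None or dst is None or src == dst:
            continue
        weights[(src, dst)] += 1
    return {pair: w for pair, w in weights.items() if w >= min_weight}
```

Because the aggregation keeps direction, asymmetries such as a community that follows others more than it is followed remain visible in the result.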

Conclusion

Summarizing the methods and results described in detail in the section above:

We have adapted the rank degree method in a way that makes it practically useful for a small team in order to gather Twitter follow networks of most influential accounts that use a certain interface language using the cost-free Twitter standard APIs;

We provide evidence that a network sample collected with this method exhibits activity and follower numbers that are orders of magnitude higher than the average of this language domain;

We provide evidence that the influencer sample, that is, a subsample of accounts with at least one incoming edge within our sample, represents less than 10% (likely much less) of the whole population, but reaches 85% of accounts in a random test sample of German-using accounts with more than one friend;

We show that accounts within the influencer sample exhibit substantially higher reach than accounts in a random baseline sample of the same size;

We provide evidence that for an average German account 40% of its friends are in our influencer sample; therefore, given the higher activity of the sample, likely more than 40% of an average German Twitter account’s timeline is produced by our sample.

Altogether, this lets us conclude that the adapted sampling technique is able to approximate the set of most influential accounts, based on the follow network, within a language-based Twittersphere. In comparison with the original rank degree method, our adaptation of the method is optimized to be parallelizable and efficient concerning API calls and can therefore be used by small research teams and social media professionals. Furthermore, we are confident that with some adaptations it is also suitable to mine not only Twitter follow networks but also comparable platforms that have a subscription network.

As our test study demonstrates, the data retrieved with this method enable a researcher to conduct research projects that hitherto relied on much larger datasets and data collections. Bruns et al. (2017), Bruns and Enli (2018), and Münch (2019) all relied on the full follow network that was collected based on the global Twitter account collection by TrISMA. While they filtered a complete dataset down to a manageable or useful set of likely influential accounts, our method restricts the collection to those accounts in the first place. It produces a comparatively small sample of connections in the follow network based on a random sample of accounts,¹⁵ leading to a backbone network of the most central, and thus likely the most influential, accounts. As a positive and important side effect, this also leads to fewer ethical issues, as it restricts the data collection mostly to accounts that are already popular on Twitter and therefore more likely to be aware of the public availability of their data.

Despite the smaller scale of the produced dataset, we were able to retrieve meaningful, comparable results to the three studies above, using and triangulating their methods. While the focus of this article is on the sampling method and we do not dive deeper into the theoretical implications of the test study, it is clear that the test study presents a fertile ground for theory development comparable to Bruns and Highfield (2016), for example, or discourse analysis as done by Dehghan (2018). We want to especially highlight the observed similarities between our representation of the overall structure of the German Twittersphere and the Australian Twittersphere as drawn by Bruns et al. (2017) and Münch (2019, pp. 237–238), which hints at an untapped potential for international comparative communication and (social) media studies.

Outlook

This study provides evidence that our adaptation of the rank degree method enables drawing a representative language-based sample of influential Twitter accounts. However, while we are confident that the method collects the most influential accounts despite our adaptations, other properties, for example, the exact ranking of these accounts by different centrality measures as well as community structures, might be preserved differently than in the original method. Therefore, this method still has to be tested with known networks, such as the Australian or Norwegian Twittersphere, to ensure that the centrality-preserving qualities of the original rank degree method do not suffer from our adaptations. In particular, the application to a directed network, as well as the non-dynamic handling of degree and ranking in the sampling process, might lead to significant differences in the sample quality.

Nevertheless, now that the practical feasibility of the method is proven, the quality of the sample can be assessed more thoroughly, especially regarding the question of how coverage and reach change with the sample size, a most important question for researchers and social media professionals.

Moreover, for the presented version of the method, we still required a high number of initial random seeds for the algorithm to work, and this seed collection is generally not given. Thus, a further development of our prototype includes the generation and growth of the seed pool. Ideas include sampling via a keyword search for common words in the language or regarding the topic of interest and a “snowballing” approach, where the latest 5,000 connections of the collected nodes are stored as seeds.¹⁶ Such an implementation needs to be tested for representativity and comparability with the current approach, as the seed pool itself is not random anymore (Münch & Rossi, 2020).

As Twitter made the interface language of an account a private property at the end of this project, the method has been adapted to work with the tweet language instead. Whether the language detection provided by Twitter suffices for this remains to be tested.

Finally, further avenues of enquiry regarding this form of sampling include the collection and comparison of language-based Twitterspheres other than German, and its development to social media mining approaches that are based on topical instead of language-based criteria (Münch & Thies, 2020).

Supplemental Material

sj-pdf-1-sms-10.1177_2056305120984475 – Supplemental material for Walking Through Twitter: Sampling a Language-Based Follow Network of Influential Twitter Accounts

Supplemental material, sj-pdf-1-sms-10.1177_2056305120984475 for Walking Through Twitter: Sampling a Language-Based Follow Network of Influential Twitter Accounts by Felix Victor Münch, Ben Thies, Cornelius Puschmann and Axel Bruns in Social Media + Society

Footnotes

We want to acknowledge the postdoctoral research network Algorithmed Public Spheres at the Leibniz-Institute for Media Research | Hans-Bredow-Institut (HBI), the TrISMA project, and our contributing colleagues at the HBI, namely, Wiebke Loosen, Christiane Matzen, Jan-Hinrik Schmidt, and Johanna Sebauer, for their trust, interest, and support for the collaborative data collection effort needed to realize this project.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Supplemental Material

Supplemental material for this article is available online.

Author Biographies

Felix Victor Münch works as a postdoc at the Social Media Observatory of the Leibniz-Institute for Media Research | Hans-Bredow-Institut in Hamburg. With a PhD from the Digital Media Research Centre at QUT (Brisbane, Australia), an MA in Journalism (LMU and German Journalist School, Munich, Germany), a BSc in Physics (LMU, Munich, Germany), and work experience in online media brand communication as an online media concepter, user experience designer, and strategist, he is most likely a computational social scientist by now. Currently, he focusses on network science, social media, and theories regarding the public sphere.

Ben Thies (BA, Zeppelin University Friedrichshafen) is a graduate student of statistics at Humboldt-Universität zu Berlin, Freie Universität Berlin, and Technische Universität Berlin. During this study, he was a research assistant at the Leibniz-Institute for Media Research | Hans-Bredow-Institut in Hamburg. Now he works at the Mercator Research Institute on Global Commons and Climate Change. His main interests lie in human behavior and (online) social networks.

Cornelius Puschmann is professor of media and communication at ZeMKI, University of Bremen, and an affiliate researcher at the Leibniz Institute for Media Research, as well as the author of a popular German-language introduction to content analysis with R. His interests include digital media usage, online aggression, the role of algorithms for the selection of media content, and automated content analysis.

Axel Bruns is a professor in the Digital Media Research Centre at Queensland University of Technology in Brisbane, Australia, and a chief investigator in the ARC Centre of Excellence for Automated Decision-Making and Society. His books include Are Filter Bubbles Real? (2019) and Gatewatching and News Curation: Journalism, Social Media, and the Public Sphere (2018). He served as President of the Association of Internet Researchers in 2017–2019. He tweets at @snurb_dot_info.

References

Ausserhofer

Maireder

(2013). National politics on Twitter. Information, Communication & Society, 16(3), 291–314. https://doi.org/10.1080/1369118X.2012.756050

Barberá

(2015). Birds of the same feather tweet together: Bayesian ideal point estimation using Twitter data. Political Analysis, 23(1), 76–91. https://doi.org/10.1093/pan/mpu011

Bastian

Heymann

Jacomy

(2009, May 17–20). Gephi: An open source software for exploring and manipulating networks [Conference session]. Proceedings of the Third International ICWSM Conference. https://www.aaai.org/ocs/index.php/ICWSM/09/paper/viewFile/154/1009

Blank

(2017). The digital divide among Twitter users and its implications for social research. Social Science Computer Review, 35(6), 679–697. https://doi.org/10.1177/0894439316671698

Brin

Page

(1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1), 107–117. https://doi.org/10.1016/S0169-7552(98)00110-X

Bruns

Burgess

Banks

Tjondronegoro

Dreiling

Hartley

Leaver

Aly

Highfield

Wilken

Rennie

Lusher

Allen

Marshall

Demetrious

Sadkowsky

(2016). TrISMA: Tracking infrastructure for Social Media Analysis. https://trisma.org/

Bruns

Enli

(2018). The Norwegian Twittersphere: Structure and dynamics. Nordicom Review, 39(1), 129–148. https://doi.org/10.2478/nor-2018-0006

Bruns

Highfield

(2016, July). Is Habermas on Twitter? Social media and the public sphere. In Enli

Bruns

Larsson

A. O.

Skogerbo

Christensen

(Eds.), The Routledge companion to social media and politics (pp. 56–73). Routledge. https://eprints.qut.edu.au/91810/

Bruns

Moon

Münch

F. V.

Sadkowsky

(2017). The Australian Twittersphere in 2016: Mapping the follower/followee network. Social Media + Society, 3(4), 1–15. https://doi.org/10.1177/2056305117748162

10.

Cha

Haddadi

Benevenuto

Gummadi

K. P.

(2010, May 23–26). Measuring user influence in Twitter: The million follower fallacy [Conference session]. International AAAI Conference on Weblogs and Social Media. https://www.aaai.org/ocs/index.php/ICWSM/ICWSM10/paper/viewPaper/1538

11.

Colleoni

Rozza

Arvidsson

(2014). Echo chamber or public sphere? Predicting political orientation and measuring political homophily in Twitter using big data. Journal of Communication, 64(2), 317–332. https://doi.org/10.1111/jcom.12084

12.

Conway

B. A.

Kenski

Wang

(2015). The rise of Twitter in the political campaign: Searching for intermedia agenda-setting effects in the presidential primary. Journal of Computer-Mediated Communication, 20(4), 363–380. https://doi.org/10.1111/jcc4.12124

13.

Coscia

Giannotti

Pedreschi

(2011). A classification for community discovery methods in complex networks. Statistical Analysis and Data Mining, 4(5), 512–546. https://doi.org/10.1002/sam.10133

14.

Coscia

Rossi

(2018, December 10–13). Benchmarking API costs of network sampling strategies [Conference session]. Proceedings of the 2018 IEEE International Conference on Big Data, Big Data. https://doi.org/10.1109/BigData.2018.8622486

15.

Dehghan

(2018, July). A year of discursive struggle over freedom of speech on Twitter: What can a mixed-methods approach tell us? [Conference session]. Proceedings of the 9th International Conference on Social Media and Society. https://doi.org/10.1145/3217804.3217926

16.

Fletcher

Nielsen

R. K.

(2017). Are news audiences increasingly fragmented? A cross-national comparative analysis of cross-platform news audience fragmentation and duplication. Journal of Communication, 67(4), 476–498. https://doi.org/10.1111/jcom.12315

17.

Frees

Koch

(2018). ARD/ZDF-Onlinestudie 2018: Zuwachs bei medialer Internetnutzung und Kommunikation [ARD/ZDF online study 2018: Increase of media-based internet use and communication]. Media Perspektiven, 9, 398–413. http://www.ard-zdf-onlinestudie.de/files/2018/0918_Frees_Koch.pdf

18.

Geenen

D. V.

Boeschoten

Hekman

Bakker

Moons

(2016, October 5–8). Mining one week of Twitter. Mapping networked publics in the Dutch Twittersphere [Conference session]. Selected Papers of AoIR 2016. https://spir.aoir.org/ojs/index.php/spir/article/view/8733

19.

Habermas

(1962). Strukturwandel der Öffentlichkeit—Untersuchungen zu einer Kategorie der bürgerlichen Gesellschaft [The structural transformation of the public sphere: An inquiry into a category of bourgeois society] (1990th ed.). Suhrkamp.

20.

Heidler

Gamper

Herz

Eßer

(2014). Relationship patterns in the 19th century: The friendship network in a German boys’ school class from 1880 to 1881 revisited. Social Networks, 37, 1–13. https://doi.org/10.1016/J.SOCNET.2013.11.001

21.

Himelboim

Smith

M. A.

Rainie

Shneiderman

Espina

(2017). Classifying Twitter topic-networks using social network analysis. Social Media + Society, 3(1), 1–38. https://doi.org/10.1177/2056305117691545

22.

Iyengar

Van den Bulte

Valente

T. W.

(2010). Opinion leadership and social contagion in new product diffusion. Marketing Science, 30(2), 195–212. https://doi.org/10.1287/mksc.1100.0566

23.

McQuail

(2010). Mass communication theory (6th ed.). SAGE.

24.

Mellon

Prosser

(2017). Twitter and Facebook are not representative of the general population: Political attitudes and demographics of British social media users. Research & Politics, 4(3). Advance online publication. https://doi.org/10.1177/2053168017720008

25.

Münch

F. V.

(2019). Measuring the networked public—Exploring network science methods for large scale online media studies [PhD thesis, Queensland University of Technology]. https://doi.org/10.5204/thesis.eprints.125543

26.

Münch

F. V.

Rossi

(2020). Bootstrapping follow networks of influential Twitter accounts [Online poster presentation]. IC2S2. https://vimeo.com/431470176

27.

Münch

F. V.

Thies

(2020). RADICES (RAnk Degree Influencer CorE Sampler)—An efficient sampler for influential Twitter follow networks [Online presentation]. ICA 2020. https://vimeo.com/418025499

28.

Myers

S. A.

Sharma

Gupta

Lin

(2014, April). Information network or social network? The structure of the Twitter follow graph [Conference session]. WWW’14 Companion. https://doi.org/10.1145/2567948.2576939

29.

Pasquay

(2018). Die deutschen Zeitungen in Zahlen und Daten 2018 [The German newspapers in figures and data 2018]. Bundesverband Deutscher Zeitungsverleger. https://epub.sub.uni-hamburg.de/epub/volltexte/2018/76585/pdf/ZDF_2018.pdf

30.

Puschmann & Burgess (2014). The politics of Twitter data. In Weller, Bruns, Burgess, Puschmann, & Mahrt (Eds.), Twitter and society (pp. 43–54). Peter Lang. https://eprints.qut.edu.au/67127/

31.

Rayson, Berridge, & Francis (2004, March 10). Extending the Cochran rule for the comparison of word frequencies between corpora [Conference session]. 7th International Conference on Statistical Analysis of Textual Data. http://eprints.lancs.ac.uk/12424/

32.

Rogers (2013, May). Debanalizing Twitter: The transformation of an object of study [Conference session]. Proceedings of the 5th Annual ACM Web Science Conference (WebSci ’13). https://doi.org/10.1145/2464464.2464511

33.

Rosvall, Axelsson, & Bergstrom, C. T. (2009). The map equation. The European Physical Journal Special Topics, 178(1), 13–23. https://doi.org/10.1140/epjst/e2010-01179-1

34.

Rosvall & Bergstrom, C. T. (2008). Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences of the United States of America, 105(4), 1118–1123. https://doi.org/10.1073/pnas.0706851105

35.

Salamanos, Voudigari, & Yannakoudakis, E. J. (2017a). Deterministic graph exploration for efficient graph sampling. Social Network Analysis and Mining, 7(1), 24. https://doi.org/10.1007/s13278-017-0441-6

36.

Salamanos, Voudigari, & Yannakoudakis, E. J. (2017b). A graph exploration method for identifying influential spreaders in complex networks. Applied Network Science, 2(1), 26. https://doi.org/10.1007/s41109-017-0047-y

37.

Salamanos, Voudigari, & Yannakoudakis, E. J. (2017c). Identifying influential spreaders by graph sampling. In Cherifi, Gaito, Quattrociocchi, & Sala (Eds.), Studies in computational intelligence (Vol. 693, pp. 111–122). Springer. https://doi.org/10.1007/978-3-319-50901-3_9

38.

Scheufele, D. A., & Tewksbury (2007). Framing, agenda setting, and priming: The evolution of three media effects models. Journal of Communication, 57(1), 9–20. https://doi.org/10.1111/j.0021-9916.2007.00326.x

39.

Staudt, C. L., Sazonovs, & Meyerhenke (2016). NetworKit: A tool suite for large-scale complex network analysis. Network Science, 4(4), 508–530. https://doi.org/10.1017/nws.2016.20

40.

Tewksbury (2005). The seeds of audience fragmentation: Specialization in the use of online news sites. Journal of Broadcasting & Electronic Media, 49(3), 332–348. https://doi.org/10.1207/s15506878jobem4903_5

41.

Tornes (2017, November 14). Introducing Twitter premium APIs. Twitter Developer Blog. https://blog.twitter.com/developer/en_us/topics/tools/2017/introducing-twitter-premium-apis.html

42.

Valenzuela, Puente, & Flores, P. M. (2017). Comparing disaster news on Twitter and television: An intermedia agenda setting perspective. Journal of Broadcasting & Electronic Media, 61(4), 615–637. https://doi.org/10.1080/08838151.2017.1344673

43.

von Nordheim, Boczek, & Koppers (2018). Sourcing the sources: An analysis of the use of Twitter and Facebook as a journalistic source over 10 years in The New York Times, The Guardian, and Süddeutsche Zeitung. Digital Journalism, 6(7), 807–828. https://doi.org/10.1080/21670811.2018.1490658

44.

Voudigari, Salamanos, Papageorgiou, & Yannakoudakis, E. J. (2016, August 18–21). Rank degree: An efficient algorithm for graph sampling [Conference session]. Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM. https://doi.org/10.1109/ASONAM.2016.7752223

45.

Watts, D. J., & Dodds, P. S. (2007). Influentials, networks, and public opinion formation. Journal of Consumer Research, 34(4), 441–458. https://doi.org/10.1086/518527

46.

Webster, J. G., & Ksiazek, T. B. (2012). The dynamics of audience fragmentation: Public attention in an age of digital media. Journal of Communication, 62(1), 39–56. https://doi.org/10.1111/j.1460-2466.2011.01616.x

47.

Wojcik & Hughes (2019, April 24). Sizing up Twitter users. Pew Research Center. https://www.pewinternet.org/2019/04/24/sizing-up-twitter-users/

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
