vignettes/centrality_maesures_update.Rmd
The new release of ConnectR provides some aesthetic upgrades and built-in calculation of node centrality measures. These upgrades provide powerful tools to measure networks with, and give users more flexibility of approach.
With great power comes great responsibility! It is expected, for example, that by the time users come to create a centrality network they know whether their network should be treated as directed or undirected, and what this choice means for each centrality measure. For example, if a network is directed and acyclic (no path leads from a node back to itself) then Eigenvector Centrality is not meaningful and will be 0 for every node. In undirected networks, `degree_in` and `degree_out` collapse into `degree`, the total number of edges attached to a node.
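To make the degree point concrete, here is a minimal sketch using igraph directly on a made-up edge list. It is not ConnectR code; the toy data and object names are assumptions for illustration only.

```r
library(igraph)

# Toy retweet-style edge list: every edge points from a retweeter to an author.
toy_edges <- data.frame(
  from = c("a", "b", "c", "c"),
  to   = c("hub", "hub", "hub", "other")
)

g_directed   <- graph_from_data_frame(toy_edges, directed = TRUE)
g_undirected <- graph_from_data_frame(toy_edges, directed = FALSE)

# No node can be reached from itself, so the directed graph is acyclic.
acyclic <- is_dag(g_directed)

# Directed: in-degree and out-degree are tracked separately.
deg_in  <- degree(g_directed, mode = "in")
deg_out <- degree(g_directed, mode = "out")

# Undirected: a single degree, the number of edges touching each node.
deg_all <- degree(g_undirected)
```

In this toy network the undirected degree of every node equals its in-degree plus its out-degree, which is the sense in which the two measures collapse into one.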
For a primer on centrality measures for networks, see: Wikipedia Centrality Measures
Let’s load the libraries we will need and get some data:
library(ConnectR)
library(tidyverse)
rts <- retweet_example %>%
janitor::clean_names()
The first major change to the workflow is the `calculate_centrality()` function. The function takes in a dataset, asks for a `from` and a `to` variable, asks whether the network should be treated as `directed`, and what the `damping` factor should be for the PageRank algorithm.
A quick word on directed vs undirected networks. Loosely speaking, a directed network is a network in which the edges - or the connections between nodes - are asymmetrical, i.e. if user A retweets user B, but user B does not retweet user A, there is a path from A –> B but there is not a path from B –> A. A retweet network is therefore directed, as is a mention network.
An undirected network is a network where the edges are symmetrical, i.e. if A is friends with B on Facebook, then B is also friends with A. Therefore, the edge that connects them creates a path from A –> B and a path from B –> A.
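The difference can be checked on a single edge. The sketch below uses igraph rather than ConnectR, and the one-row edge list is invented for illustration.

```r
library(igraph)

# A retweets B: one edge, pointing from A to B.
edge <- data.frame(from = "A", to = "B")

g_directed   <- graph_from_data_frame(edge, directed = TRUE)
g_undirected <- graph_from_data_frame(edge, directed = FALSE)

# Shortest path lengths following edge direction; Inf means "no path".
a_to_b <- distances(g_directed, v = "A", to = "B", mode = "out")[1, 1]
b_to_a <- distances(g_directed, v = "B", to = "A", mode = "out")[1, 1]

# In the undirected version the same edge can be travelled both ways.
b_to_a_undirected <- distances(g_undirected, v = "B", to = "A")[1, 1]
```

Here `a_to_b` is 1 and `b_to_a` is infinite in the directed graph, while the undirected graph gives a distance of 1 in both directions.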
You will see a warning about ‘nobigint’, a deprecated argument used in the betweenness calculation. Notice it and then ignore it :).
(rt_centrality <- rts %>%
calculate_centrality(from = sender, to = original_author, directed = TRUE))
#> Warning in betweenness(graph = graph, v = V(graph), directed = directed, :
#> 'nobigint' is deprecated since igraph 1.3 and will be removed in igraph 1.4
#> Warning in eigen_centrality(graph = .G(), directed = directed, scale = scale, :
#> At core/centrality/centrality_other.c:329 : graph is directed and acyclic;
#> eigenvector centralities will be zeros.
#> $edges
#> # A tibble: 921 × 13
#> from to n_retweets degree_in_to degree_out_to betweenness_to page_rank_to
#> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 "–ë–… Rona… 1 245 0 0 0.135
#> 2 "–í–… Rona… 1 245 0 0 0.135
#> 3 "–ü–… Rona… 1 245 0 0 0.135
#> 4 "–û–… Rona… 1 245 0 0 0.135
#> 5 "‚òÅ… mark… 1 60 0 0 0.0333
#> 6 "(A.… Rona… 1 245 0 0 0.135
#> 7 "‡§Ö… stpi… 1 130 0 0 0.0581
#> 8 "‡§Ö… Omka… 1 123 0 0 0.0539
#> 9 "‡§Ö… stpi… 1 130 0 0 0.0581
#> 10 "‡§µ… mark… 1 60 0 0 0.0333
#> # … with 911 more rows, and 6 more variables: eigen_to <dbl>,
#> # degree_in_from <dbl>, degree_out_from <dbl>, betweenness_from <dbl>,
#> # page_rank_from <dbl>, eigen_from <dbl>
#>
#> $user_stats
#> # A tibble: 879 × 8
#> name degree_in degree_out betweenness page_rank eigen user_retweets_o…
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 –ë–æ–≥–¥–∞… 0 1 0 0.000656 0 1
#> 2 Ronald_van… 245 0 0 0.135 0 NA
#> 3 –í–ª–∞–¥–∏… 0 1 0 0.000656 0 1
#> 4 –ü–∞–≤–µ–ª… 0 1 0 0.000656 0 1
#> 5 –û–ª–µ–≥ –… 0 1 0 0.000656 0 1
#> 6 ‚òÅÔ∏èStev… 0 1 0 0.000656 0 1
#> 7 markrussin… 60 0 0 0.0333 0 NA
#> 8 (A.R.A.)® 0 1 0 0.000656 0 1
#> 9 ‡§Ö‡§Æ‡§® … 0 1 0 0.000656 0 1
#> 10 stpiindia 130 0 0 0.0581 0 NA
#> # … with 869 more rows, and 1 more variable: user_gets_retweeted <int>
The function returns a list of 2 items: `edges` and `user_stats`. `edges` will be plucked and passed on to the `viz_centrality_network()` or `create_supercluster()` functions; `user_stats` is a data frame providing author-level metrics, useful for creating a custom score or for general exploration.
If we `purrr::pluck()` our edges, we can take a closer look at what the `calculate_centrality()` function has done for us:
edges <- rt_centrality %>% pluck('edges')
glimpse(edges)
#> Rows: 921
#> Columns: 13
#> $ from <chr> "–ë–æ–≥–¥–∞–Ω –ó—É–±–µ—Ü", "–í–ª–∞–¥–∏–º–∏—Ä –ò–≤–∞–Ω…
#> $ to <chr> "Ronald_vanLoon", "Ronald_vanLoon", "Ronald_vanLoon",…
#> $ n_retweets <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ degree_in_to <dbl> 245, 245, 245, 245, 60, 245, 130, 123, 130, 60, 123, …
#> $ degree_out_to <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ betweenness_to <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ page_rank_to <dbl> 0.1352045763, 0.1352045763, 0.1352045763, 0.135204576…
#> $ eigen_to <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ degree_in_from <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ degree_out_from <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ betweenness_from <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ page_rank_from <dbl> 0.0006564695, 0.0006564695, 0.0006564695, 0.000656469…
#> $ eigen_from <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
We now have columns named `from` and `to`, which tell us the direction of the interaction, i.e. `from` retweeted `to`. `n_retweets` tells us how many times `from` has retweeted `to` in our dataset, and we have centrality measures for `to` and centrality measures for `from`. This is a potentially confusing output, so why have we done it this way?
A user may want to reduce the size of the network for a variety of reasons, and it is important to be able to use centrality metrics to filter on `from` and `to`.
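As a sketch of what such filtering might look like, the example below applies a dplyr filter to a small made-up table. The column names follow the output above, but the values and the threshold are assumptions chosen for illustration.

```r
library(dplyr)

# Toy stand-in for the edges table; the values are made up.
toy_edges <- tibble(
  from            = c("a", "b", "c", "d"),
  to              = c("hub", "hub", "minor", "hub"),
  n_retweets      = c(1L, 2L, 1L, 1L),
  degree_in_to    = c(3, 3, 1, 3),
  degree_out_from = c(1, 1, 1, 1)
)

# Keep only edges that point at well-retweeted accounts.
# The threshold of 2 is arbitrary.
filtered_edges <- toy_edges %>%
  filter(degree_in_to >= 2)
```

Because each row carries metrics for both ends of the edge, the same pattern works for conditions on the `_from` columns.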
So what do our other new columns mean? In each name, x stands for `to` or `from`.
`degree_in_x` = the number of incoming edges for a user, i.e. how many other users retweeted them
`degree_out_x` = the number of outgoing edges for a user, i.e. how many other users they retweeted
`betweenness_x` = the number of shortest paths between two other nodes a user figures in
`page_rank_x` = roughly the proportion of times a user travelling around the network at random would end up at a node
`eigen_x` = the Eigenvector Centrality score for a node. Roughly speaking, this estimates a node’s influence by considering the influence of its connections too. The highest-scoring node will always be 1.
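For intuition, the same kinds of measures can be computed directly with igraph on a toy network. This is only a sketch: the edge list is invented, and nothing here should be read as how `calculate_centrality()` is implemented internally.

```r
library(igraph)

# Toy directed network containing cycles, plus "d" who only retweets.
toy_edges <- data.frame(
  from = c("a", "b", "c", "a", "d"),
  to   = c("b", "c", "a", "c", "a")
)
g <- graph_from_data_frame(toy_edges, directed = TRUE)

# One row per node, one column per measure.
node_stats <- data.frame(
  name        = V(g)$name,
  degree_in   = degree(g, mode = "in"),
  degree_out  = degree(g, mode = "out"),
  betweenness = betweenness(g, directed = TRUE),
  page_rank   = page_rank(g, directed = TRUE, damping = 0.85)$vector,
  eigen       = eigen_centrality(g, directed = TRUE)$vector
)
```

The PageRank scores sum to 1 across nodes, and the Eigenvector Centrality scores are scaled so that the largest is 1. Node "d" has no incoming edges, so it cannot sit on a path between two other nodes and its betweenness is 0.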
We can add variables such as `followers` back to our dataset if we want to use them as size or colour when we visualise the network. It’s important to note that in the case of `followers`, we have information on the `from` variable, i.e. the person who is retweeting, not the person who is being retweeted.
We can use our new `followers` variable to get the mean follower count of the users who retweeted each original_author like so:
tmp_join <- rts %>% select(sender, followers)
edges <- edges %>% left_join(tmp_join, by = c("from" = "sender"))
edges <- edges %>%
group_by(to) %>%
#Get the mean of the followers who have retweeted a user.
mutate(mean_followers = mean(followers, na.rm = TRUE)) %>%
ungroup()
rm(tmp_join)
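One thing to watch with a join like this: if a sender appears more than once in the raw data, `left_join()` returns one row per match, so the corresponding edge is repeated. The sketch below shows the effect on made-up data and one possible way to avoid it. Keeping the largest follower count per sender is an assumption for illustration, not a recommendation from the package.

```r
library(dplyr)

# Toy data: sender "a" appears twice in the raw posts, with differing counts.
toy_rts <- tibble(
  sender    = c("a", "a", "b"),
  followers = c(100, 120, 50)
)
toy_edges <- tibble(from = c("a", "b"), to = c("hub", "hub"))

# Joining the raw rows repeats the edge for "a" once per matching post.
n_rows_raw_join <- nrow(left_join(toy_edges, toy_rts, by = c("from" = "sender")))

# Reduce to one row per sender first (here: the largest follower count seen),
# so the join keeps exactly one row per edge.
followers_by_sender <- toy_rts %>%
  group_by(sender) %>%
  summarise(followers = max(followers), .groups = "drop")

edges_with_followers <- toy_edges %>%
  left_join(followers_by_sender, by = c("from" = "sender"))
```

Comparing `nrow()` before and after a join is a quick way to check whether edges have been duplicated.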
It’s at this point that the user should determine which metrics will figure as node size and node colour. Again, the onus is very much on the user to interpret the statistics they now find themselves with. Given the complexity of network analysis, there is not currently an algorithmic approach to deciding which metrics should be focused on - it is dependent on the type of network being analysed, and the information the user is trying to convey.
It’s a good idea to use standard data exploration/visualisation techniques to investigate which statistics may be useful for size and colour. You can create histograms, boxplots, and other standard charts for each variable, but for now we will take a quick look at some summary statistics.
summary(edges)
#> from to n_retweets degree_in_to
#> Length:7186 Length:7186 Min. : 1.000 Min. : 1.00
#> Class :character Class :character 1st Qu.: 1.000 1st Qu.: 1.00
#> Mode :character Mode :character Median : 1.000 Median : 2.00
#> Mean : 1.321 Mean : 24.01
#> 3rd Qu.: 1.000 3rd Qu.: 7.00
#> Max. :15.000 Max. :245.00
#> degree_out_to betweenness_to page_rank_to eigen_to degree_in_from
#> Min. :0 Min. :0 Min. :0.0006648 Min. :0 Min. :0
#> 1st Qu.:0 1st Qu.:0 1st Qu.:0.0006648 1st Qu.:0 1st Qu.:0
#> Median :0 Median :0 Median :0.0012228 Median :0 Median :0
#> Mean :0 Mean :0 Mean :0.0122872 Mean :0 Mean :0
#> 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0.0031758 3rd Qu.:0 3rd Qu.:0
#> Max. :0 Max. :0 Max. :0.1352046 Max. :0 Max. :0
#> degree_out_from betweenness_from page_rank_from eigen_from
#> Min. : 1.00 Min. :0 Min. :0.0006565 Min. :0
#> 1st Qu.:67.00 1st Qu.:0 1st Qu.:0.0006565 1st Qu.:0
#> Median :67.00 Median :0 Median :0.0006565 Median :0
#> Mean :55.95 Mean :0 Mean :0.0006565 Mean :0
#> 3rd Qu.:67.00 3rd Qu.:0 3rd Qu.:0.0006565 3rd Qu.:0
#> Max. :67.00 Max. :0 Max. :0.0006565 Max. :0
#> followers mean_followers
#> Min. : 0 Min. : 4
#> 1st Qu.: 2053 1st Qu.: 2569
#> Median : 2639 Median : 2597
#> Mean : 4039 Mean : 4039
#> 3rd Qu.: 2911 3rd Qu.: 2609
#> Max. :643702 Max. :643702
Looking at our selection of edges, it would appear that degree_in_to (number of retweets a user has received) and page_rank_to would suit a first visualisation; these two are a good guide for the type of directed networks we encounter.
We will make a static, directed graph, where we label any node which is in the top 10% of our size or colour variables:
edges %>%
viz_centrality_network(colour = page_rank_to, size = degree_in_to, directed = TRUE, type = "static", label_prop = 0.1)
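How the labels are chosen internally is not shown here. One plausible reading of `label_prop = 0.1` is a quantile cutoff, sketched below in base R on made-up values. The object names and the exact rule are assumptions, not the package's documented behaviour.

```r
# Made-up size values for ten nodes.
node_size <- c(n1 = 1, n2 = 1, n3 = 2, n4 = 2, n5 = 3,
               n6 = 5, n7 = 8, n8 = 13, n9 = 60, n10 = 245)

label_prop <- 0.1

# Value that separates the top 10% from the rest.
cutoff <- quantile(node_size, probs = 1 - label_prop)

# Nodes at or above the cutoff would be labelled.
labelled <- names(node_size)[node_size >= cutoff]
```

With heavily skewed metrics such as degree, a small proportion like this usually picks out only the handful of dominant accounts.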
Now we will create an interactive network, using the same size and colour variables as we did for the static network:
edges %>%
viz_centrality_network(colour = page_rank_to, size = degree_in_to, directed = TRUE, type = "interactive", physics = TRUE, width = "700px", height = "600px")
Notice that the nodes are flying around, as if in a mini-universe of interacting objects. This is because we have set physics to TRUE (the movement stops eventually, so depending on how long it took you to get here, the nodes may already have settled). Without physics, our interactive graphs will sometimes not settle on an appropriate layout: if you turn it off, you may find nodes clustered tightly and not ‘spreading out’.
We will also add a title and some explanatory text as a subtitle. See `?visNetwork::visNetwork()` for additional arguments.
edges %>%
viz_centrality_network(colour = page_rank_to, size = degree_in_to, directed = TRUE, type = "interactive", physics = FALSE, main = "ConnectR Centrality Release Interactive Plot",
submain = "We can see that Ronald_vanLoon is an influential node in the network, as well as Omkar_Raii & stpiindia", width = "700px", height = "600px")
Where possible, it is generally not advised to convert a directed network to an undirected network. However, for practical reasons modelling a directed network as an undirected network can be useful - as long as the user understands what information was lost in the process.
If converting a retweet network to an undirected network, we lose information on who interacted with whom; all we know is that there was an interaction between two users. However, we gain access to Eigenvector Centrality as a measure of centrality, which will take into account not just a node’s number of connections, but how influential those connections are themselves. We can also approximate betweenness for users who have not retweeted other users.
(undirected <- rts %>% calculate_centrality(from = sender, to = original_author, directed = FALSE))
#> Warning in betweenness(graph = graph, v = V(graph), directed = directed, :
#> 'nobigint' is deprecated since igraph 1.3 and will be removed in igraph 1.4
#> $edges
#> # A tibble: 921 × 11
#> from to n_retweets degree_to betweenness_to eigen_to page_rank_to
#> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 "–ë–æ–≥–¥–∞–… Rona… 1 245 139446. 1 0.127
#> 2 "–í–ª–∞–¥–∏–… Rona… 1 245 139446. 1 0.127
#> 3 "–ü–∞–≤–µ–ª … Rona… 1 245 139446. 1 0.127
#> 4 "–û–ª–µ–≥ –ò… Rona… 1 245 139446. 1 0.127
#> 5 "‚òÅÔ∏èSteve… mark… 1 60 3232. 0 0.0313
#> 6 "(A.R.A.)¬Æ" Rona… 1 245 139446. 1 0.127
#> 7 "‡§Ö‡§Æ‡§® ‡… stpi… 1 130 74475. 0.0198 0.0544
#> 8 "‡§Ö‡§Æ‡•ã‡§… Omka… 1 123 57869. 0.0170 0.0509
#> 9 "‡§Ö‡§Æ‡•ã‡§… stpi… 1 130 74475. 0.0198 0.0544
#> 10 "‡§µ‡§ø‡§ï‡•… mark… 1 60 3232. 0 0.0313
#> # … with 911 more rows, and 4 more variables: degree_from <dbl>,
#> # betweenness_from <dbl>, eigen_from <dbl>, page_rank_from <dbl>
#>
#> $user_stats
#> # A tibble: 879 × 7
#> name degree betweenness eigen page_rank user_retweets_o… user_gets_retwe…
#> <chr> <dbl> <dbl> <dbl> <dbl> <int> <int>
#> 1 –ë–æ–… 1 0 0.0638 0.000610 1 NA
#> 2 Ronal… 245 139446. 1 0.127 NA 254
#> 3 –í–ª–… 1 0 0.0638 0.000610 1 NA
#> 4 –ü–∞–… 1 0 0.0638 0.000610 1 NA
#> 5 –û–ª–… 1 0 0.0638 0.000610 1 NA
#> 6 ‚òÅÔ∏… 1 0 0 0.000614 1 NA
#> 7 markr… 60 3232. 0 0.0313 NA 60
#> 8 (A.R.… 1 0 0.0638 0.000610 1 NA
#> 9 ‡§Ö‡§… 1 0 0.00126 0.000526 1 NA
#> 10 stpii… 130 74475. 0.0198 0.0544 NA 133
#> # … with 869 more rows
Notice that we picked out Ronald_vanLoon as an influential user in our directed network - his betweenness has gone from 0 in the directed network to 139,446 in the undirected. Why?
In our directed network there was no path out from his node because he had not retweeted anyone, meaning he did not figure in any shortest paths - anyone attempting to travel through the network would get stuck.
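The same effect can be reproduced on a tiny star-shaped network. This is an igraph sketch on invented data, in which four users each retweet a single account that retweets nobody.

```r
library(igraph)

# Four users each retweet "hub"; "hub" retweets nobody.
toy_edges <- data.frame(
  from = c("a", "b", "c", "d"),
  to   = rep("hub", 4)
)

g_directed   <- graph_from_data_frame(toy_edges, directed = TRUE)
g_undirected <- graph_from_data_frame(toy_edges, directed = FALSE)

# Directed: no path passes through "hub", because nothing leads out of it.
bt_directed <- betweenness(g_directed, directed = TRUE)[["hub"]]

# Undirected: every pair of retweeters is connected via "hub",
# so it lies on choose(4, 2) = 6 shortest paths.
bt_undirected <- betweenness(g_undirected, directed = FALSE)[["hub"]]
```

The hub's betweenness is 0 in the directed graph and 6 in the undirected one, mirroring the jump seen for the most retweeted account above.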
We also have an Eigenvector Centrality score.
Also notice that we do not have degree_in and degree_out columns - this is because in undirected networks degree is the total number of edges that attach to each node.
Social media networks are often large and disconnected - disconnected meaning there are many unconnected subcommunities, or subclusters. Sometimes we want to focus specifically on the largest connected cluster of nodes. We can use the create_supercluster function to do so.
First we `calculate_centrality()`, then we pluck the `edges` and feed them into the `create_supercluster()` function.
rt_supercluster <- rts %>% calculate_centrality(from = sender, to = original_author, directed = TRUE)
#> Warning in betweenness(graph = graph, v = V(graph), directed = directed, :
#> 'nobigint' is deprecated since igraph 1.3 and will be removed in igraph 1.4
#> Warning in eigen_centrality(graph = .G(), directed = directed, scale = scale, :
#> At core/centrality/centrality_other.c:329 : graph is directed and acyclic;
#> eigenvector centralities will be zeros.
rt_supercluster <- rt_supercluster %>%
pluck('edges') %>%
create_supercluster(directed = TRUE)
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
#> Warning in betweenness(graph = graph, v = V(graph), directed = directed, :
#> 'nobigint' is deprecated since igraph 1.3 and will be removed in igraph 1.4
#> Warning in eigen_centrality(graph = .G(), directed = directed, scale = scale, :
#> At core/centrality/centrality_other.c:329 : graph is directed and acyclic;
#> eigenvector centralities will be zeros.
Now let’s visualise:
rt_supercluster %>%
viz_centrality_network(colour = page_rank_to, size = degree_in_to)
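For intuition about what keeping only the largest connected cluster involves, here is a sketch in igraph terms on a made-up edge list. It is not the implementation of `create_supercluster()`, and treating clusters as weakly connected components is an assumption.

```r
library(igraph)

# Two separate clusters: a larger one around "hub" and a small pair.
toy_edges <- data.frame(
  from = c("a", "b", "c", "x"),
  to   = c("hub", "hub", "hub", "y")
)
g <- graph_from_data_frame(toy_edges, directed = TRUE)

# Weakly connected components ignore edge direction.
comp <- components(g, mode = "weak")

# Keep only the nodes that belong to the largest component.
largest   <- which.max(comp$csize)
keep      <- which(comp$membership == largest)
g_largest <- induced_subgraph(g, keep)
```

The small x-y pair is dropped, leaving the four nodes and three edges around "hub".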
To make a mention network with centrality measures, we will first extract mentions from a text variable and then we will calculate centrality and visualise.
One thing to note is that there will tend to be many more mentions per post than retweets as some posts mention tens of accounts. When added to the fact that a user can be mentioned without tweeting beforehand - unlike in the retweet network, where a tweet is a necessary requirement to being retweeted - this can make it difficult for graph-solving algorithms to correctly place nodes. Another way of thinking about this is that the universe of possible mentions is much larger than the universe of possible retweets.
mentions <- mention_example %>%
janitor::clean_names()
mentions <- extract_mentions(mentions, author_variable = screen_name, text_variable = text)
mentions_centrality <- mentions %>%
calculate_centrality(from = screen_name, to = mentions, directed = TRUE)
#> Warning in betweenness(graph = graph, v = V(graph), directed = directed, :
#> 'nobigint' is deprecated since igraph 1.3 and will be removed in igraph 1.4
#> Warning in eigen_centrality(graph = .G(), directed = directed, scale = scale, :
#> At core/centrality/centrality_other.c:329 : graph is directed and acyclic;
#> eigenvector centralities will be zeros.
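To illustrate what extracting mentions from a text variable involves, here is a base R sketch on made-up posts. It is not the implementation of `extract_mentions()`; the regular expression and the output column names are assumptions chosen to mirror the call above.

```r
# Toy posts; the content is made up.
posts <- data.frame(
  screen_name = c("alice", "bob"),
  text = c("Great piece by @nytimes and @WSJ", "no mentions here"),
  stringsAsFactors = FALSE
)

# Find every @handle in each post (letters, digits and underscores).
handles <- regmatches(posts$text, gregexpr("@\\w+", posts$text))

# One row per author-mention pair, with the leading "@" removed.
mention_edges <- data.frame(
  screen_name = rep(posts$screen_name, lengths(handles)),
  mentions    = sub("^@", "", unlist(handles)),
  stringsAsFactors = FALSE
)
```

A post that mentions several accounts produces several edges, and a post with no mentions produces none, which is part of why mention networks grow so quickly.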
Now we should inspect the network to determine what our size and colour variables should be. We will pluck the user stats data frame first, and arrange it by degree_in, the number of times a user was mentioned.
mentions_centrality %>% purrr::pluck('user_stats') %>%
arrange(desc(degree_in))
#> # A tibble: 1,301 × 8
#> name degree_in degree_out betweenness page_rank eigen user_retweets_o…
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1 nytimes 7 0 0 0.00113 0 NA
#> 2 WSJ 5 0 0 0.000899 0 NA
#> 3 amazon 4 0 0 0.000812 0 NA
#> 4 realDonald… 4 0 0 0.000896 0 NA
#> 5 YouTube 4 0 0 0.000938 0 NA
#> 6 elonmusk 4 0 0 0.000802 0 NA
#> 7 Google 4 0 0 0.000860 0 NA
#> 8 Facebook 3 0 0 0.000831 0 NA
#> 9 washington… 3 0 0 0.000828 0 NA
#> 10 Forbes 3 0 0 0.000795 0 NA
#> # … with 1,291 more rows, and 1 more variable: user_gets_retweeted <int>
We can see that all of our top 10 most-mentioned accounts have a degree_out of 0, which means they have not mentioned anyone else. Given that our network is directed, this also means all of these accounts have a betweenness of 0. We hinted at some of the differences between a mention network and a retweet network; here we see another corollary of the universe of mentions being much larger than the universe of retweets: our mention network is sparser.
It looks like degree_in and page_rank will be sensible size and colour variables here:
mentions_centrality %>%
pluck('edges') %>%
viz_mention_centrality_network(size = degree_in_to, colour = page_rank_to, type = "static", label_prop = 0.05, physics = TRUE)
#> Warning: Removed 1176 rows containing missing values (geom_point).
#> Warning: Removed 1256 rows containing missing values (geom_shadow_text).
We can create an interactive visualisation similarly to how we did for our retweet networks. It’s recommended to allow physics when creating an interactive mention network, as without it our graph-solving algorithm often finds it difficult to arrange the network in a pleasing manner.
mentions_centrality %>%
pluck('edges') %>%
viz_mention_centrality_network(size = degree_in_to, colour = page_rank_to, type = "interactive", label_prop = 0.01, physics = TRUE)
“Error in rlang::ensym(): ! Can’t convert to a symbol.”
This probably means that you have not specified a size or colour variable, and you need to!