Centrality measures tell us how to find important nodes in a network. The important nodes could be ones that disseminate information to many other nodes, prevent epidemics, act as hubs in a transportation network, etc.
There are different measures for node importance. Imagine we have a friendship network: depending on the measure, the most important person could be the one with the most friends, the one closest to everyone else, or the one who bridges different groups.
Formally, to measure node importance we have degree centrality, closeness centrality, betweenness centrality, PageRank, authorities & hubs, etc. Different measures make different assumptions; in this post we cover these five measures.
The assumption of degree centrality is that important nodes have many connections. The most basic measure is the number of neighbors.
More specifically, for undirected networks, we define it as follows.
$$C_{deg}(v) = \frac{d_v}{|N| - 1}$$
Where $N$ is the set of nodes in the network and $d_v$ is the degree of node $v$.
# let us examine with data
import networkx as nx
G = nx.karate_club_graph()
G = nx.convert_node_labels_to_integers(G,first_label = 1)
We can first visualize the graph.
# check the available layout
[x for x in nx.__dir__() if x.endswith('_layout')]
['bipartite_layout',
'circular_layout',
'kamada_kawai_layout',
'random_layout',
'rescale_layout',
'shell_layout',
'spring_layout',
'spectral_layout',
'planar_layout',
'fruchterman_reingold_layout',
'spiral_layout',
'multipartite_layout']
import matplotlib.pyplot as plt
plt.figure(figsize = (15, 15))
pos = nx.kamada_kawai_layout(G)
nx.draw_networkx(G, pos)
# use degree_centrality to measure the degree centrality
degCent = nx.degree_centrality(G)
# this gives us a dictionary of centrality of each node
# show first two nodes
list(degCent.items())[:2]
[(1, 0.48484848484848486), (2, 0.2727272727272727)]
# calculate the first node centrality by definition: dv/ (|N|-1)
(len([n for n in G[1]])) / (G.number_of_nodes() -1)
0.48484848484848486
We can see that the centrality of the first node given by the function is the same as the one calculated by the definition.
For directed networks, the centrality is defined as follows.
$$C_{in}(v) = \frac{d_v^{in}}{|N| - 1}$$
Where $N$ is the set of nodes in the network and $d_v^{in}$ is the in-degree of node $v$.
# in-degree centrality (only defined for directed graphs, so we work on a directed copy of G)
D = G.to_directed()
indegCent = nx.in_degree_centrality(D)
We can also use out-degree instead of in-degree.
# out-degree centrality
outdegCent = nx.out_degree_centrality(D)
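As a sanity check, we can verify node 1 against the definition using the directed copy D from above (since every undirected edge became two directed edges, the result matches the plain degree centrality).
# in-degree of node 1 divided by |N| - 1
D.in_degree(1) / (D.number_of_nodes() - 1)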
The assumption of closeness centrality is that important nodes are close to other nodes.
$$C_{close}(v) = \frac{|N| - 1}{\sum_{u \neq v} d(v, u)}$$
Where $N$ is the set of nodes in the network and $d(v, u)$ is the length of the shortest path from $v$ to $u$. The denominator is the sum of the distances from node $v$ to all the other nodes.
closeCent = nx.closeness_centrality(G)
# check the closeness centrality of node 1
closeCent[1]
0.5689655172413793
# what about if we calculate by definition?
# the denominator
sum(nx.shortest_path_length(G,1).values())
58
# the numerator
len(G.nodes())-1
33
(len(G.nodes())-1) / sum(nx.shortest_path_length(G,1).values())
0.5689655172413793
The results are the same. Great.
Then we face a problem: how do we measure the closeness centrality of a node that cannot reach all other nodes?
One option is to consider only the nodes that node $L$ can reach:
$$C_{close}(L) = \frac{|R(L)|}{\sum_{u \in R(L)} d(L, u)}$$
Where $R(L)$ is the set of nodes $L$ can reach.
Another option is to consider only the nodes that node $L$ can reach and, in addition, normalize by the fraction of nodes $L$ can reach:
$$C_{close}(L) = \frac{|R(L)|}{|N| - 1} \cdot \frac{|R(L)|}{\sum_{u \in R(L)} d(L, u)}$$
# use closeness_centrality to find the centrality in an unconnected network
# (in newer NetworkX versions the keyword is wf_improved instead of normalized)
closeCent = nx.closeness_centrality(G, wf_improved=True)
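To see what the normalization does, here is a small sketch on a made-up disconnected graph (the graph G2 below is invented purely for illustration; in older NetworkX versions replace wf_improved with normalized).
# a tiny disconnected graph: a path 1-2-3 plus a separate pair 4-5
G2 = nx.Graph([(1, 2), (2, 3), (4, 5)])
# option 1: ignore the unreachable nodes entirely
print(nx.closeness_centrality(G2, u=1, wf_improved=False))  # 2 / (1 + 2), roughly 0.667
# option 2: also scale by the fraction of other nodes that node 1 can reach (2 out of 4)
print(nx.closeness_centrality(G2, u=1, wf_improved=True))   # roughly 0.667 * 0.5 = 0.333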
The assumption for betweenness centrality is that important nodes connect other nodes.
$$C_{btw}(v) = \sum_{s, t \in N} \frac{\sigma_{s,t}(v)}{\sigma_{s,t}}$$
Where $\sigma_{s,t}(v)$ is the number of shortest paths between nodes $s$ and $t$ that pass through node $v$, and $\sigma_{s,t}$ is the total number of shortest paths between nodes $s$ and $t$.
The intuition behind the equation is that node $v$ has a high betweenness centrality if it shows up in many of the shortest paths between nodes $s$ and $t$. We can either include or exclude node $v$ as one of the endpoints $s$ and $t$ in the computation.
Betweenness centrality values will be larger in graphs with many nodes, so we need normalization to control for this. We divide the centrality values by the number of pairs of nodes in the graph (excluding $v$): in undirected graphs we normalize by $\frac{(|N| - 1)(|N| - 2)}{2}$, and in directed graphs by $(|N| - 1)(|N| - 2)$.
# networkx code
btwnCent = nx.betweenness_centrality(G, normalized= True, endpoints= False)
# take a look at the first 5 nodes
list(btwnCent.items())[:5]
[(1, 0.43763528138528146),
(2, 0.053936688311688304),
(3, 0.14365680615680618),
(4, 0.011909271284271283),
(5, 0.0006313131313131313)]
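To check the normalization described earlier, we can recompute the raw values ourselves and divide by the number of node pairs; this should reproduce the numbers above.
# unnormalized betweenness, then divide by the number of pairs (|N|-1)(|N|-2)/2
btwnCent_raw = nx.betweenness_centrality(G, normalized=False, endpoints=False)
n = G.number_of_nodes()
btwnCent_raw[1] / ((n - 1) * (n - 2) / 2)  # should equal btwnCent[1] above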
# the top 5 nodes with largest betweenness centrality
import operator
sorted(btwnCent.items(), key = operator.itemgetter(1), reverse=True)[0:5]
[(1, 0.43763528138528146),
(34, 0.30407497594997596),
(33, 0.145247113997114),
(3, 0.14365680615680618),
(32, 0.13827561327561325)]
Since computing the betweenness centrality can be computationally expensive, we can approximate the computation by using only a subset of nodes.
# Only consider 10 nodes instead of all 34
btwnCent_apx = nx.betweenness_centrality(G, normalized= True, endpoints= False, k = 10)
sorted(btwnCent_apx.items(), key = operator.itemgetter(1), reverse=True)[0:5]
[(34, 0.3678018278018279),
(1, 0.3187295574795575),
(33, 0.2452744708994709),
(32, 0.1731251503126503),
(3, 0.10262145262145261)]
Sometimes we might only be interested in the betweenness centrality between a subset of nodes.
# select a set of source nodes and target nodes
betwnCent_subset = nx.betweenness_centrality_subset(G, [1,2,3,4,8,10,11,12], [5,6,7,13,14], normalized= True)
list(betwnCent_subset.items())[:5]
[(1, 0.023279671717171716),
(2, 0.00023674242424242425),
(3, 0.004498106060606061),
(4, 0.002130681818181818),
(5, 0.00031565656565656563)]
In addition, we can compute the betweenness centrality of edges instead of nodes. The definition is analogous to the node version; in networkx it can be found with nx.edge_betweenness_centrality().
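For example, on the same graph the call and a top-5 ranking might look like this.
# edge betweenness centrality: which edges carry the most shortest paths?
edgeBtwnCent = nx.edge_betweenness_centrality(G, normalized=True)
# the five most 'bridge-like' edges
sorted(edgeBtwnCent.items(), key=operator.itemgetter(1), reverse=True)[:5]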
PageRank was developed by Google's founders to measure the importance of webpages from the hyperlink network structure. It assigns a score of importance to each node. It is mainly useful for directed networks, where important nodes are those with many in-links from other important pages. The algorithm is as follows.
Assume n = number of nodes in the network and k = number of steps:
1. Assign every node an initial PageRank of 1/n.
2. Perform the Basic PageRank Update Rule k times: each node gives an equal share of its current PageRank to every node it links to, and each node's new PageRank is the sum of the shares it receives.
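To make the update rule concrete, here is a minimal sketch of the basic rule (it leaves out the damping that nx.pagerank adds, and dangling nodes simply lose their share); D can be any directed graph, for example the directed copy we made earlier.
# basic PageRank update rule, repeated k times (simplified sketch, no damping)
def basic_pagerank(D, k):
    n = D.number_of_nodes()
    pr = {node: 1 / n for node in D}            # step 1: everyone starts with 1/n
    for _ in range(k):                          # step 2: repeat the update rule k times
        new_pr = {node: 0 for node in D}
        for node in D:
            out_deg = D.out_degree(node)
            if out_deg == 0:                    # dangling node: its share is dropped in this sketch
                continue
            share = pr[node] / out_deg          # give an equal share to every out-neighbor
            for nbr in D.successors(node):
                new_pr[nbr] += share
        pr = new_pr
    return pr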
# compute PageRank
PR_graph = nx.pagerank(G, alpha=0.8)
Please note that alpha in the equation is a damping parameter. With probability alpha, the random walk of k steps will choose an outgoing edge at random and follow it to the next node. With probability 1 - alpha, it will choose a node at random and go to it. This damping parameter prevents the random walk from getting 'stuck' on specific nodes. Typically, we choose alpha between 0.8 and 0.9.
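If you are curious how much alpha matters on this graph, a quick comparison could look like this (the scores shift a little, but the ranking is usually stable).
# compare a few damping values for node 1
for a in (0.8, 0.85, 0.9):
    print(a, round(nx.pagerank(G, alpha=a)[1], 4))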
Authorities and hubs (the HITS algorithm) are another way to find the central nodes in a network. This method was also developed in the context of how a search engine might go about finding important web pages for a given query, using the hyperlink structure of the web.
The intuition is that instead of considering the whole network, we start with a subset of the network.
Then we run the HITS algorithm. It starts by constructing a root set of relevant web pages and expanding it to a base set. It then computes k iterations of the HITS algorithm to assign an authority score and a hub score to each node.
Similar to the PageRank algorithm, after k iterations the authority and hub scores of the nodes tend to converge. Nodes that have incoming edges from good hubs are good authorities, and nodes that have outgoing edges to good authorities are good hubs.
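Here is a rough sketch of those k iterations (a simplification of what nx.hits does; D can be any directed graph, and for an undirected graph each edge is treated as pointing both ways).
# simplified HITS iterations: update authorities from hubs, then hubs from authorities
def basic_hits(D, k):
    auth = {node: 1 for node in D}
    hub = {node: 1 for node in D}
    for _ in range(k):
        # a node pointed to by good hubs becomes a good authority
        auth = {n: sum(hub[p] for p in D.predecessors(n)) for n in D}
        # a node pointing to good authorities becomes a good hub
        hub = {n: sum(auth[s] for s in D.successors(n)) for n in D}
        # normalize so the scores stay comparable across iterations
        a_sum, h_sum = sum(auth.values()), sum(hub.values())
        auth = {n: v / a_sum for n, v in auth.items()}
        hub = {n: v / h_sum for n, v in hub.items()}
    return hub, auth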
# compute hub and authority scores of G
HA_graph = nx.hits(G)
# hub score
list(HA_graph[0].items())[:5]
[(1, 0.07141272880825199),
(2, 0.05342723123552999),
(3, 0.0637190645563748),
(4, 0.042422737124708995),
(5, 0.015260959706207484)]
# authority score
list(HA_graph[1].items())[:5]
[(1, 0.07141272880825196),
(2, 0.05342723123553),
(3, 0.06371906455637479),
(4, 0.042422737124708995),
(5, 0.015260959706207484)]
The hits() function gives us two dictionaries, keyed by node, with the hub and authority scores of the nodes.
After understanding the different centrality measures and their corresponding functions, we can use them to solve many real-world problems. For example, suppose we work for an online shopping website and are asked to select users in the network to send a voucher to. We expect the recipients to pass the voucher on to their friends, but the voucher can only travel one step. Which users should we select?
Or suppose a competitor wants to remove a person from the network in order to disrupt the distribution of our company's voucher. The competitor is targeting people who often act as bridges of information flow, so who is the riskiest person?
Or, if we have a large collection of websites, which are the important ones among them?
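One reasonable mapping from these questions to the measures we already computed (reusing the karate club graph as a stand-in for the real networks) is shown below.
# 1) the voucher travels one step: pick the users with the highest degree centrality
print(sorted(degCent.items(), key=operator.itemgetter(1), reverse=True)[:5])
# 2) bridges of information flow: the riskiest person has the highest betweenness centrality
print(sorted(btwnCent.items(), key=operator.itemgetter(1), reverse=True)[:1])
# 3) important websites in a hyperlink network: rank the pages by PageRank
print(sorted(PR_graph.items(), key=operator.itemgetter(1), reverse=True)[:5])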
All of these questions can be answered easily with networkx. However, it is always important to keep in mind the intuition and principles behind each measure.
Thank you for reading. Cheers!