Centrality measures tell us how to find important nodes in a network. The important nodes could be ones that disseminate information to many other nodes, prevent epidemics, act as hubs in a transportation network, etc.
There are different measures for node importance. Imagine we have a friendship network: depending on the measure, the most important person could be the one with the most friends, the one closest to everyone else, or the one who bridges different groups.
Formally, to measure node importance we have degree centrality, closeness centrality, betweenness centrality, PageRank, authorities & hubs, etc. Different measures make different assumptions; in this post we cover these five measures.
The assumption of degree centrality is that important nodes have many connections. The most basic measure is the number of neighbors.
More specifically, for undirected networks, we define it as follows.
$$C_{deg}(v) = \frac{d_v}{|N| - 1}$$
Where $N$ is the set of nodes in the network and $d_v$ is the degree of node $v$.
# let us examine with data
import networkx as nx
G = nx.karate_club_graph()
G = nx.convert_node_labels_to_integers(G,first_label = 1)
We can first visualize the graph.
# check the available layout
[x for x in nx.__dir__() if x.endswith('_layout')]
['bipartite_layout',
'circular_layout',
'kamada_kawai_layout',
'random_layout',
'rescale_layout',
'shell_layout',
'spring_layout',
'spectral_layout',
'planar_layout',
'fruchterman_reingold_layout',
'spiral_layout',
'multipartite_layout']
import matplotlib.pyplot as plt
plt.figure(figsize = (15, 15))
pos = nx.kamada_kawai_layout(G)
nx.draw_networkx(G, pos)
# use degree_centrality to measure the degree centrality
degCent = nx.degree_centrality(G)
# this gives us a dictionary of centrality of each node
# show first two nodes
list(degCent.items())[:2]
[(1, 0.48484848484848486), (2, 0.2727272727272727)]
# calculate the first node centrality by definition: dv/ (|N|-1)
(len([n for n in G[1]])) / (G.number_of_nodes() -1)
0.48484848484848486
We can see that the centrality of the first node given by the function is the same as the one calculated by the definition.
For directed networks, the centrality is defined as follows.
$$C_{in}(v) = \frac{d_v^{in}}{|N| - 1}$$
Where $N$ is the set of nodes in the network and $d_v^{in}$ is the in-degree of node $v$.
# in-degree centrality (only defined for directed graphs, so we work on a directed copy of G)
D = G.to_directed()
indegCent = nx.in_degree_centrality(D)
We can also use out-degree instead of in-degree.
# out-degree centrality
outdegCent = nx.out_degree_centrality(D)
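As a sanity check, we can verify node 1 against the definition using the directed copy D from above (since every undirected edge became two directed edges, the result matches the plain degree centrality).
# in-degree of node 1 divided by |N| - 1
D.in_degree(1) / (D.number_of_nodes() - 1)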
The assumption of closeness centrality is that important nodes are close to other nodes.
$$C_{close}(v) = \frac{|N| - 1}{\sum_{u \neq v} d(v, u)}$$
Where $N$ is the set of nodes in the network and $d(v, u)$ is the length of the shortest path from $v$ to $u$. The denominator is the sum of the distances from node $v$ to all the other nodes.
closeCent = nx.closeness_centrality(G)
# check the closeness centrality of node 1
closeCent[1]
0.5689655172413793
# what about if we calculate by definition?
# the denominator
sum(nx.shortest_path_length(G,1).values())
58
# the numerator
len(G.nodes())-1
33
(len(G.nodes())-1) / sum(nx.shortest_path_length(G,1).values())
0.5689655172413793
The results are the same. Great.
Then we face a problem: how do we measure the closeness centrality of a node that cannot reach all other nodes?
One option is to consider only the nodes that node $L$ can reach:
$$C_{close}(L) = \frac{|R(L)|}{\sum_{u \in R(L)} d(L, u)}$$
Where $R(L)$ is the set of nodes $L$ can reach.
Another option is to consider only the nodes that node $L$ can reach and, in addition, normalize by the fraction of nodes $L$ can reach:
$$C_{close}(L) = \frac{|R(L)|}{|N| - 1} \cdot \frac{|R(L)|}{\sum_{u \in R(L)} d(L, u)}$$
# use closeness_centrality to find the centrality in an unconnected network
# (in newer NetworkX versions the keyword is wf_improved instead of normalized)
closeCent = nx.closeness_centrality(G, wf_improved=True)
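To see what the normalization does, here is a small sketch on a made-up disconnected graph (the graph G2 below is invented purely for illustration; in older NetworkX versions replace wf_improved with normalized).
# a tiny disconnected graph: a path 1-2-3 plus a separate pair 4-5
G2 = nx.Graph([(1, 2), (2, 3), (4, 5)])
# option 1: ignore the unreachable nodes entirely
print(nx.closeness_centrality(G2, u=1, wf_improved=False))  # 2 / (1 + 2), roughly 0.667
# option 2: also scale by the fraction of other nodes that node 1 can reach (2 out of 4)
print(nx.closeness_centrality(G2, u=1, wf_improved=True))   # roughly 0.667 * 0.5 = 0.333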
The assumption for betweenness centrality is that important nodes connect other nodes.
$$C_{btw}(v) = \sum_{s, t \in N} \frac{\sigma_{s,t}(v)}{\sigma_{s,t}}$$
Where $\sigma_{s,t}(v)$ is the number of shortest paths between nodes $s$ and $t$ that pass through node $v$, and $\sigma_{s,t}$ is the total number of shortest paths between nodes $s$ and $t$.
The intuition behind the equation is that node $v$ has a high betweenness centrality if it shows up in many of the shortest paths between nodes $s$ and $t$. We can either include or exclude node $v$ as one of the endpoints $s$ and $t$ in the computation.
Betweenness centrality values will be larger in graphs with many nodes, so we need normalization to control for this. We divide the centrality values by the number of pairs of nodes in the graph (excluding $v$): in undirected graphs we normalize by $\frac{(|N| - 1)(|N| - 2)}{2}$, and in directed graphs by $(|N| - 1)(|N| - 2)$.
# networkx code
btwnCent = nx.betweenness_centrality(G, normalized= True, endpoints= False)
# take a look at the first 5 nodes
list(btwnCent.items())[:5]
[(1, 0.43763528138528146),
(2, 0.053936688311688304),
(3, 0.14365680615680618),
(4, 0.011909271284271283),
(5, 0.0006313131313131313)]
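To check the normalization described earlier, we can recompute the raw values ourselves and divide by the number of node pairs; this should reproduce the numbers above.
# unnormalized betweenness, then divide by the number of pairs (|N|-1)(|N|-2)/2
btwnCent_raw = nx.betweenness_centrality(G, normalized=False, endpoints=False)
n = G.number_of_nodes()
btwnCent_raw[1] / ((n - 1) * (n - 2) / 2)  # should equal btwnCent[1] above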
# the top 5 nodes with largest betweenness centrality
import operator
sorted(btwnCent.items(), key = operator.itemgetter(1), reverse=True)[0:5]
[(1, 0.43763528138528146),
(34, 0.30407497594997596),
(33, 0.145247113997114),
(3, 0.14365680615680618),
(32, 0.13827561327561325)]
Since computing the betweenness centrality can be computationally expensive, we can approximate the computation by using only a subset of nodes.
# Only consider 10 nodes instead of all 34
btwnCent_apx = nx.betweenness_centrality(G, normalized= True, endpoints= False, k = 10)
sorted(btwnCent_apx.items(), key = operator.itemgetter(1), reverse=True)[0:5]
[(34, 0.3678018278018279),
(1, 0.3187295574795575),
(33, 0.2452744708994709),
(32, 0.1731251503126503),
(3, 0.10262145262145261)]
Sometimes we might only be interested in the betweenness centrality between a subset of nodes.
# select a set of source nodes and target nodes
betwnCent_subset = nx.betweenness_centrality_subset(G, [1,2,3,4,8,10,11,12], [5,6,7,13,14], normalized= True)
list(betwnCent_subset.items())[:5]
[(1, 0.023279671717171716),
(2, 0.00023674242424242425),
(3, 0.004498106060606061),
(4, 0.002130681818181818),
(5, 0.00031565656565656563)]
In addition, we can compute the betweenness centrality of edges instead of nodes. The definition is analogous to the node version; in networkx it can be found with nx.edge_betweenness_centrality().
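For example, on the same graph the call and a top-5 ranking might look like this.
# edge betweenness centrality: which edges carry the most shortest paths?
edgeBtwnCent = nx.edge_betweenness_centrality(G, normalized=True)
# the five most 'bridge-like' edges
sorted(edgeBtwnCent.items(), key=operator.itemgetter(1), reverse=True)[:5]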
PageRank was developed by Google's founders to measure the importance of webpages from the hyperlink network structure. It assigns a score of importance to each node. It is mainly useful for directed networks, where important nodes are those with many in-links from other important pages. The algorithm is as follows.
Assume n = number of nodes in the network and k = number of steps:
1. Assign every node an initial PageRank of 1/n.
2. Perform the Basic PageRank Update Rule k times: each node gives an equal share of its current PageRank to every node it links to, and each node's new PageRank is the sum of the shares it receives.
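To make the update rule concrete, here is a minimal sketch of the basic rule (it leaves out the damping that nx.pagerank adds, and dangling nodes simply lose their share); D can be any directed graph, for example the directed copy we made earlier.
# basic PageRank update rule, repeated k times (simplified sketch, no damping)
def basic_pagerank(D, k):
    n = D.number_of_nodes()
    pr = {node: 1 / n for node in D}            # step 1: everyone starts with 1/n
    for _ in range(k):                          # step 2: repeat the update rule k times
        new_pr = {node: 0 for node in D}
        for node in D:
            out_deg = D.out_degree(node)
            if out_deg == 0:                    # dangling node: its share is dropped in this sketch
                continue
            share = pr[node] / out_deg          # give an equal share to every out-neighbor
            for nbr in D.successors(node):
                new_pr[nbr] += share
        pr = new_pr
    return pr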
# compute PageRank
PR_graph = nx.pagerank(G, alpha=0.8)
Please note that alpha in the equation is a damping parameter. With probability alpha, the random walk of k steps will choose an outgoing edge at random and follow it to the next node. With probability 1 - alpha, it will choose a node at random and go to it. This damping parameter prevents the random walk from getting 'stuck' on specific nodes. Typically, we choose alpha between 0.8 and 0.9.
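If you are curious how much alpha matters on this graph, a quick comparison could look like this (the scores shift a little, but the ranking is usually stable).
# compare a few damping values for node 1
for a in (0.8, 0.85, 0.9):
    print(a, round(nx.pagerank(G, alpha=a)[1], 4))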
Authorities and hubs (the HITS algorithm) are another way to find the central nodes in a network. This method was also developed in the context of how a search engine might go about finding important web pages for a given query, using the hyperlink structure of the web.
The intuition is that instead of considering the whole network, we start with a subset of the network.
Then we run the HITS algorithm. It starts by constructing a root set of relevant web pages and expanding it to a base set. It then computes k iterations of the HITS algorithm to assign an authority score and a hub score to each node.
Similar to the PageRank algorithm, after k iterations the authority and hub scores of the nodes tend to converge. Nodes that have incoming edges from good hubs are good authorities, and nodes that have outgoing edges to good authorities are good hubs.
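Here is a rough sketch of those k iterations (a simplification of what nx.hits does; D can be any directed graph, and for an undirected graph each edge is treated as pointing both ways).
# simplified HITS iterations: update authorities from hubs, then hubs from authorities
def basic_hits(D, k):
    auth = {node: 1 for node in D}
    hub = {node: 1 for node in D}
    for _ in range(k):
        # a node pointed to by good hubs becomes a good authority
        auth = {n: sum(hub[p] for p in D.predecessors(n)) for n in D}
        # a node pointing to good authorities becomes a good hub
        hub = {n: sum(auth[s] for s in D.successors(n)) for n in D}
        # normalize so the scores stay comparable across iterations
        a_sum, h_sum = sum(auth.values()), sum(hub.values())
        auth = {n: v / a_sum for n, v in auth.items()}
        hub = {n: v / h_sum for n, v in hub.items()}
    return hub, auth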
# compute hub and authority scores of G
HA_graph = nx.hits(G)
# hub score
list(HA_graph[0].items())[:5]
[(1, 0.07141272880825199),
(2, 0.05342723123552999),
(3, 0.0637190645563748),
(4, 0.042422737124708995),
(5, 0.015260959706207484)]
# authority score
list(HA_graph[1].items())[:5]
[(1, 0.07141272880825196),
(2, 0.05342723123553),
(3, 0.06371906455637479),
(4, 0.042422737124708995),
(5, 0.015260959706207484)]
The hits() function gives us two dictionaries, keyed by node, with the hub and authority scores of the nodes.
After understanding the different centrality measures and their corresponding functions, we can use them to solve many real-world problems. For example, suppose we work for an online shopping website and are asked to select users in the network to send a voucher to. We expect the recipients to pass the voucher on to their friends, but the voucher can only travel one step. Which users should we select?
Or suppose a competitor wants to remove a person from the network in order to disrupt the distribution of our company's voucher. The competitor is targeting people who often act as bridges of information flow, so who is the riskiest person?
Or, if we have a large collection of websites, which are the important ones among them?
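One reasonable mapping from these questions to the measures we already computed (reusing the karate club graph as a stand-in for the real networks) is shown below.
# 1) the voucher travels one step: pick the users with the highest degree centrality
print(sorted(degCent.items(), key=operator.itemgetter(1), reverse=True)[:5])
# 2) bridges of information flow: the riskiest person has the highest betweenness centrality
print(sorted(btwnCent.items(), key=operator.itemgetter(1), reverse=True)[:1])
# 3) important websites in a hyperlink network: rank the pages by PageRank
print(sorted(PR_graph.items(), key=operator.itemgetter(1), reverse=True)[:5])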
All of these questions can be answered easily with networkx. However, it is always important to keep in mind the intuition and principles behind each measure.
Thank you for reading. Cheers!