Network Analysis Basics

2021-08-04 · 18 min read

Networks is a set of objects (nodes) with interconnections (edges). Many complex structures can be represented by networks. It is everywhere in different forms. For example, family network, Facebook communication network, subway network, food web, etc.

There are plenty of things we can do with networks. For example, from an e-mail communication network we can detect if a rumor is likely to spread in the network. The people who is the most influential in the organization. From a friendship network, we can check if it is likely to split a club into two groups. From a network of flights around the world, we can examine which airports are at highest risk for virus spreading or parts of the world that are difficult to reach.

Networkx vocabulary

Network (or Graph) is a representation of connections among a set of items. Items are called nodes. Connections are called edges.

In python we can build network like this:

import networkx as nx
G = nx.Graph()
G.add_edge('A', 'B')
G.add_edge('B', 'C')

In network, there are symmetric and asymmetric relationships. They can be represented by undirected network and directed network respectively.

# undirected network
G = nx.Graph()
G.add_edge('A', 'B')
G.add_edge('B', 'C')

# directed network
G = nx.DiGraph()
G.add_edge('A', 'B')
G.add_edge('A', 'C')

Weighted network

As we know, not all relationships are equal which means some edges can be weighted higher than others. That brings us to weighted network. Weighted network is a network where edges are assigned a weigtht. In Python this can be done by adding attribute 'weight'.

G = nx.Graph()
G.add_edge('A', 'B', weight = 6)
G.add_edge('B', 'C', weight = 12)

Signed networks

Some networks can carry information about friendship and antagonism based on conflict or disagreement. Signed network is a network where edges are assigned positive or negative sign. This can be done in Python by adding attribute 'sign'.

G = nx.Graph()
G.add_edge('A', 'B', sign = '+')

Other edge attributes

Edges can carry many other labels or attributes

G = nx.Graph()
G.add_edge('A', 'B', relation = 'friend')
G.add_edge('B', 'C', relation = 'coworker')

Multigraphs

A pair of nodes can have more than one type of relationships simultaneously.

G = nx.MultiGraph()
G.add_edge('A', 'B', relation = 'friend')
G.add_edge('A', 'B', relation = 'neighbour')

Edge attributes

Here let us continue on assessing the loaded the Edge attributes.

G = nx.Graph()
G.add_edge('A', 'B', relation = 'friend', weight = 6)
G.add_edge('B', 'C', relation = 'coworker')

# find the list of edges 
G.edges()

# list of all edges with attributes
G.edges(data = True)

# for particular attribute
G.edges(data = 'relation')

# attribute of specific edge
G.edge['A']['B']

# specific attribute of specific edge
G.edge['A']['B']['weight']

For undirected graph, the order of A and B does not matter. However, for directed graph the order does matter.

In MultiGraph:

G = nx.MultiGraph()
G.add_edge('A', 'B', relation = 'friend', weight = 6)
G.add_edge('A', 'B', relation = 'neighbour', weight = 10)

# accessing edge attributes
G.edge['A']['B'] # this gives a dictionary of attrbute per edge

G.edge['A']['B'][0]['weight']

Node attributes

To add node attributes, we can do as follows

G = nx.Graph()
G.add_edge('A', 'B', weight = 6, relation = 'family')
G.add_edge('B', 'C', weight = 13, relation = 'friend')

# adding node attributes
G.add_node('A', role = 'trader')
G.add_node('B', role = 'trader')
G.add_node('C', role = 'manager')

To assess node attributes:

# list of all nodes
G.nodes()

# list of all nodes with attributes
G.nodes(data = True)

# role of node A
G.node['A']['role']

Bipartite graphs

Bipartite graphs are whose nodes can be split into two sets L and R. Every edge connects an node in L with a node in R. For example, if we have fans A, B, C and three basketball teams 1, 2, 3. If we ask each of the three fans to choose the teams they like. Then we can make a bipartite graphs out of their preference.

from networkx.algorithms import bipartite

B = nx.Graph() # no separate class for bipartite graphs

# label one set of nodes 0
B.add_nodes_from(['A', 'B', 'C', 'D', 'E'], bipartite = 0)
# label other set of nodes 1
B.add_edges_from(['A',1],['B',1], ['C',1], ['C',3], ['D',2], ['E',3], ['E',4])

Check if a graph is bipartite:

bipartite.is_bipartite(B)

Check if a set of nodes is a bipartition of a graph:

X = set([1,2,3,4])
bipartite.is_bipartite_node_set(B,X)

Get each set of nodes of a bipartite graph:

bipartite.sets(B)

If we ask for a graph that is not bipartite, then the code above will give error message.

Projected graphs

L-Bipartite graph projection is a network of nodes in group L, where a pair of nodes is connected if they have a common neighbor in R in the bipartite graph.

Let us assume than A, B, C, D are a group of friends. 1, 2, 3 are a group of teams.

B = nx.Graph()
B.add_edges_from([('A', 1), ('B', 1), ('C', 1), ('D', 1), ('B', 2), ('C', 2)])

Get a network of fans who have a team in common:

X = set(['A', 'B', 'C', 'D'])
P = bipartite.projected_graph(B,X)

Get a network of teams who have a fan common:

X = set([1,2,3,4])
P = bipartite.projected_graph(B,X)

L-Bipartite weighted graph projection: An L-Bipartite graph projection with weights on the edges that are proportional to the number of common neighbors between the nodes. In Python we can get it as follows:

X = set([1,2,3,4])
P = bipartite.weighted_projected_graph(B,X)

Graphs manipulation exercise

Eight employees at a small company were asked to choose 3 movies that they would most enjoy watching for the upcoming company movie night. These choices are stored in the file Employee_Movie_Choices.txt.

A second file, Employee_Relationships.txt, has data on the relationships between different coworkers.

The relationship score has value of -100 (Enemies) to +100 (Best Friends). A value of zero means the two employees haven't interacted or are indifferent.

Load `Employee_Movie_Choices.txt` in bipartite graph

# import data from google drive
# use the following code if want to connect colab to google drive
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive

import networkx as nx
import pandas as pd
import numpy as np
from networkx.algorithms import bipartite


# This is the set of employees
employees = set(['Pablo',
                 'Lee',
                 'Georgia',
                 'Vincent',
                 'Andy',
                 'Frida',
                 'Joan',
                 'Claude'])

# This is the set of movies
movies = set(['The Shawshank Redemption',
              'Forrest Gump',
              'The Matrix',
              'Anaconda',
              'The Social Network',
              'The Godfather',
              'Monty Python and the Holy Grail',
              'Snakes on a Plane',
              'Kung Fu Panda',
              'The Dark Knight',
              'Mean Girls'])

df_EC = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Applied_Social_Network_Analysis_in_Python/resources/Employee_Movie_Choices.txt', 
sep = '\t')
df_EC.head()

	#Employee	Movie
0	Andy	Anaconda
1	Andy	Mean Girls
2	Andy	The Matrix
3	Claude	Anaconda
4	Claude	Monty Python and the Holy Grail

Use from_pandas_edgelist to load dataframe in graph.

G_bi = nx.from_pandas_edgelist(df_EC, '#Employee', 'Movie')
G_bi

<networkx.classes.graph.Graph at 0x7f4d2248fbd0>

# check the number of nodes
G_bi.number_of_nodes()

Let us see how the graph looks like.

nx.draw_networkx(G_bi)

The plot looks a bit messy. We can use matplotlib to fix it, but I would like to skip optimizing visualization here.

Add node attribute

First let us see how the nodes look like.

G_bi.nodes()

NodeView(('Andy', 'Anaconda', 'Mean Girls', 'The Matrix', 'Claude', 'Monty Python and the Holy Grail', 'Snakes on a Plane', 'Frida', 'The Shawshank Redemption', 'The Social Network', 'Georgia', 'Joan', 'Forrest Gump', 'Kung Fu Panda', 'Lee', 'Pablo', 'The Dark Knight', 'Vincent', 'The Godfather'))

It consists of employees' names and the movies name. Let us add an attribute type to the nodes so that it will be better understandable. This can be achieved by using set_node_attributes() function.

# st
Dict = {}
for name in employees:
  Dict[name] = {'type' : 'employee'}
for movie in movies:
  Dict[movie] = {'type' : 'movie'}

Dict

{'Anaconda': {'type': 'movie'},
 'Andy': {'type': 'employee'},
 'Claude': {'type': 'employee'},
 'Forrest Gump': {'type': 'movie'},
 'Frida': {'type': 'employee'},
 'Georgia': {'type': 'employee'},
 'Joan': {'type': 'employee'},
 'Kung Fu Panda': {'type': 'movie'},
 'Lee': {'type': 'employee'},
 'Mean Girls': {'type': 'movie'},
 'Monty Python and the Holy Grail': {'type': 'movie'},
 'Pablo': {'type': 'employee'},
 'Snakes on a Plane': {'type': 'movie'},
 'The Dark Knight': {'type': 'movie'},
 'The Godfather': {'type': 'movie'},
 'The Matrix': {'type': 'movie'},
 'The Shawshank Redemption': {'type': 'movie'},
 'The Social Network': {'type': 'movie'},
 'Vincent': {'type': 'employee'}}

# feed Dict to set_node_attributes function
nx.set_node_attributes(G_bi, Dict)

# let us see if it is successful
G_bi.nodes(data = True)

NodeDataView({'Andy': {'type': 'employee'}, 'Anaconda': {'type': 'movie'}, 'Mean Girls': {'type': 'movie'}, 'The Matrix': {'type': 'movie'}, 'Claude': {'type': 'employee'}, 'Monty Python and the Holy Grail': {'type': 'movie'}, 'Snakes on a Plane': {'type': 'movie'}, 'Frida': {'type': 'employee'}, 'The Shawshank Redemption': {'type': 'movie'}, 'The Social Network': {'type': 'movie'}, 'Georgia': {'type': 'employee'}, 'Joan': {'type': 'employee'}, 'Forrest Gump': {'type': 'movie'}, 'Kung Fu Panda': {'type': 'movie'}, 'Lee': {'type': 'employee'}, 'Pablo': {'type': 'employee'}, 'The Dark Knight': {'type': 'movie'}, 'Vincent': {'type': 'employee'}, 'The Godfather': {'type': 'movie'}})

Great!

Weighted projection graph

Since an employee might choose more than one movies, and we want to know how many movies different pairs of employees may choose in common, we can find it through weighted projection graph.

p = bipartite.weighted_projected_graph(G_bi,employees)

We can use the to_pandas_edgelist functiuon to find the dataframe form of graph.

nx.to_pandas_edgelist (p)

	source	target	weight
0	Andy	Claude	1
1	Andy	Pablo	1
2	Andy	Frida	1
3	Andy	Lee	1
4	Andy	Joan	1
5	Andy	Georgia	1
6	Claude	Georgia	3
7	Pablo	Frida	2
8	Pablo	Vincent	1
9	Frida	Vincent	2
10	Lee	Joan	3

Notive that the original dataframe does not have the weight column. It tells us the number of common interested moveis of two employees. We may also find it by wrangling, however, with Network Analysis we can find it easily with several lines of code. So cool!

Relationship VS Types of movies

Suppose given the two data files, we would like to find out if people that have a high relationship score also like the same types of movies.

# read relationship file
df_R = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Applied_Social_Network_Analysis_in_Python/resources/Employee_Relationships.txt', delim_whitespace=True, 
                    names = ['Employee1', 'Employee2', 'Relationship'] )
df_R.head()

	Employee1	Employee2	Relationship
0	Andy	Claude	0
1	Andy	Frida	20
2	Andy	Georgia	-10
3	Andy	Joan	30
4	Andy	Lee	-10

An overall process can be as follows:

Find the graph forms of both dataframes.
Merge the two graphs by nx.compose method.
Use to_pandas_edgelist to transform the merged graph to dataframe.
Clean dataframe.
Find the correlation by pd.corr function

# load in the relationship file 
G_relations =nx.from_pandas_edgelist(df_R,'Employee1','Employee2', edge_attr =  'Relationship' )

# merge the two files
merge_graph = nx.compose(p,G_relations)
df_merge = nx.to_pandas_edgelist (merge_graph )

df_merge.head(15)

	source	target	Relationship	weight
0	Andy	Claude	0	1.0
1	Andy	Pablo	-10	1.0
2	Andy	Frida	20	1.0
3	Andy	Lee	-10	1.0
4	Andy	Joan	30	1.0
5	Andy	Georgia	-10	1.0
6	Andy	Vincent	20	NaN
7	Claude	Georgia	90	3.0
8	Claude	Frida	0	NaN
9	Claude	Joan	0	NaN
10	Claude	Lee	0	NaN
11	Claude	Pablo	10	NaN
12	Claude	Vincent	0	NaN
13	Pablo	Frida	50	2.0
14	Pablo	Vincent	-20	1.0

# replace NaN with 0
df_merge['weight'] = df_merge['weight'].replace(np.nan, 0)

df_merge.corr(method ='pearson')

	weight	Relationship
weight	1.000000	0.906093
Relationship	0.906093	1.000000

We can see that these two have strong correlation.

Another approach

For some old versions of networkx, to_pandas_edgelist or to_pandas_dataframe can not give us the desired form of dataframe. In this case, we can slightly the change the step 3 above as follows:

Find the graph forms of both dataframes.
Merge the two graphs by compose method.
Find the edges dictionary and use loop to get merged dataframe.
Since we make the dataframe from dictionary, there can be nested dictionary in dataframe column. Therefore we need to clean the merged dataframe, in order to get the desired form.
Find the correlation by pd.corr function

# let us see how the edges look like
merge_graph.edges(data = True)

EdgeDataView([('Andy', 'Claude', {'weight': 1, 'Relationship': 0}), ('Andy', 'Pablo', {'weight': 1, 'Relationship': -10}), ('Andy', 'Frida', {'weight': 1, 'Relationship': 20}), ('Andy', 'Lee', {'weight': 1, 'Relationship': -10}), ('Andy', 'Joan', {'weight': 1, 'Relationship': 30}), ('Andy', 'Georgia', {'weight': 1, 'Relationship': -10}), ('Andy', 'Vincent', {'Relationship': 20}), ('Claude', 'Georgia', {'weight': 3, 'Relationship': 90}), ('Claude', 'Frida', {'Relationship': 0}), ('Claude', 'Joan', {'Relationship': 0}), ('Claude', 'Lee', {'Relationship': 0}), ('Claude', 'Pablo', {'Relationship': 10}), ('Claude', 'Vincent', {'Relationship': 0}), ('Pablo', 'Frida', {'weight': 2, 'Relationship': 50}), ('Pablo', 'Vincent', {'weight': 1, 'Relationship': -20}), ('Pablo', 'Georgia', {'Relationship': 0}), ('Pablo', 'Joan', {'Relationship': 0}), ('Pablo', 'Lee', {'Relationship': 0}), ('Frida', 'Vincent', {'weight': 2, 'Relationship': 60}), ('Frida', 'Georgia', {'Relationship': 0}), ('Frida', 'Joan', {'Relationship': 0}), ('Frida', 'Lee', {'Relationship': 0}), ('Vincent', 'Georgia', {'Relationship': 0}), ('Vincent', 'Joan', {'Relationship': 10}), ('Vincent', 'Lee', {'Relationship': 0}), ('Lee', 'Joan', {'weight': 3, 'Relationship': 70}), ('Lee', 'Georgia', {'Relationship': 10}), ('Joan', 'Georgia', {'Relationship': 0})])

# use loop to make dataframe
rows = []
for u,v,r in merge_graph.edges(data=True):
     rows.append([u, v, r])

df_merge = pd.DataFrame(rows, columns=["Employee1", "Employee2", "Common"])
df_merge.head()

	Employee1	Employee2	Common
0	Andy	Claude	{'weight': 1, 'Relationship': 0}
1	Andy	Pablo	{'weight': 1, 'Relationship': -10}
2	Andy	Frida	{'weight': 1, 'Relationship': 20}
3	Andy	Lee	{'weight': 1, 'Relationship': -10}
4	Andy	Joan	{'weight': 1, 'Relationship': 30}

We can see that the Common column is not clean. Here, we can use apply(pd.Series) to split the column to.

df_merge['Common'].apply(pd.Series).head()

	weight	Relationship
0	1.0	0.0
1	1.0	-10.0
2	1.0	20.0
3	1.0	-10.0
4	1.0	30.0

Then we combine the splitted columns to the original dataframe by apply and concat function.

df_merge = pd.concat([df_merge.drop(['Common'], axis=1), df_merge['Common'].apply(pd.Series)], axis=1)

df_merge.head(15)

	Employee1	Employee2	weight	Relationship
0	Andy	Claude	1.0	0.0
1	Andy	Pablo	1.0	-10.0
2	Andy	Frida	1.0	20.0
3	Andy	Lee	1.0	-10.0
4	Andy	Joan	1.0	30.0
5	Andy	Georgia	1.0	-10.0
6	Andy	Vincent	NaN	20.0
7	Claude	Georgia	3.0	90.0
8	Claude	Frida	NaN	0.0
9	Claude	Joan	NaN	0.0
10	Claude	Lee	NaN	0.0
11	Claude	Pablo	NaN	10.0
12	Claude	Vincent	NaN	0.0
13	Pablo	Frida	2.0	50.0
14	Pablo	Vincent	1.0	-20.0

Cool! Afterwards we can clean the weight column and find correlation as mentioned above.

Ending

When I was doing the last task, I met troubles with merging the two tables.

Firstly, I tried to use merge function, however, I got stuck in choosing the join-on columns. Because in both dataframes, there are two emplyee names' columns. We do not care about the order of these two columns but the algorithm can detect the difference. Then, I spent a lot of time using the nested loop, and in each iteration, I tried to make a tuple of the two names and sort them, in hope of unifying the names order. However, it fails to give an elegant result.

In the end, I found that composing two graphs is a great solution. If in both graphs have same nodes with same edges, they will be merged into one, while keeping the edge attributes. Therefore my biggest takeaway is that Network Analysis is a great data analysis method, as well as a nice wrangling tool.

Hopefully this post can be helpful to you. Thank you for reading!

Cheers!

Networkx vocabulary
Edge attributes
Node attributes
- Bipartite graphs
  - Projected graphs
Graphs manipulation exercise
Ending