Bolin Wu
# Network Analysis Basics

# Networkx vocabulary

## Weighted network

## Signed networks

## Other edge attributes

## Multigraphs

# Edge attributes

# Node attributes

## Bipartite graphs

### Projected graphs

# Graphs manipulation exercise

## Load

## Add node attribute

## Weighted projection graph

## Relationship VS Types of movies

### Another approach

# Ending

A network is a set of objects (nodes) with interconnections (edges). Many complex structures can be represented as networks, and they appear everywhere in different forms: family networks, Facebook communication networks, subway networks, food webs, and so on.

There are plenty of things we can do with networks. For example, from an e-mail communication network we can detect whether a rumor is likely to spread, or find the people who are the most influential in the organization. From a friendship network, we can check whether a club is likely to split into two groups. From a network of flights around the world, we can examine which airports are at the highest risk for virus spreading, or which parts of the world are difficult to reach.

A network (or graph) is a representation of connections among a set of items. The items are called nodes and the connections are called edges.

In Python we can build a network like this:

```
import networkx as nx
G = nx.Graph()
G.add_edge('A', 'B')
G.add_edge('B', 'C')
```
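As a quick check, the same graph can be inspected right after it is built. This is a minimal sketch; the variable names `nodes` and `edges` are only for illustration.

```python
import networkx as nx

G = nx.Graph()
G.add_edge('A', 'B')
G.add_edge('B', 'C')

# Nodes are created automatically when an edge mentions them.
nodes = list(G.nodes())
edges = list(G.edges())

print(nodes)                   # ['A', 'B', 'C']
print(edges)                   # [('A', 'B'), ('B', 'C')]
print(list(G.neighbors('B')))  # ['A', 'C']
```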

Relationships in a network can be symmetric or asymmetric. They are represented by an **undirected** network and a **directed** network respectively.

```
# undirected network
G = nx.Graph()
G.add_edge('A', 'B')
G.add_edge('B', 'C')
# directed network
G = nx.DiGraph()
G.add_edge('A', 'B')
G.add_edge('A', 'C')
```

Not all relationships are equal, so some edges can be weighted higher than others. That brings us to the weighted network: a network where each edge is assigned a weight. In Python this can be done by adding the attribute `weight`.

```
G = nx.Graph()
G.add_edge('A', 'B', weight = 6)
G.add_edge('B', 'C', weight = 12)
```
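Once the weights are stored, they can be read back and aggregated. The sketch below assumes the same two edges as above; `ab_weight`, `total_weight` and `heaviest` are illustrative names.

```python
import networkx as nx

G = nx.Graph()
G.add_edge('A', 'B', weight=6)
G.add_edge('B', 'C', weight=12)

# Weight of a single edge
ab_weight = G['A']['B']['weight']

# Sum of all edge weights in the graph
total_weight = G.size(weight='weight')

# Edge with the largest weight, as a (u, v, weight) tuple
heaviest = max(G.edges(data='weight'), key=lambda edge: edge[2])

print(ab_weight, total_weight, heaviest)
```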

Some networks carry information about friendship and antagonism based on conflict or disagreement. A **signed network** is a network where each edge is assigned a positive or negative sign. This can be done in Python by adding the attribute `sign`.

```
G = nx.Graph()
G.add_edge('A', 'B', sign = '+')
```

Edges can carry many other labels or attributes:

```
G = nx.Graph()
G.add_edge('A', 'B', relation = 'friend')
G.add_edge('B', 'C', relation = 'coworker')
```

A pair of nodes can have more than one type of relationship simultaneously. Such a network is called a multigraph:

```
G = nx.MultiGraph()
G.add_edge('A', 'B', relation = 'friend')
G.add_edge('A', 'B', relation = 'neighbour')
```

Let us now look at how to access edge attributes.

```
G = nx.Graph()
G.add_edge('A', 'B', relation = 'friend', weight = 6)
G.add_edge('B', 'C', relation = 'coworker')
# list of all edges
G.edges()
# list of all edges with attributes
G.edges(data = True)
# list of all edges with one particular attribute
G.edges(data = 'relation')
# attributes of a specific edge (G.edge was removed in networkx 2.x)
G.edges['A', 'B']
# specific attribute of a specific edge
G.edges['A', 'B']['weight']
```

For an undirected graph, the order of A and B does not matter. For a directed graph, however, the order does matter.
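The difference can be seen with a small sketch that adds the same edge to an undirected and a directed graph (the names `U` and `D` are only for illustration):

```python
import networkx as nx

U = nx.Graph()      # undirected
U.add_edge('A', 'B')

D = nx.DiGraph()    # directed
D.add_edge('A', 'B')

print(U.has_edge('B', 'A'))  # True: the order does not matter
print(D.has_edge('A', 'B'))  # True
print(D.has_edge('B', 'A'))  # False: only A -> B was added
```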

In a MultiGraph:

```
G = nx.MultiGraph()
G.add_edge('A', 'B', relation = 'friend', weight = 6)
G.add_edge('A', 'B', relation = 'neighbour', weight = 10)
# accessing edge attributes
G['A']['B'] # a dictionary of attribute dictionaries, one per parallel edge
G['A']['B'][0]['weight'] # weight of the first edge between A and B
```
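To see all parallel edges between two nodes at once, one option is to loop over the per-edge dictionaries. This is a small sketch; `parallel` and `relations` are illustrative names.

```python
import networkx as nx

G = nx.MultiGraph()
G.add_edge('A', 'B', relation='friend', weight=6)
G.add_edge('A', 'B', relation='neighbour', weight=10)

# Number of parallel edges between A and B
parallel = G.number_of_edges('A', 'B')

# G['A']['B'] maps each edge key (0, 1, ...) to that edge's attributes
relations = [attrs['relation'] for attrs in G['A']['B'].values()]

print(parallel)   # 2
print(relations)  # ['friend', 'neighbour']
```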

To add node attributes, we can do as follows:

```
G = nx.Graph()
G.add_edge('A', 'B', weight = 6, relation = 'family')
G.add_edge('B', 'C', weight = 13, relation = 'friend')
# adding node attributes
G.add_node('A', role = 'trader')
G.add_node('B', role = 'trader')
G.add_node('C', role = 'manager')
```

To access node attributes:

```
# list of all nodes
G.nodes()
# list of all nodes with attributes
G.nodes(data = True)
# role of node A
G.nodes['A']['role'] # G.node was removed in recent networkx versions
```
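Node attributes are also handy for filtering. The following sketch selects all nodes with a given role; the list name `traders` is only for illustration.

```python
import networkx as nx

G = nx.Graph()
G.add_edge('A', 'B', weight=6, relation='family')
G.add_edge('B', 'C', weight=13, relation='friend')
G.add_node('A', role='trader')
G.add_node('B', role='trader')
G.add_node('C', role='manager')

# G.nodes(data='role') yields (node, role) pairs
traders = [node for node, role in G.nodes(data='role') if role == 'trader']

print(traders)  # ['A', 'B']
```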

Bipartite graphs are graphs whose nodes can be split into two sets L and R such that every edge connects a node in L with a node in R. For example, suppose we have fans A, B, C, D, E and basketball teams 1, 2, 3, 4. If we ask each fan to choose the teams they like, we can make a bipartite graph out of their preferences.

```
from networkx.algorithms import bipartite
B = nx.Graph() # no separate class for bipartite graphs
# label one set of nodes 0
B.add_nodes_from(['A', 'B', 'C', 'D', 'E'], bipartite = 0)
# label other set of nodes 1
B.add_nodes_from([1, 2, 3, 4], bipartite = 1)
# add_edges_from expects a single list of (node, node) pairs
B.add_edges_from([('A', 1), ('B', 1), ('C', 1), ('C', 3), ('D', 2), ('E', 3), ('E', 4)])
```

Check if a graph is bipartite:

```
bipartite.is_bipartite(B)
```

Check if a set of nodes is one side of a bipartition of a graph:

```
X = set([1,2,3,4])
bipartite.is_bipartite_node_set(B,X)
```

Get each set of nodes of a bipartite graph:

```
bipartite.sets(B)
```

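If the graph is not bipartite, the code above raises an error. In networkx 2.0 and later it also raises an error for a disconnected graph, where the split is ambiguous, unless `top_nodes` is passed.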
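A minimal sketch of both situations, assuming networkx 2.0 or later. The graphs `triangle` and `two_pairs` are made up for illustration and are not part of the example above.

```python
import networkx as nx
from networkx.algorithms import bipartite

# A triangle contains an odd cycle, so it cannot be bipartite.
triangle = nx.Graph([(1, 2), (2, 3), (1, 3)])
triangle_ok = bipartite.is_bipartite(triangle)

try:
    bipartite.sets(triangle)
    sets_failed = False
except nx.NetworkXError:
    sets_failed = True

# A disconnected bipartite graph: tell networkx which side is which.
two_pairs = nx.Graph([('A', 1), ('D', 2)])
left, right = bipartite.sets(two_pairs, top_nodes=['A', 'D'])

print(triangle_ok, sets_failed)
print(left, right)
```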

**L-Bipartite graph projection** is a network of nodes in group L, where a pair of nodes is connected if they have a common neighbor in R in the bipartite graph.

Let us assume that A, B, C, D are a group of friends and that the numbered nodes are teams.

```
B = nx.Graph()
B.add_edges_from([('A', 1), ('B', 1), ('C', 1), ('D', 1), ('B', 2), ('C', 2)])
```

Get a network of fans who have a team in common:

```
X = set(['A', 'B', 'C', 'D'])
P = bipartite.projected_graph(B,X)
```

Get a network of teams that have a fan in common:

```
X = set([1,2]) # only team nodes that exist in B
P = bipartite.projected_graph(B,X)
```
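To make the two projections concrete, here is a small sketch on the same edge list. The names `fans`, `teams`, `fan_edges` and `team_edges` are illustrative.

```python
import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
B.add_edges_from([('A', 1), ('B', 1), ('C', 1), ('D', 1), ('B', 2), ('C', 2)])

# Fans are linked when they share at least one team
fans = bipartite.projected_graph(B, {'A', 'B', 'C', 'D'})
# Teams are linked when they share at least one fan
teams = bipartite.projected_graph(B, {1, 2})

fan_edges = sorted(tuple(sorted(edge)) for edge in fans.edges())
team_edges = sorted(tuple(sorted(edge)) for edge in teams.edges())

print(fan_edges)   # every pair of fans, because all four like team 1
print(team_edges)  # [(1, 2)]
```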

**L-Bipartite weighted graph projection:** An L-Bipartite graph projection with weights on the edges that are proportional to the number of common neighbors between the nodes. In Python we can get it as follows:

```
X = set([1,2]) # only team nodes that exist in B
P = bipartite.weighted_projected_graph(B,X)
```
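The weights can be checked by hand on the same graph. In this sketch B and C share teams 1 and 2, so their edge should get weight 2, while A and B only share team 1. The names `P_fans` and `P_teams` are illustrative.

```python
import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
B.add_edges_from([('A', 1), ('B', 1), ('C', 1), ('D', 1), ('B', 2), ('C', 2)])

P_fans = bipartite.weighted_projected_graph(B, {'A', 'B', 'C', 'D'})
P_teams = bipartite.weighted_projected_graph(B, {1, 2})

print(P_fans['B']['C']['weight'])   # 2 (teams 1 and 2 in common)
print(P_fans['A']['B']['weight'])   # 1 (only team 1 in common)
print(P_teams[1][2]['weight'])      # 2 (fans B and C in common)
```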

Eight employees at a small company were asked to choose 3 movies that they would most enjoy watching for the upcoming company movie night. These choices are stored in the file Employee_Movie_Choices.txt.

A second file, Employee_Relationships.txt, has data on the relationships between different coworkers.

The relationship score ranges from `-100` (Enemies) to `+100` (Best Friends). A value of zero means the two employees haven't interacted or are indifferent.

Load `Employee_Movie_Choices.txt` in a bipartite graph:

```
# import data from google drive
# use the following code if want to connect colab to google drive
from google.colab import drive
drive.mount('/content/drive')
```

```
Mounted at /content/drive
```

```
import networkx as nx
import pandas as pd
import numpy as np
from networkx.algorithms import bipartite
# This is the set of employees
employees = set(['Pablo',
'Lee',
'Georgia',
'Vincent',
'Andy',
'Frida',
'Joan',
'Claude'])
# This is the set of movies
movies = set(['The Shawshank Redemption',
'Forrest Gump',
'The Matrix',
'Anaconda',
'The Social Network',
'The Godfather',
'Monty Python and the Holy Grail',
'Snakes on a Plane',
'Kung Fu Panda',
'The Dark Knight',
'Mean Girls'])
df_EC = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Applied_Social_Network_Analysis_in_Python/resources/Employee_Movie_Choices.txt',
sep = '\t')
df_EC.head()
```

| | #Employee | Movie |
|---|---|---|
| 0 | Andy | Anaconda |
| 1 | Andy | Mean Girls |
| 2 | Andy | The Matrix |
| 3 | Claude | Anaconda |
| 4 | Claude | Monty Python and the Holy Grail |

Use `from_pandas_edgelist` to load the dataframe into a graph.

```
G_bi = nx.from_pandas_edgelist(df_EC, '#Employee', 'Movie')
G_bi
```

```
<networkx.classes.graph.Graph at 0x7f4d2248fbd0>
```

```
# check the number of nodes
G_bi.number_of_nodes()
```

```
19
```
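If the data file is not at hand, the same loading step can be tried on a small dataframe. This sketch uses a made-up three-row sample (`toy`) with the same column names as the real file.

```python
import networkx as nx
import pandas as pd
from networkx.algorithms import bipartite

# Hypothetical sample with the same columns as Employee_Movie_Choices.txt
toy = pd.DataFrame({
    '#Employee': ['Andy', 'Andy', 'Claude'],
    'Movie': ['Anaconda', 'Mean Girls', 'Anaconda'],
})

# Each row becomes one edge between an employee and a movie
G_toy = nx.from_pandas_edgelist(toy, '#Employee', 'Movie')

print(G_toy.number_of_nodes())      # 4: two employees and two movies
print(G_toy.number_of_edges())      # 3: one per row
print(bipartite.is_bipartite(G_toy))
```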

Let us see what the graph looks like.

```
nx.draw_networkx(G_bi)
```

The plot looks a bit messy. We could use matplotlib to tidy it up, but I will skip optimizing the visualization here.

First let us see what the nodes look like.

```
G_bi.nodes()
```

```
NodeView(('Andy', 'Anaconda', 'Mean Girls', 'The Matrix', 'Claude', 'Monty Python and the Holy Grail', 'Snakes on a Plane', 'Frida', 'The Shawshank Redemption', 'The Social Network', 'Georgia', 'Joan', 'Forrest Gump', 'Kung Fu Panda', 'Lee', 'Pablo', 'The Dark Knight', 'Vincent', 'The Godfather'))
```

The nodes consist of employee names and movie names. Let us add an attribute `type` to the nodes so that they are easier to tell apart. This can be achieved with the `set_node_attributes()` function.

```
# build a dictionary that maps each node to its attributes
Dict = {}
for name in employees:
    Dict[name] = {'type' : 'employee'}
for movie in movies:
    Dict[movie] = {'type' : 'movie'}
```

```
Dict
```

```
{'Anaconda': {'type': 'movie'},
'Andy': {'type': 'employee'},
'Claude': {'type': 'employee'},
'Forrest Gump': {'type': 'movie'},
'Frida': {'type': 'employee'},
'Georgia': {'type': 'employee'},
'Joan': {'type': 'employee'},
'Kung Fu Panda': {'type': 'movie'},
'Lee': {'type': 'employee'},
'Mean Girls': {'type': 'movie'},
'Monty Python and the Holy Grail': {'type': 'movie'},
'Pablo': {'type': 'employee'},
'Snakes on a Plane': {'type': 'movie'},
'The Dark Knight': {'type': 'movie'},
'The Godfather': {'type': 'movie'},
'The Matrix': {'type': 'movie'},
'The Shawshank Redemption': {'type': 'movie'},
'The Social Network': {'type': 'movie'},
'Vincent': {'type': 'employee'}}
```

```
# feed Dict to set_node_attributes function
nx.set_node_attributes(G_bi, Dict)
```

```
# let us see if it is successful
G_bi.nodes(data = True)
```

```
NodeDataView({'Andy': {'type': 'employee'}, 'Anaconda': {'type': 'movie'}, 'Mean Girls': {'type': 'movie'}, 'The Matrix': {'type': 'movie'}, 'Claude': {'type': 'employee'}, 'Monty Python and the Holy Grail': {'type': 'movie'}, 'Snakes on a Plane': {'type': 'movie'}, 'Frida': {'type': 'employee'}, 'The Shawshank Redemption': {'type': 'movie'}, 'The Social Network': {'type': 'movie'}, 'Georgia': {'type': 'employee'}, 'Joan': {'type': 'employee'}, 'Forrest Gump': {'type': 'movie'}, 'Kung Fu Panda': {'type': 'movie'}, 'Lee': {'type': 'employee'}, 'Pablo': {'type': 'employee'}, 'The Dark Knight': {'type': 'movie'}, 'Vincent': {'type': 'employee'}, 'The Godfather': {'type': 'movie'}})
```
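An alternative, sketched here on a tiny made-up graph, is to pass a flat dictionary of values together with the `name` argument of `set_node_attributes`. The names `node_types` and `types` are illustrative.

```python
import networkx as nx

G = nx.Graph([('Andy', 'Anaconda'), ('Claude', 'Anaconda')])
employees = {'Andy', 'Claude'}

# One value per node; the attribute name is given separately
node_types = {
    node: 'employee' if node in employees else 'movie'
    for node in G.nodes()
}
nx.set_node_attributes(G, node_types, name='type')

types = nx.get_node_attributes(G, 'type')
print(types)
```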

Great!

Since an employee may choose more than one movie, and we want to know how many movies each pair of employees has in common, we can use a weighted projection graph.

```
p = bipartite.weighted_projected_graph(G_bi,employees)
```

We can use the `to_pandas_edgelist` function to get the dataframe form of the graph.

```
nx.to_pandas_edgelist (p)
```

| | source | target | weight |
|---|---|---|---|
| 0 | Andy | Claude | 1 |
| 1 | Andy | Pablo | 1 |
| 2 | Andy | Frida | 1 |
| 3 | Andy | Lee | 1 |
| 4 | Andy | Joan | 1 |
| 5 | Andy | Georgia | 1 |
| 6 | Claude | Georgia | 3 |
| 7 | Pablo | Frida | 2 |
| 8 | Pablo | Vincent | 1 |
| 9 | Frida | Vincent | 2 |
| 10 | Lee | Joan | 3 |

Notice that the original dataframe does not have the **weight** column. It tells us the number of movies that two employees have in common. We could also find it by data wrangling, but with Network Analysis we get it with just a few lines of code. So cool!

Suppose that, given the two data files, we would like to find out whether people who have a high relationship score also like the same types of movies.

```
# read relationship file
df_R = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Applied_Social_Network_Analysis_in_Python/resources/Employee_Relationships.txt', delim_whitespace=True,
names = ['Employee1', 'Employee2', 'Relationship'] )
df_R.head()
```

| | Employee1 | Employee2 | Relationship |
|---|---|---|---|
| 0 | Andy | Claude | 0 |
| 1 | Andy | Frida | 20 |
| 2 | Andy | Georgia | -10 |
| 3 | Andy | Joan | 30 |
| 4 | Andy | Lee | -10 |

The overall process can be as follows:

- Find the graph forms of both dataframes.
- Merge the two graphs with the `nx.compose` function.
- Use `to_pandas_edgelist` to transform the merged graph into a dataframe.
- Clean the dataframe.
- Find the correlation with the `DataFrame.corr` method.
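The steps above can be sketched end to end on made-up data, so that they run without the two files. The graphs `common` and `relations` and their values are hypothetical stand-ins for the projection graph and the relationship graph.

```python
import networkx as nx
import pandas as pd

# Step 1: graph forms (hypothetical values)
common = nx.Graph()
common.add_edge('Andy', 'Claude', weight=1)
common.add_edge('Lee', 'Joan', weight=3)

relations = nx.Graph()
relations.add_edge('Claude', 'Andy', Relationship=0)   # reversed order on purpose
relations.add_edge('Lee', 'Joan', Relationship=70)
relations.add_edge('Andy', 'Lee', Relationship=-10)

# Step 2: merge; attributes of matching edges are combined
merged = nx.compose(common, relations)

# Step 3: back to a dataframe; missing attributes become NaN
df = nx.to_pandas_edgelist(merged)

# Step 4: pairs with no movie in common get weight 0
df['weight'] = df['weight'].fillna(0)

# Step 5: correlation between the two numeric columns
corr = df[['weight', 'Relationship']].corr(method='pearson')
print(df)
print(corr)
```

Because the graphs are undirected, the edge `('Claude', 'Andy')` is matched with `('Andy', 'Claude')` without any extra work.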

```
# load in the relationship file
G_relations =nx.from_pandas_edgelist(df_R,'Employee1','Employee2', edge_attr = 'Relationship' )
```

```
# merge the two files
merge_graph = nx.compose(p,G_relations)
df_merge = nx.to_pandas_edgelist (merge_graph )
```

```
df_merge.head(15)
```

| | source | target | Relationship | weight |
|---|---|---|---|---|
| 0 | Andy | Claude | 0 | 1.0 |
| 1 | Andy | Pablo | -10 | 1.0 |
| 2 | Andy | Frida | 20 | 1.0 |
| 3 | Andy | Lee | -10 | 1.0 |
| 4 | Andy | Joan | 30 | 1.0 |
| 5 | Andy | Georgia | -10 | 1.0 |
| 6 | Andy | Vincent | 20 | NaN |
| 7 | Claude | Georgia | 90 | 3.0 |
| 8 | Claude | Frida | 0 | NaN |
| 9 | Claude | Joan | 0 | NaN |
| 10 | Claude | Lee | 0 | NaN |
| 11 | Claude | Pablo | 10 | NaN |
| 12 | Claude | Vincent | 0 | NaN |
| 13 | Pablo | Frida | 50 | 2.0 |
| 14 | Pablo | Vincent | -20 | 1.0 |

```
# replace NaN with 0
df_merge['weight'] = df_merge['weight'].replace(np.nan, 0)
```

```
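# select the numeric columns; recent pandas versions raise an error on the name columns
df_merge[['weight', 'Relationship']].corr(method ='pearson')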
```

| | weight | Relationship |
|---|---|---|
| weight | 1.000000 | 0.906093 |
| Relationship | 0.906093 | 1.000000 |

We can see that the number of movies in common and the relationship score are strongly correlated (Pearson coefficient of about 0.91).

For some old versions of networkx, `to_pandas_edgelist` or `to_pandas_dataframe` cannot give us the desired form of dataframe. In this case, we can slightly change the third step above as follows:

- Find the graph forms of both dataframes.
- Merge the two graphs with the `compose` function.
- Get the edges with their attribute dictionaries and use a loop to build the merged dataframe.
- Since we build the dataframe from dictionaries, a column can contain nested dictionaries. Therefore we need to clean the merged dataframe to get the desired form.
- Find the correlation with the `DataFrame.corr` method.

```
# let us see what the edges look like
merge_graph.edges(data = True)
```

```
EdgeDataView([('Andy', 'Claude', {'weight': 1, 'Relationship': 0}), ('Andy', 'Pablo', {'weight': 1, 'Relationship': -10}), ('Andy', 'Frida', {'weight': 1, 'Relationship': 20}), ('Andy', 'Lee', {'weight': 1, 'Relationship': -10}), ('Andy', 'Joan', {'weight': 1, 'Relationship': 30}), ('Andy', 'Georgia', {'weight': 1, 'Relationship': -10}), ('Andy', 'Vincent', {'Relationship': 20}), ('Claude', 'Georgia', {'weight': 3, 'Relationship': 90}), ('Claude', 'Frida', {'Relationship': 0}), ('Claude', 'Joan', {'Relationship': 0}), ('Claude', 'Lee', {'Relationship': 0}), ('Claude', 'Pablo', {'Relationship': 10}), ('Claude', 'Vincent', {'Relationship': 0}), ('Pablo', 'Frida', {'weight': 2, 'Relationship': 50}), ('Pablo', 'Vincent', {'weight': 1, 'Relationship': -20}), ('Pablo', 'Georgia', {'Relationship': 0}), ('Pablo', 'Joan', {'Relationship': 0}), ('Pablo', 'Lee', {'Relationship': 0}), ('Frida', 'Vincent', {'weight': 2, 'Relationship': 60}), ('Frida', 'Georgia', {'Relationship': 0}), ('Frida', 'Joan', {'Relationship': 0}), ('Frida', 'Lee', {'Relationship': 0}), ('Vincent', 'Georgia', {'Relationship': 0}), ('Vincent', 'Joan', {'Relationship': 10}), ('Vincent', 'Lee', {'Relationship': 0}), ('Lee', 'Joan', {'weight': 3, 'Relationship': 70}), ('Lee', 'Georgia', {'Relationship': 10}), ('Joan', 'Georgia', {'Relationship': 0})])
```

```
# use a loop to collect one row per edge: node, node, attribute dictionary
rows = []
for u,v,r in merge_graph.edges(data=True):
    rows.append([u, v, r])
```

```
df_merge = pd.DataFrame(rows, columns=["Employee1", "Employee2", "Common"])
df_merge.head()
```

| | Employee1 | Employee2 | Common |
|---|---|---|---|
| 0 | Andy | Claude | {'weight': 1, 'Relationship': 0} |
| 1 | Andy | Pablo | {'weight': 1, 'Relationship': -10} |
| 2 | Andy | Frida | {'weight': 1, 'Relationship': 20} |
| 3 | Andy | Lee | {'weight': 1, 'Relationship': -10} |
| 4 | Andy | Joan | {'weight': 1, 'Relationship': 30} |

We can see that the `Common` column is not clean. Here, we can use `apply(pd.Series)` to split the column into separate columns.

```
df_merge['Common'].apply(pd.Series).head()
```

| | weight | Relationship |
|---|---|---|
| 0 | 1.0 | 0.0 |
| 1 | 1.0 | -10.0 |
| 2 | 1.0 | 20.0 |
| 3 | 1.0 | -10.0 |
| 4 | 1.0 | 30.0 |

Then we combine the split columns with the original dataframe using the `apply` and `concat` functions.

```
df_merge = pd.concat([df_merge.drop(['Common'], axis=1), df_merge['Common'].apply(pd.Series)], axis=1)
```
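As a possible shortcut, the nested column can be avoided by flattening each edge into one dictionary before building the dataframe. This is a sketch on a made-up two-edge graph, not the approach used above.

```python
import networkx as nx
import pandas as pd

G = nx.Graph()
G.add_edge('Andy', 'Claude', weight=1, Relationship=0)
G.add_edge('Andy', 'Vincent', Relationship=20)   # no movie in common

# One flat dictionary per edge: the two names plus all edge attributes
rows = [
    {'Employee1': u, 'Employee2': v, **attrs}
    for u, v, attrs in G.edges(data=True)
]
df = pd.DataFrame(rows)   # missing attributes become NaN

print(df)
```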

```
df_merge.head(15)
```

| | Employee1 | Employee2 | weight | Relationship |
|---|---|---|---|---|
| 0 | Andy | Claude | 1.0 | 0.0 |
| 1 | Andy | Pablo | 1.0 | -10.0 |
| 2 | Andy | Frida | 1.0 | 20.0 |
| 3 | Andy | Lee | 1.0 | -10.0 |
| 4 | Andy | Joan | 1.0 | 30.0 |
| 5 | Andy | Georgia | 1.0 | -10.0 |
| 6 | Andy | Vincent | NaN | 20.0 |
| 7 | Claude | Georgia | 3.0 | 90.0 |
| 8 | Claude | Frida | NaN | 0.0 |
| 9 | Claude | Joan | NaN | 0.0 |
| 10 | Claude | Lee | NaN | 0.0 |
| 11 | Claude | Pablo | NaN | 10.0 |
| 12 | Claude | Vincent | NaN | 0.0 |
| 13 | Pablo | Frida | 2.0 | 50.0 |
| 14 | Pablo | Vincent | 1.0 | -20.0 |

Cool! Afterwards we can clean the **weight** column and find the correlation as described above.

When I was doing the last task, I ran into trouble merging the two tables.

First, I tried to use the `merge` function, but I got stuck choosing the join-on columns, because both dataframes have two columns of employee names. We do not care about the order of these two columns, but the algorithm does. Then I spent a lot of time on a nested loop in which, in each iteration, I made a tuple of the two names and sorted it, hoping to unify the order of the names. However, it failed to give an elegant result.

In the end, I found that composing two graphs is a great solution. If both graphs have the same nodes with the same edges, they are merged into one while keeping the edge attributes. Therefore my biggest takeaway is that Network Analysis is a great data analysis method, as well as a nice wrangling tool.
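For comparison, the sorted-names idea can be made to work in pandas without a nested loop. This is only a sketch on made-up data; the helper `normalize_pairs` is hypothetical and not part of pandas or networkx.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables
common = pd.DataFrame({
    'Employee1': ['Claude', 'Lee'],
    'Employee2': ['Andy', 'Joan'],
    'weight': [1, 3],
})
relations = pd.DataFrame({
    'Employee1': ['Andy', 'Joan', 'Andy'],
    'Employee2': ['Claude', 'Lee', 'Lee'],
    'Relationship': [0, 70, -10],
})

def normalize_pairs(df):
    """Return a copy where each row's two names are in alphabetical order."""
    out = df.copy()
    ordered = [sorted(pair) for pair in zip(df['Employee1'], df['Employee2'])]
    out['Employee1'] = [first for first, _ in ordered]
    out['Employee2'] = [second for _, second in ordered]
    return out

# After normalizing, both name columns can be used as join keys
merged = normalize_pairs(relations).merge(
    normalize_pairs(common), on=['Employee1', 'Employee2'], how='left'
)
merged['weight'] = merged['weight'].fillna(0)
print(merged)
```

Composing graphs still needs less code, since an undirected graph treats `('Andy', 'Claude')` and `('Claude', 'Andy')` as the same edge.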

Hopefully this post can be helpful to you. Thank you for reading!

Cheers!