Exploring the Dynamics of PageRank in Evolving Networks: A Reproduction Study


This is the group project for SDSC3001 – Big Data: The Arts and Science of Scaling, which I completed in Semester A of my Year 3 (2022/23).

Presentation Slides:

Course Instructor: Prof. YANG Yu

Project Requirement

In the project, you and your teammates (at most 4 students per team) need to reproduce a paper about big data techniques. To reproduce a paper, you need to read it and understand its core techniques. If you have trouble understanding the theoretical parts, that is okay. However, you should at least be able to implement the algorithms proposed in the paper. Then you can try to apply your implementation on the datasets used in the paper as well as other datasets available online to test the performance of the proposed algorithms. Finally, you need to write a report that includes 

(1) The research problem that the paper wants to address.

(2) The challenges of the research problem.

(3) A brief introduction of the solution proposed by the paper.

(4) The experiments you conduct to test the performance of the proposed algorithms. 

(5) Your thoughts about how to improve the proposed algorithms. (Optional)

To find an appropriate paper, you may want to browse recent papers published in major big data conferences such as SIGMOD (https://sigmod.org/), VLDB (https://www.vldb.org/), KDD (https://www.kdd.org/), ICDE (https://icde2021.gr/), etc. Below is a list of recommended papers from which you can choose. Note that it is totally okay if you choose a paper that is not on the list.

1. Zhao, Kangfei, et al. “A Learned Sketch for Subgraph Counting.” Proceedings of the 2021 International Conference on Management of Data. 2021.

2. Peng, Jinglin, et al. “AQP++: Connecting Approximate Query Processing With Aggregate Precomputation for Interactive Analytics.” Proceedings of the 2018 International Conference on Management of Data. 2018.

3. Zeighami, Sepanta, Cyrus Shahabi, and John Krumm. “Estimating Spread of Contact-Based Contagions in a Population Through Sub-Sampling.” arXiv preprint arXiv:2012.06987 (2020).

4. Jha, Madhav, C. Seshadhri, and Ali Pinar. “Path sampling: A fast and provable method for estimating 4-vertex subgraph counts.” Proceedings of the 24th international conference on world wide web. 2015.

5. Chen, Xiaowei, et al. “A general framework for estimating graphlet statistics via random walk.” arXiv preprint arXiv:1603.07504 (2016).

6. Malkov, Yu A., and Dmitry A. Yashunin. “Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs.” IEEE transactions on pattern analysis and machine intelligence 42.4 (2018): 824-836.

7. Qiu, Yuan, et al. “Weighted Distinct Sampling: Cardinality Estimation for SPJ Queries.” Proceedings of the 2021 International Conference on Management of Data. 2021.

8. Dai, Zhenwei, et al. “Active Sampling Count Sketch (ASCS) for Online Sparse Estimation of a Trillion Scale Covariance Matrix.” Proceedings of the 2021 International Conference on Management of Data. 2021.

9. Liang, Xi, et al. “Combining aggregation and sampling (nearly) optimally for approximate query processing.” Proceedings of the 2021 International Conference on Management of Data. 2021.

10. Lucier, Brendan, Joel Oren, and Yaron Singer. “Influence at scale: Distributed computation of complex contagion in networks.” Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2015.

11. Wang, Chi, and Bailu Ding. “Fast approximation of empirical entropy via subsampling.” Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019.

12. Wang, Pinghui, et al. “A memory-efficient sketch method for estimating high similarities in streaming sets.” Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019.

13. Bahmani, Bahman, et al. “Pagerank on an evolving graph.” Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. 2012.

14. Dahiya, Yogesh, Dimitris Konomis, and David P. Woodruff. “An empirical evaluation of sketching for numerical linear algebra.” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.

15. Iyer, Anand Padmanabha, et al. “{ASAP}: Fast, approximate graph pattern mining at scale.” 13th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 18). 2018.

16. Li, Rundong, et al. “Approximately Counting Butterflies in Large Bipartite Graph Streams.” IEEE Transactions on Knowledge and Data Engineering (2021).

17. Ballard, Grey, et al. “Diamond sampling for approximate maximum all-pairs dot-product (MAD) search.” 2015 IEEE International Conference on Data Mining. IEEE, 2015.

18. Huang, Zengfeng. “Near optimal frequent directions for sketching dense and sparse matrices.” International Conference on Machine Learning. PMLR, 2018.

Introduction

The paper we aim to reproduce is named ‘PageRank on an Evolving Graph’.

Here is the full version of the original research paper.

PageRank is a widely used importance measure for the nodes of a graph, computed from the link structure of the web. The paper studies how best to compute the PageRank of an evolving graph. It proposes two primary types of algorithms, one randomized and one deterministic. The goal is to design a probing algorithm that maintains a good approximation of the true PageRank values of the underlying evolving graph. In this project, we only implement the deterministic methods.

The major challenge of this research problem is that traditional PageRank algorithms assume a stationary graph, whereas the web and social networks change over time, so the traditional algorithms cannot keep up with the changes. Recomputing the true PageRank of the network after every change is too time-consuming. Therefore, we need an algorithm that estimates the PageRank of an evolving graph by probing nodes periodically.

Method

For the datasets, we used three different networks for testing: AS, CAIDA, and a peer-to-peer file-sharing network. The AS dataset was collected by the University of Oregon Route Views Project and contains 733 daily instances spanning an interval of 785 days from November 1997 to January 2000. The CAIDA dataset contains 122 AS graphs from January 2004 to November 2007. The final dataset is a sequence of snapshots of the Gnutella peer-to-peer file-sharing network from 4 August 2002 to 9 August 2002.

We implement the two deterministic algorithms stated in the paper. The first is Round-Robin Probing, which probes nodes in a fixed order and serves as our baseline for comparison. The second is Priority Probing. The basic idea is to probe nodes with a frequency proportional to their PageRank, in a deterministic manner. In detail, every node has a value called its priority, which initially equals its PageRank value. In each time step, we probe the node with the highest priority. After the node is probed, its priority is set to zero and every other node's priority is increased by its PageRank. The intuition behind Priority Probing is that if a node with high PageRank changes, it has a larger impact on the PageRank of other nodes, so such nodes should be probed more frequently.

An initial graph needs to be created for each dataset. For the AS dataset, we selected the first five files (‘as19971108.txt’, ‘as19971109.txt’, ‘as19971110.txt’, ‘as19971111.txt’, ‘as19971112.txt’). For the CAIDA dataset, we selected the first file (‘as-caida20040105.txt’). For the peer-to-peer file-sharing network dataset, we selected the first file (‘p2p-Gnutella04.txt’). Next, we read the remaining files in each dataset line by line and check whether each edge already exists in the graph. If it is a new edge, we count it as a change, add it to the graph, and run the algorithms to obtain an estimate of the PageRank. Finally, we calculate the difference between the estimate and the actual PageRank after each change.
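
The error computation after each change is summarised by the following minimal sketch (the helper name pagerank_errors is our own; it mirrors the loop used in the appendix code):

import networkx as nx

def pagerank_errors(g, estimated):
    """Compare an estimated PageRank dict against the true PageRank of g.

    Returns (L1 error, L-infinity error); `estimated` is assumed to contain
    a value for every node currently in g.
    """
    true_pr = nx.pagerank(g)                      # exact PageRank of the current graph
    diffs = [abs(true_pr[v] - estimated[v]) for v in g]
    return sum(diffs), max(diffs)                 # L1 error and L∞ error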

For the Round-Robin Probing algorithm, since the original paper does not specify how the nodes should be ordered, we order them numerically. We probe one node at a time according to this ordering and recompute its PageRank estimate.
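
A minimal sketch of this baseline is shown below, assuming the node set is fixed and that g_robin and its estimate dictionary are set up as in the appendix; the helper name round_robin_probe and the order/pointer bookkeeping are our own choices, not prescribed by the paper.

# Fixed probing order for the Round-Robin baseline (numerical node order)
order = sorted(g_robin.nodes())
pointer = 0

def round_robin_probe(current_g, est_pg, alpha=0.15):
    """Probe the next node in the fixed order and refresh its estimate
    from its neighbours' current PageRank estimates."""
    global pointer
    x = order[pointer % len(order)]
    pointer += 1
    n = current_g.number_of_nodes()
    incoming = sum(est_pg[u] / current_g.degree(u) for u in current_g.neighbors(x))
    est_pg[x] = (1 - alpha) * incoming + alpha / n
    return est_pg

The error_list_robin and error_max_robin series plotted in the appendix are assumed to come from running this baseline through the same edge-insertion loop as the Priority version.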

For Priority Probing, we create a dictionary to store the priority of each node. For each change, we probe the node with the highest priority using the max function and recompute its PageRank estimate. After that, we set the priority of the probed node to zero and increase every other node's priority by its current PageRank estimate.
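
The following minimal sketch shows one Priority Probing step; it mirrors the evopr function in the appendix (the name priority_probe is our own). Note that the appendix code initialises all priorities to zero, whereas the paper's description starts them at the PageRank values; the difference mainly affects the first few probes.

import networkx as nx

def priority_probe(current_g, est_pg, priority, alpha=0.15):
    """One Priority Probing step: probe the highest-priority node, refresh its
    estimate from its neighbours, then update all priorities."""
    n = current_g.number_of_nodes()
    x = max(priority, key=priority.get)           # node with the highest priority
    incoming = sum(est_pg[u] / current_g.degree(u) for u in current_g.neighbors(x))
    est_pg[x] = (1 - alpha) * incoming + alpha / n
    for v in current_g:
        priority[v] += est_pg[v]                  # every node accumulates its estimate...
    priority[x] = 0                               # ...except the probed node, which resets
    return est_pg, priority

# Example usage on a small graph (stand-in for the initial AS/CAIDA/P2P graph):
g = nx.karate_club_graph()
est_pg = nx.pagerank(g)                           # estimates start from the true PageRank
priority = dict.fromkeys(g, 0)                    # priorities start at zero, as in the appendix
est_pg, priority = priority_probe(g, est_pg, priority)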

We use the PageRank values of the probed node's neighbours to estimate the PageRank of the probed node:

π(x) = (1 − α) · Σ_{u ∈ N(x)} π(u) / deg(u) + α / n,

where N(x) is the set of neighbours of x, deg(u) is the degree of u, n is the number of nodes, and α is the teleport probability (0.15 in our implementation). The paper further guarantees that, given the graph at time t, the expected PageRank of any node x at time t+1 can be bounded in terms of π_t(x), the PageRank value of node x at time t (see the original paper for the exact inequality).

In the original paper, there is also a parameter α that indicates the probing rate: after every K changes in the graph, the algorithm may perform αK probes. (This probing-rate α should not be confused with the teleport probability in the PageRank update above.) We also test the effect of the probing rate in this project.
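
The paper does not spell out how the αK probes should be scheduled, so we assume the simplest interpretation and call the probing routine α times after each observed change, as in this sketch (process_change is a hypothetical helper built on the priority_probe function above):

PROBING_RATE = 2   # α: number of probes performed per observed edge change

def process_change(current_g, est_pg, priority, probing_rate=PROBING_RATE):
    """Run `probing_rate` probes of the Priority algorithm after one edge change."""
    for _ in range(probing_rate):
        est_pg, priority = priority_probe(current_g, est_pg, priority)
    return est_pg, priority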

Result

Fig. 1.1
Fig. 1.2

Fig. 1.1 and Fig. 1.2 show the L1 and L∞ errors on the AS dataset, respectively. The L1 error of Priority Probing increases at the beginning and then remains steady around 0.125. The error of Priority Probing is lower than that of Round-Robin at the start. However, the L∞ error of Round-Robin drops significantly in the middle, and the L1 error of Round-Robin becomes lower than that of Priority Probing. We can also see a sharp drop for Round-Robin roughly every 3000 changes, which is about the number of nodes in this dataset; this may be because the probed node has a large number of neighbours.

Fig. 2.1
Fig. 2.2

Fig. 2.1 and Fig. 2.2 show the L1 and L∞ errors on the CAIDA dataset, respectively. The L1 error shows a similar pattern to the AS dataset: the Priority Probing error is lower than Round-Robin at the beginning, and then Round-Robin declines steeply in the middle.

Fig. 3.1
Fig. 3.2

Fig. 3.1 and Fig. 3.2 show the L1 and L∞ errors on the peer-to-peer file-sharing network dataset, respectively. We found that the results differ from the other two datasets: the error of Round-Robin is lower than that of Priority Probing most of the time.

On the AS and CAIDA datasets, Priority Probing achieved lower errors than Round-Robin Probing, which is expected, as the original research paper reports similar results. On the P2P network dataset, however, the results are quite different. The main reason is discussed in the limitations section.

Fig. 4.1
Fig. 4.2

Here we used a probing rate (α) equal to 2 and found that it produced results similar to the original paper. As shown in Fig. 4.1 and Fig. 4.2, the L1 and L∞ errors are lower than with a probing rate of 1 for both methods. This is easy to understand: a higher probing rate increases the number of estimates per change, which gives a more accurate estimate. However, the computational cost also increases.

Fig. 5.1
Fig. 5.2

Here we try to allow node insertions. Our method is to compute the true PageRank and assign it to the new node. From Fig. 5.1 and Fig. 5.2, we found that the errors increased in both algorithms compared to having no node insertions; in Priority Probing especially, the error almost doubled.

In general, for the AS dataset, the error curve increases overall with a decreasing gradient, which fits our expectation and is similar to the original research paper. The key difference is that our results have relatively higher errors and a different rate of decrease in the gradient for both Priority Probing and Round-Robin Probing, so there are clearly some limitations in our reproduction of the algorithm.

Conclusion

Limitations

There are several limitations. First, the paper states that an initial graph is used to bootstrap the process but does not specify how to obtain it, so we used the first few files of each dataset to build the initial graph. Second, the definition of a changing edge is not stated in the paper; we simply count each new edge as an insertion and do not handle edge deletions. Third, the insertion of a new node is not addressed in the paper, and it is unclear how to handle this situation: if we set the new node's priority and PageRank to zero, it will never be probed by Priority Probing. We tried assigning the true PageRank to the new node, but the error increased significantly.

Tackling Method

Due to the lack of information, the only description of Priority Probing we could find is in the research paper itself; there is little additional information available online. To deal with this limitation, we simplified the implementation by using a fixed set of nodes, so any new node detected in a dataset file is ignored. The graph still evolves, however, because the edges keep changing. This may be one reason why our results are not exactly the same as the original ones.

Appendix

import networkx as nx
import numpy as np
import scipy as sp
import os
import pandas as pd
import matplotlib.pyplot as plt

# -----------------------------

# Initialize an empty graph
g = nx.Graph()

# List of files containing the graph data
file_list = ['as19971108.txt', 'as19971109.txt', 'as19971110.txt', 'as19971111.txt', 'as19971112.txt']

# Read the graph data from each file and add edges to the graph
for i in file_list:
    f = open(i)
    # Skip the first four lines (header)
    for _ in range(4):
        f.readline()

    # Read each line and add an edge to the graph
    while True:
        line = f.readline()
        if not line:
            break
        line = line.strip().split("\t")
        a = int(line[0])
        b = int(line[1])
        g.add_edge(a, b)
    f.close()

# -----------------------------

# Print the number of nodes in the graph
print(len(list(g)))

# -----------------------------

# Create a copy of the graph for the Round-Robin approach
g_robin = g.copy()

# -----------------------------

# Function to update PageRank values based on edge probing
def evopr(current_g, new_pg):
    alpha = 0.15  # Teleport (restart) probability; the damping factor is 1 - alpha = 0.85
    V = list(current_g)  # List of nodes in the graph
    n = len(V)  # Number of nodes
    max_key = max(priority, key=priority.get)  # Node with the highest priority

    # Calculate the new PageRank value for the probed node
    temp = 0
    for u in current_g.neighbors(max_key):
        temp += new_pg[u] / len(list(current_g.neighbors(u)))
    new_pg[max_key] = (1 - alpha) * temp + alpha / n

    # Update priorities based on the new PageRank values
    for i in V:
        priority[i] = priority[i] + new_pg[i]
    priority[max_key] = 0  # Reset the priority of the probed node

    return new_pg

# -----------------------------

# Compute the initial PageRank values
pg = nx.pagerank(g)
global priority 
priority = dict.fromkeys(list(g), 0)  # Initialize priorities for all nodes
current_pg = pg.copy()  # Copy of the current PageRank values

# -----------------------------

# Variables to track the evolution of the graph and errors
count = 0
path = "C:/Users/user/python anaconda/sdsc3001/AS test"
os.chdir(path)
error_list_priority = []  # List to store errors for the Priority approach
error_max_priority = []   # List to store maximum errors for the Priority approach

# Process each file in the directory
for file in os.listdir():
    if file.endswith(".txt"):
        file_path = f"{path}\\{file}"
        with open(file_path, 'r') as f:
            # Skip the first four lines (header)
            for _ in range(4):
                f.readline()

            # Read each line and add new edges to the graph
            while True:
                line = f.readline()
                if not line:
                    break
                line = line.strip().split("\t")
                a = int(line[0])
                b = int(line[1])
                # Only add the edge if both nodes are already in the graph
                if g.has_node(a) and g.has_node(b):
                    if not g.has_edge(a, b):
                        count += 1
                        g.add_edge(a, b)
                        # Update PageRank values using the Priority approach
                        current_pg = evopr(g, current_pg)
                        
                        # Compute the real PageRank values for comparison
                        real = nx.pagerank(g)
                        error = []
                        # Calculate the error between the real and estimated PageRank values
                        for i in list(g):
                            error.append(abs(real[i] - current_pg[i]))
                        error_sum = sum(error)
                        error_max = max(error)
                        # Store the errors for analysis
                        error_list_priority.append(error_sum)
                        error_max_priority.append(error_max)
            f.close()
            print(count)

# -----------------------------

# Plot the L1 error for both approaches
# (error_list_robin comes from an analogous Round-Robin run, not shown here)
df1 = pd.DataFrame(error_list_priority, columns = ['priority_error'])
df2 = pd.DataFrame(error_list_robin, columns = ['robin_error'])
plt.rcParams["figure.autolayout"] = True
ax = df1.plot()
df2.plot(ax=ax)
ax.set_ylabel("L1 error")
ax.set_title('AS Dataset')
plt.show()

# -----------------------------

# Plot the L∞ error for both approaches
# (error_max_robin comes from an analogous Round-Robin run, not shown here)
df1 = pd.DataFrame(error_max_priority, columns = ['priority_error'])
df2 = pd.DataFrame(error_max_robin, columns = ['robin_error'])
plt.rcParams["figure.autolayout"] = True
ax = df1.plot()
df2.plot(ax=ax)
ax.set_ylabel("L∞ error")
ax.set_title('AS Dataset')
plt.show()

# -----------------------------

This code simulates the evolution of a graph representing an Autonomous System (AS) network and calculates the PageRank values as the graph evolves. It uses a priority-based approach to decide which nodes to probe and updates the PageRank estimates accordingly. The errors between the estimated and real PageRank values are tracked and plotted for analysis.

import networkx as nx
import numpy as np
import scipy as sp
import os
import pandas as pd
import matplotlib.pyplot as plt

# -----------------------------

# Initialize an empty graph
g = nx.Graph()

# Open the file containing the CAIDA dataset
f = open('as-caida20040105.txt')

# Skip the first eight lines (header and comments)
for _ in range(8):
    f.readline()

# Read each line and add an edge to the graph
while True:
    line = f.readline()
    if not line:
        break
    line = line.strip().split("\t")
    a = int(line[0])
    b = int(line[1])
    g.add_edge(a, b)
f.close()

# -----------------------------

# Print the number of nodes in the graph
print(len(list(g)))

# -----------------------------

# Create a copy of the graph for the Round-Robin approach
g_robin = g.copy()

# -----------------------------

# Function to update PageRank values based on edge probing
def evopr(current_g, new_pg):
    alpha = 0.15  # Teleport (restart) probability; the damping factor is 1 - alpha = 0.85
    V = list(current_g)  # List of nodes in the graph
    n = len(V)  # Number of nodes
    max_key = max(priority, key=priority.get)  # Node with the highest priority

    # Calculate the new PageRank value for the probed node
    temp = 0
    for u in current_g.neighbors(max_key):
        temp += new_pg[u] / len(list(current_g.neighbors(u)))
    new_pg[max_key] = (1 - alpha) * temp + alpha / n

    # Update priorities based on the new PageRank values
    for i in V:
        priority[i] = priority[i] + new_pg[i]
    priority[max_key] = 0  # Reset the priority of the probed node

    return new_pg

# -----------------------------

# Compute the initial PageRank values
pg = nx.pagerank(g)
global priority 
priority = dict.fromkeys(list(g), 0)  # Initialize priorities for all nodes
current_pg = pg.copy()  # Copy of the current PageRank values

# -----------------------------

# Variables to track the evolution of the graph and errors
count = 0
path = "C:/Users/user/python anaconda/sdsc3001/as-caida.tar"
os.chdir(path)
error_list_priority = []  # List to store errors for the Priority approach
error_max_priority = []   # List to store maximum errors for the Priority approach

# Process each file in the directory
for file in os.listdir():
    if file.endswith(".txt"):
        file_path = f"{path}\\{file}"
        with open(file_path, 'r') as f:
            # Skip the first eight lines (header and comments)
            for _ in range(8):
                f.readline()

            # Read each line and add new edges to the graph
            while True:
                line = f.readline()
                if not line:
                    break
                line = line.strip().split("\t")
                a = int(line[0])
                b = int(line[1])
                # Only add the edge if both nodes are already in the graph
                if g.has_node(a) and g.has_node(b):
                    if not g.has_edge(a, b):
                        count += 1
                        g.add_edge(a, b)
                        # Update PageRank values using the Priority approach
                        current_pg = evopr(g, current_pg)
                        
                        # Compute the real PageRank values for comparison
                        real = nx.pagerank(g)
                        error = []
                        # Calculate the error between the real and estimated PageRank values
                        for i in list(g):
                            error.append(abs(real[i] - current_pg[i]))
                        error_sum = sum(error)
                        error_max = max(error)
                        # Store the errors for analysis
                        error_list_priority.append(error_sum)
                        error_max_priority.append(error_max)
            f.close()
            print(count)

# -----------------------------

# Plot the L1 error for both approaches
# (error_list_robin comes from an analogous Round-Robin run, not shown here)
df1 = pd.DataFrame(error_list_priority, columns = ['priority_error'])
df2 = pd.DataFrame(error_list_robin, columns = ['robin_error'])
plt.rcParams["figure.autolayout"] = True
ax = df1.plot()
df2.plot(ax=ax)
ax.set_ylabel("L1 error")
ax.set_title('CAIDA Dataset')
plt.show()

# -----------------------------

# Plot the L∞ error for both approaches
# (error_max_robin comes from an analogous Round-Robin run, not shown here)
df1 = pd.DataFrame(error_max_priority, columns = ['priority_error'])
df2 = pd.DataFrame(error_max_robin, columns = ['robin_error'])
plt.rcParams["figure.autolayout"] = True
ax = df1.plot()
df2.plot(ax=ax)
ax.set_ylabel("L∞ error")
ax.set_title('CAIDA Dataset')
plt.show()

# -----------------------------

This code is similar to the previous one but uses the CAIDA dataset for the graph. It follows the same approach of simulating the evolution of the graph and calculating the PageRank values as new edges are added.

import networkx as nx
import numpy as np
import scipy as sp
import os
import pandas as pd
import matplotlib.pyplot as plt

# -----------------------------

# Initialize an empty graph
g = nx.Graph()

# Open the file containing the p2p-Gnutella04 dataset
f = open("p2p-Gnutella04.txt")

# Skip the first four lines (header)
for _ in range(4):
    f.readline()

# Read each line and add an edge to the graph
while True:
    line = f.readline()
    if not line:
        break
    line = line.strip().split("\t")
    a = int(line[0])
    b = int(line[1])
    g.add_edge(a, b)
f.close()

# -----------------------------

# Print the number of nodes in the graph
print(len(list(g)))

# -----------------------------

# Create a copy of the graph for the Round-Robin approach
g_robin = g.copy()

# -----------------------------

# Function to update PageRank values based on edge probing
def evopr(current_g, new_pg):
    alpha = 0.15  # Teleport (restart) probability; the damping factor is 1 - alpha = 0.85
    V = list(current_g)  # List of nodes in the graph
    n = len(V)  # Number of nodes
    max_key = max(priority, key=priority.get)  # Node with the highest priority

    # Calculate the new PageRank value for the probed node
    temp = 0
    for u in current_g.neighbors(max_key):
        temp += new_pg[u] / len(list(current_g.neighbors(u)))
    new_pg[max_key] = (1 - alpha) * temp + alpha / n

    # Update priorities based on the new PageRank values
    for i in V:
        priority[i] = priority[i] + new_pg[i]
    priority[max_key] = 0  # Reset the priority of the probed node

    return new_pg

# -----------------------------

# Compute the initial PageRank values
pg = nx.pagerank(g)
global priority 
priority = dict.fromkeys(list(g), 0)  # Initialize priorities for all nodes
current_pg = pg.copy()  # Copy of the current PageRank values

# -----------------------------

# Variables to track the evolution of the graph and errors
count = 0
path = 'C:/Users/user/python anaconda/sdsc3001/p2p test'
os.chdir(path)
error_list_priority = []  # List to store errors for the Priority approach
error_max_priority = []   # List to store maximum errors for the Priority approach

# Process each file in the directory
for file in os.listdir():
    if file.endswith(".txt"):
        file_path = f"{path}\\{file}"
        with open(file_path, 'r') as f:
            # Skip the first four lines (header)
            for _ in range(4):
                f.readline()

            # Read each line and add new edges to the graph
            while True:
                line = f.readline()
                if not line:
                    break
                line = line.strip().split("\t")
                a = int(line[0])
                b = int(line[1])
                # Only add the edge if both nodes are already in the graph
                if g.has_node(a) and g.has_node(b):
                    if not g.has_edge(a, b):
                        count += 1
                        g.add_edge(a, b)
                        # Update PageRank values using the Priority approach
                        current_pg = evopr(g, current_pg)
                        
                        # Compute the real PageRank values for comparison
                        real = nx.pagerank(g)
                        error = []
                        # Calculate the error between the real and estimated PageRank values
                        for i in list(g):
                            error.append(abs(real[i] - current_pg[i]))
                        error_sum = sum(error)
                        error_max = max(error)
                        # Store the errors for analysis
                        error_list_priority.append(error_sum)
                        error_max_priority.append(error_max)
            f.close()
            print(count)

# -----------------------------

# Plot the L1 error for both approaches
# (error_list_robin comes from an analogous Round-Robin run, not shown here)
df1 = pd.DataFrame(error_list_priority, columns = ['priority_error'])
df2 = pd.DataFrame(error_list_robin, columns = ['robin_error'])
plt.rcParams["figure.autolayout"] = True
ax = df1.plot()
df2.plot(ax=ax)
ax.set_ylabel("L1 error")
ax.set_title('P2P Dataset')
plt.show()

# -----------------------------

# Plot the L∞ error for both approaches
# (error_max_robin comes from an analogous Round-Robin run, not shown here)
df1 = pd.DataFrame(error_max_priority, columns = ['priority_error'])
df2 = pd.DataFrame(error_max_robin, columns = ['robin_error'])
plt.rcParams["figure.autolayout"] = True
ax = df1.plot()
df2.plot(ax=ax)
ax.set_ylabel("L∞ error")
ax.set_title('P2P Dataset')
plt.show()

# -----------------------------

This code simulates the evolution of a peer-to-peer (P2P) network graph and calculates the PageRank values as the graph evolves. It uses a similar approach to the previous codes for AS and CAIDA datasets.

import networkx as nx
import numpy as np
import scipy as sp
import os
import pandas as pd
import matplotlib.pyplot as plt

# -----------------------------

# Initialize an empty graph
g = nx.Graph()

# Open the file containing the AS dataset
f = open('as19971108.txt')

# Skip the first four lines (header)
for _ in range(4):
    f.readline()

# Read each line and add an edge to the graph
while True:
    line = f.readline()
    if not line:
        break
    line = line.strip().split("\t")
    a = int(line[0])
    b = int(line[1])
    g.add_edge(a, b)
f.close()

# -----------------------------

# Print the number of nodes in the graph
print(len(list(g)))

# -----------------------------

# Create a copy of the graph for the Round-Robin approach
g_robin = g.copy()

# -----------------------------

# Function to update PageRank values based on edge probing
def evopr(current_g, new_pg):
    alpha = 0.15  # Teleport (restart) probability; the damping factor is 1 - alpha = 0.85
    V = list(current_g)  # List of nodes in the graph
    n = len(V)  # Number of nodes
    max_key = max(priority, key=priority.get)  # Node with the highest priority

    # Calculate the new PageRank value for the probed node
    temp = 0
    for u in current_g.neighbors(max_key):
        temp += new_pg[u] / len(list(current_g.neighbors(u)))
    new_pg[max_key] = (1 - alpha) * temp + alpha / n

    # Update priorities based on the new PageRank values
    for i in V:
        priority[i] = priority[i] + new_pg[i]
    priority[max_key] = 0  # Reset the priority of the probed node

    return new_pg

# -----------------------------

# Compute the initial PageRank values
pg = nx.pagerank(g)
global priority 
priority = dict.fromkeys(list(g), 0)  # Initialize priorities for all nodes
current_pg = pg.copy()  # Copy of the current PageRank values

# -----------------------------

# Variables to track the evolution of the graph and errors
count = 0
path = 'C:/Users/user/python anaconda/sdsc3001/AS test'
os.chdir(path)
error_list_priority = []  # List to store errors for the Priority approach
error_max_priority = []   # List to store maximum errors for the Priority approach

# Process each file in the directory
for file in os.listdir():
    if file.endswith(".txt"):
        file_path = f"{path}\\{file}"
        with open(file_path, 'r') as f:
            # Skip the first four lines (header)
            for _ in range(4):
                f.readline()

            # Read each line and add new edges to the graph
            while True:
                line = f.readline()
                if not line:
                    break
                line = line.strip().split("\t")
                a = int(line[0])
                b = int(line[1])
                # Only add the edge if both nodes are already in the graph
                if g.has_node(a) and g.has_node(b):
                    if not g.has_edge(a, b):
                        count += 1
                        g.add_edge(a, b)
                        # Update PageRank values using the Priority approach
                        current_pg = evopr(g, current_pg)
                        
                        # Compute the real PageRank values for comparison
                        real = nx.pagerank(g)
                        error = []
                        # Calculate the error between the real and estimated PageRank values
                        for i in list(g):
                            error.append(abs(real[i] - current_pg[i]))
                        error_sum = sum(error)
                        error_max = max(error)
                        # Store the errors for analysis
                        error_list_priority.append(error_sum)
                        error_max_priority.append(error_max)
            f.close()
            print(count)

# -----------------------------

# Plot the L1 error for both approaches
# (error_list_robin comes from an analogous Round-Robin run, not shown here)
df1 = pd.DataFrame(error_list_priority, columns = ['priority_error'])
df2 = pd.DataFrame(error_list_robin, columns = ['robin_error'])
plt.rcParams["figure.autolayout"] = True
ax = df1.plot()
df2.plot(ax=ax)
ax.set_ylabel("L1 error")
ax.set_title('AS Dataset')
plt.show()

# -----------------------------

# Plot the L∞ error for both approaches
# (error_max_robin comes from an analogous Round-Robin run, not shown here)
df1 = pd.DataFrame(error_max_priority, columns = ['priority_error'])
df2 = pd.DataFrame(error_max_robin, columns = ['robin_error'])
plt.rcParams["figure.autolayout"] = True
ax = df1.plot()
df2.plot(ax=ax)
ax.set_ylabel("L∞ error")
ax.set_title('AS Dataset')
plt.show()

# -----------------------------

This code is similar to the previous ones, focusing on simulating the evolution of an Autonomous System (AS) graph and calculating the PageRank values as the graph evolves.

import networkx as nx
import numpy as np
import scipy as sp
import os
import pandas as pd
import matplotlib.pyplot as plt

# -----------------------------

# Initialize an empty graph
g = nx.Graph()

# Open the file containing the AS dataset
f = open('as19971108.txt')

# Skip the first four lines (header)
for _ in range(4):
    f.readline()

# Read each line and add an edge to the graph
while True:
    line = f.readline()
    if not line:
        break
    line = line.strip().split("\t")
    a = int(line[0])
    b = int(line[1])
    g.add_edge(a, b)
f.close()

# -----------------------------

# Print the number of nodes in the graph
print(len(list(g)))

# -----------------------------

# Create a copy of the graph for the Round-Robin approach
g_robin = g.copy()

# -----------------------------

# Function to update PageRank values based on edge probing
def evopr(current_g, new_pg):
    alpha = 0.15  # Teleport (restart) probability; the damping factor is 1 - alpha = 0.85
    V = list(current_g)  # List of nodes in the graph
    n = len(V)  # Number of nodes
    max_key = max(priority, key=priority.get)  # Node with the highest priority

    # Calculate the new PageRank value for the probed node
    temp = 0
    for u in current_g.neighbors(max_key):
        temp += new_pg[u] / len(list(current_g.neighbors(u)))
    new_pg[max_key] = (1 - alpha) * temp + alpha / n

    # Update priorities based on the new PageRank values
    for i in V:
        priority[i] = priority[i] + new_pg[i]
    priority[max_key] = 0  # Reset the priority of the probed node

    return new_pg

# -----------------------------

# Compute the initial PageRank values
pg = nx.pagerank(g)
global priority 
priority = dict.fromkeys(list(g), 0)  # Initialize priorities for all nodes
current_pg = pg.copy()  # Copy of the current PageRank values

# -----------------------------

# Variables to track the evolution of the graph and errors
count = 0
path = 'C:/Users/user/python anaconda/sdsc3001/AS test newnodes'
os.chdir(path)
error_list_priority = []  # List to store errors for the Priority approach
error_max_priority = []   # List to store maximum errors for the Priority approach

# Process each file in the directory
for file in os.listdir():
    if file.endswith(".txt"):
        file_path = f"{path}\\{file}"
        with open(file_path, 'r') as f:
            # Skip the first four lines (header)
            for _ in range(4):
                f.readline()

            # Read each line and add new edges to the graph
            while True:
                line = f.readline()
                if not line:
                    break
                line = line.strip().split("\t")
                a = int(line[0])
                b = int(line[1])
                flag_a = 0
                flag_b = 0
                # Check if the nodes are new and add them to the graph
                if not g.has_node(a):
                    g.add_node(a)
                    flag_a = 1
                if not g.has_node(b):
                    g.add_node(b)
                    flag_b = 1
                # Add the edge to the graph
                if not g.has_edge(a, b):
                    count += 1
                    g.add_edge(a, b)
                    # Compute the real PageRank values of the updated graph
                    real = nx.pagerank(g)
                    # Give a newly inserted node its true PageRank so that the
                    # estimate and priority dictionaries stay complete
                    if flag_a:
                        current_pg[a] = real[a]
                        priority[a] = real[a]
                    if flag_b:
                        current_pg[b] = real[b]
                        priority[b] = real[b]
                    # Update PageRank values using the Priority approach
                    current_pg = evopr(g, current_pg)

                    error = []
                    # Calculate the error between the real and estimated PageRank values
                    for i in list(g):
                        error.append(abs(real[i] - current_pg[i]))
                    error_sum = sum(error)
                    error_max = max(error)
                    # Store the errors for analysis
                    error_list_priority.append(error_sum)
                    error_max_priority.append(error_max)
            f.close()
            print(count)

# -----------------------------

# Plot the L1 error for both approaches
# (error_list_robin comes from an analogous Round-Robin run, not shown here)
df1 = pd.DataFrame(error_list_priority, columns = ['priority_error'])
df2 = pd.DataFrame(error_list_robin, columns = ['robin_error'])
plt.rcParams["figure.autolayout"] = True
ax = df1.plot()
df2.plot(ax=ax)
ax.set_ylabel("L1 error")
ax.set_title('AS Dataset with New Nodes')
plt.show()

# -----------------------------

# Plot the L∞ error for both approaches
# (error_max_robin comes from an analogous Round-Robin run, not shown here)
df1 = pd.DataFrame(error_max_priority, columns = ['priority_error'])
df2 = pd.DataFrame(error_max_robin, columns = ['robin_error'])
plt.rcParams["figure.autolayout"] = True
ax = df1.plot()
df2.plot(ax=ax)
ax.set_ylabel("L∞ error")
ax.set_title('AS Dataset with New Nodes')
plt.show()

# -----------------------------

This code is similar to the previous ones, but it specifically focuses on simulating the evolution of an Autonomous System (AS) graph with the addition of new nodes. The PageRank values are calculated as the graph evolves, and the errors between the estimated and real PageRank values are tracked and plotted for analysis.
