CS224W Review

Pictures from: cs224w.stanford.edu

Not important: Lectures 1, 5, 9, 12, and 14.

Important: Lectures 2, 4, 6, 7, 8, 10, 11, and 13

Topic list (not exhaustive):

  • Centrality measures & PageRank
  • ‼️GNN model & design space (message passing: aggregate and combine) - core to building GNNs
  • Knowledge graph embeddings, Query2Box, recommender systems (LightGCN), generative models

Lec1: Intro

Graph Intro

  • images: grids; NLP/text: sequences; graphs are a more general structure
  • Representation learning:
    • automatically learn features of the graph
    • similar nodes in the graph are embedded close together
  • Course Outline (see the course-outline slide)
  • Graph ML tasks
    • node classification
    • link prediction
    • Graph classification
    • Clustering
  • Edge-level task:
    • recommender system
      • node: user, items
      • edge: user-item interactions
    • PinSage:
      • Nodes: pins and boards; edges: pin-to-board assignments
    • Drug Side Effect: given a pair of drugs predict adverse side effects
      • Node: drugs and proteins
      • Edges: Interactions (different type edge, different side effect)
  • Graph level task:
    • Drug discovery: node-atoms; edges: chemical bonds
    • physical simulation
  • Graph Types:
    • directed & undirected graphs
      • directed graph node average degree: The (total) degree of a node is the sum of in- and out-degrees.
    • heterogeneous graphs: different types of node and edge
    • Bipartite graph:
      • Folded/projected networks: project the bipartite graph onto one of its two node sets
    • weighted / un-weighted
    • self-edges
    • Multigraph: multi-edges between two nodes
  • Adjacency matrix / adjacency list
  • Connectivity
    • connected graph
      • disconnected graph: the adjacency matrix is block-diagonal
    • Strongly connected directed graph
      • for every pair of nodes A, B there is a path from A to B and from B to A
    • Weakly connected directed graph

Lec2: Traditional ML

Traditional Methods for Graph ML

  • Traditional ML pipeline uses hand-designed features

  • node level:

node level features

  • Node degree

  • Node centrality: takes node importance in a graph into account

    • eigenvector centrality
      • a node is important if it is surrounded by important nodes: c_v = (1/λ) Σ_{u∈N(v)} c_u, i.e. λc = Ac
      • the centrality vector c is an eigenvector of the adjacency matrix A; use the largest eigenvalue λ_max and its corresponding eigenvector (positive by Perron-Frobenius)
    • Betweenness centrality
      • A node is important if it lies on many shortest paths between other nodes.
      • c_v = Σ_{s≠v≠t} #(shortest paths between s and t through v) / #(shortest paths between s and t)
      • at least 3 hop path ???
    • Closeness centrality: a node is important if it is close to all other nodes, c_v = 1 / Σ_{u≠v} (shortest-path distance between u and v)
  • Clustering coefficient

    • counts #(triangles) in the ego-network: e_v = #(edges among the neighbors of v) / C(k_v, 2), where k_v is the degree of v
  • Graphlets:

    • Rooted connected induced non-isomorphic subgraphs
    • small subgraphs that describe the structure of a node's neighborhood
    • GDV: Graphlet Degree Vector
      • counts #(graphlets) a node touches, for each graphlet position
    • induced subgraph: takes a subset of the nodes and all of the edges among them
    • isomorphism: same number of nodes, with a bijective node mapping that preserves edges
  • Importance-based features: useful for predicting influential nodes in a graph

  • Structure-based features: predict role a node plays

  • key: design features for a pair of nodes

  • link prediction task:

    • links missing at random
    • links over time
  • link prediction methodology:

    • for each pair (x,y), compute score
  • Link-level features

    • distance-based feature
      • shortest-path distance between two nodes
    • local neighborhood overlap
      • common neighbors |N(v1) ∩ N(v2)|, Jaccard coefficient |N(v1) ∩ N(v2)| / |N(v1) ∪ N(v2)|, Adamic-Adar Σ_{u ∈ N(v1) ∩ N(v2)} 1/log(k_u)
      • limitation: the metric is always zero if the two nodes have no neighbors in common
    • global neighborhood overlap
      • count walks of all lengths, not just 2-hop paths
      • (A^l)_{uv} = #walks of length l between u and v (powers of the adjacency matrix)
      • Katz index: number of walks of all lengths between a given pair of nodes, S_{v1v2} = Σ_{l=1}^{∞} β^l (A^l)_{v1v2}
      • closed form: S = (I − βA)^{-1} − I gives the score for every node pair (see the numpy sketch below)
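
A tiny numpy sketch of the Katz closed form above (my own toy example, not from the slides); the adjacency matrix is made up, and β must satisfy β < 1/λ_max(A) for the series to converge:

```python
# Hedged sketch: closed-form Katz index on a toy graph, assuming beta < 1/lambda_max(A).
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)  # toy undirected adjacency matrix
beta = 0.1                                  # discount factor, must be < 1/lambda_max

# S = sum_{l>=1} beta^l A^l = (I - beta*A)^{-1} - I
S = np.linalg.inv(np.eye(len(A)) - beta * A) - np.eye(len(A))
print(S[0, 3])  # Katz score for node pair (0, 3): weighted count of walks of all lengths
```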

Graph-level Features

  • kernel methods

    • design kernels instead of feature vectors
    • the kernel matrix K must be symmetric and positive semidefinite
    • K(G, G′) = φ(G)ᵀφ(G′), where φ(·) is a feature representation (a graph feature vector)
    • key idea: Bag-of-Words for a graph (count substructures instead of words)
  • Graphlet Kernel

    • Graphlet (different definition here): does not need to be connected and is not rooted

    • count graphlets and normalize by the total count, h_G = f_G / Σ f_G, to account for different graph sizes

  • Weisfeiler-Lehman Kernel

    • use neighborhood structure to iteratively enrich node vocabulary
    • Color Refinement
      • new color of a node = HASH(its current color, the multiset of its neighbors' colors)
      • after K rounds, the kernel compares the color-count histograms of the two graphs
      • total time complexity is linear in #(edges); a small sketch of color refinement follows below
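
A rough sketch of color refinement, assuming a plain dict-of-neighbors graph and using Python tuple identity as a stand-in for the HASH function (toy graph made up):

```python
# Hedged sketch of Weisfeiler-Lehman color refinement on a toy graph.
def wl_colors(adj, num_iters=3):
    """adj: dict node -> list of neighbors. Returns the final color id per node."""
    colors = {v: 0 for v in adj}                      # initial color: the same for every node
    for _ in range(num_iters):
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))  # (own color, neighbor colors)
            for v in adj
        }
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        colors = {v: palette[signatures[v]] for v in adj}             # "hash" signature -> new color
    return colors

g1 = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}     # triangle with a pendant node
print(wl_colors(g1))
# The WL kernel compares two graphs via the histograms of these colors across iterations.
```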

Lec3: Representation Learning

Node Embeddings

  • Graph Representation Learning
    • task-independent feature learning for machine learning with graphs

Encoder and Decoder

  • Goal: similarity in the embedding space (e.g., dot product) approximates similarity in the graph

  • Three important things need to define:

    • similarity
      • a measure of similarity in the original network
    • Encoder
      • nodes to embeddings
    • Decoder
      • map from embeddings to the similarity score;
      • Decoder based on node similarity
      • maximize z_uᵀ z_v for node pairs (u, v) that are similar
        • how do we define "similar"?
  • shallow encoding

    • ENC(v) = Z · v, where Z is the embedding matrix to optimize and v is an indicator vector (all zeros except a one in the column indicating node v)
    • the encoder is just an embedding lookup
    • Shallow: each node is assigned a unique embedding vector
      • directly optimize the embedding of each node
    • Methods to learn the embedding matrix: DeepWalk, node2vec (the alternative is deep encoders / GNNs, covered later)
  • key: how to define node similarity (e.g., by random walk co-occurrence)

  • Unsupervised/self-supervised learning embeddings

Random Walk Approches for node Embeddings

  • Random Walk => define node similarity (Decoder)

  • Probability P(v | z_u): predicted probability of visiting node v on random walks starting from node u

    • z_u is the embedding of node u - this is what we learn
    • Non-linear functions are used to produce the predicted probabilities:
      • Softmax: turn K real values into K probabilities that sum to 1
      • Sigmoid
  • Random walk:

    • It serves as the definition of similarity:

    • If u and v co-occur on a random walk, they are considered similar (captures high-order, multi-hop information)

      • incorporates both local and higher-order neighborhood information
    • similarity(u, v) ≈ the probability that u and v co-appear on a random walk

    • steps:

      • run random walks to estimate the co-occurrence probabilities
      • optimize the embeddings to encode these random walk statistics
  • Unsupervised Feature Learning

    • Learn node embedding such that nearby nodes are close together in the network
      • nearby: random walk co-occurrence
      • nearby nodes: N_R(u), the neighborhood of u obtained by some random walk strategy R
    • As Optimization - Random Walk Optimization
      • objective: maximize Σ_u log P(N_R(u) | z_u)
      • Steps:
        • run short fixed-length random walks from each node u using strategy R
        • collect N_R(u), the multiset of nodes visited on random walks starting from u
        • optimize the embeddings so that, given z_u, we can predict its neighbors N_R(u)
  • Optimization target of random walk

    • Loss: L = Σ_u Σ_{v ∈ N_R(u)} −log P(v | z_u)
    • Parameterize with a softmax: P(v | z_u) = exp(z_uᵀ z_v) / Σ_{n ∈ V} exp(z_uᵀ z_n)
    • the softmax denominator over all nodes is too expensive to compute for every term
    • Negative sampling to approximate it:
      • log( exp(z_uᵀ z_v) / Σ_n exp(z_uᵀ z_n) ) ≈ log σ(z_uᵀ z_v) − Σ_{i=1}^{k} log σ(z_uᵀ z_{n_i}), with k random negative samples n_i drawn proportionally to degree
      • Stochastic Gradient Descent: evaluate the gradient on individual training examples instead of the full dataset
  • How to random walk:

    • Deepwalk: fixed-length, unbiased
    • node2vec: biased 2nd order random walk
  • Node2vec: Biased Random walk

    • trade off between local and global views
    • BFS, DFS
    • return parameter p: probability weight of returning to the node the walk just came from
    • in-out parameter q: controls the ratio of BFS-like vs. DFS-like moves
      • from the current node, the previous node gets weight 1/p, a neighbor at distance 1 from the previous node gets weight 1, and a neighbor at distance 2 gets weight 1/q
    • BFS-like (local, microscopic view): low p; DFS-like (global, macroscopic view): low q - see the sampling sketch below
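
A minimal sketch of one biased second-order walk step on a made-up toy graph; p and q are the return and in-out parameters from above:

```python
# Hedged sketch of one node2vec-style biased (2nd-order) random-walk step.
import random

def next_node(adj, prev, curr, p, q):
    """Sample the next node of the walk given the previous and current node."""
    neighbors = list(adj[curr])
    weights = []
    for x in neighbors:
        if x == prev:                 # go back to where we came from
            weights.append(1.0 / p)
        elif x in adj[prev]:          # distance 1 from prev: BFS-like move
            weights.append(1.0)
        else:                         # distance 2 from prev: DFS-like move
            weights.append(1.0 / q)
    return random.choices(neighbors, weights=weights, k=1)[0]

adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
print(next_node(adj, prev=0, curr=1, p=0.25, q=4.0))  # low p -> more returns; low q -> more exploration
```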

Embedding Entire Graphs

  • Approach 1: sum (or average) the node embeddings

  • Approach 2: create a virtual node, connect it to every node in the (sub)graph of interest, and use that virtual node's embedding

  • Approach 3: Anonymous Walks

    • image-20211113165354826
    • States in anonymous walks correspond to the index of the first time we visited the node
    • Represent the graph as a probability distribution over these walks.
  • Approach 3: Anonymous Walk Embeddings

    • Idea 1: Sample the anon. walks and represent the graph as fraction of times each anon walk occurs.

    • Idea 2: Learn graph embedding together with anonymous walk embeddings.

Lec4: Graph as Matrix

Outline:

  • Determine node importance via random walk (PageRank)
    • stochastic matrix M
  • Obtain node embeddings via matrix factorization (MF)
  • View other node embeddings (e.g. Node2Vec) as MF

PageRank

  • flow model:

    • If page i with importance r_i has d_i out-links, each out-link gets r_i / d_i votes
  • Stochastic adjacency matrix M (column stochastic)

    • ‼️ M_ji = 1/d_i if i → j (note the index order: ji, not ij)
    • flow equation: r=Mr
    • stationary distribution
      • rank vector is an eigenvector of the column stochastic matrix (with eigenvalue 1)
  • Solve PageRank

    • Power iteration: initialize r^(0) = [1/N, …, 1/N]ᵀ and iterate r^(t+1) = M r^(t) until |r^(t+1) − r^(t)|_1 < ε

    • Use last iteration value?

    • problems:

      • Dead ends: pages with no out-links leak importance out
        • Solution: from a dead end, teleport with probability 1.0 to a random page
      • Spider traps: groups with no outgoing links absorb all importance
        • Solution: at every step, teleport with probability 1 − β
  • PageRank

    • at each step:
      • with prob. β, follow an out-link at random
      • with prob. 1 − β, jump to a random page
      • r_j = Σ_{i→j} β r_i / d_i + (1 − β) · 1/N
      • equivalently, use the Google matrix G = β M + (1 − β) [1/N]_{N×N} and run power iteration r = G r (see the power-iteration sketch below)
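
A small numpy sketch of power iteration with teleport, on a toy column-stochastic M (my own example; the dead-end handling shown is one common variant):

```python
# Hedged sketch: power iteration for PageRank with teleport.
import numpy as np

def pagerank(M, beta=0.85, eps=1e-8, max_iter=100):
    """M: column-stochastic matrix, M[j, i] = 1/d_i if i -> j."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = beta * M @ r + (1 - beta) / n      # follow a link w.p. beta, teleport w.p. 1-beta
        r_new += (1 - r_new.sum()) / n             # re-insert any mass leaked through dead ends
        if np.abs(r_new - r).sum() < eps:
            break
        r = r_new
    return r

# Toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
M = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
print(pagerank(M))
```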

Random walk with restarts and PPR

  • PPR: Personalized PageRank
    • Ranks proximity of nodes to the teleport nodes (start nodes)
      • e.g., to recommend items related to item Q, the teleport set can be {Q}
  • PageRank: teleports to any node with equal probability
    • Personalized PageRank: teleports to a subset S of nodes, possibly with different probabilities per node
    • Random walk with restarts: teleports always back to the same single start node
  • In practice: simulate random walks that restart at the query node(s) and rank nodes by visit count

Matrix Factorization

  • Matrix Factorization:

    • get node embeddings
    • shallow embedding: ENC(v) = z_v, a column of the embedding matrix Z
    • similarity: nodes u, v are similar if they are connected by an edge, i.e. z_vᵀ z_u ≈ A_{u,v}
    • exact factorization A = Zᵀ Z is generally not possible
      • learn Z approximately
      • optimize the L2 (Frobenius) norm ‖A − Zᵀ Z‖
  • DeepWalk and node2vec use a more complex node similarity, so they correspond to factorizing a more complex matrix

  • limitations of MF and random walk

    • Cannot obtain embeddings for nodes not in the training set
    • Cannot capture structural similarity
      • DeepWalk and node2vec do not capture structural similarity
    • Cannot utilize node, edge, and graph features
      • DeepWalk/node2vec embeddings do not incorporate node features
  • node embeddings based on random walks can be expressed as matrix factorization.

Lec5: Node Classification

Message Passing & Node Classification

  • not representation learning

  • collective classification: assigning labels to all the nodes in a network together

    • correlations: homophily, influence
  • Methods:

    • Relational classification
    • Iterative classification
    • Correct & Smooth

Relational Classification

  • idea: Propagate node labels across the network

  • steps:

    • initialize labeled nodes with their ground-truth labels and unlabeled nodes with 0.5
  • Let P(Y_v = 1) denote the probability that node v belongs to class 1

    • Update each node in turn, using the latest probabilities of its neighbors: P(Y_v = c) = (1 / Σ_{(v,u)∈E} A_{v,u}) · Σ_{(v,u)∈E} A_{v,u} P(Y_u = c)

    • Limits:

      • convergence is not guaranteed (use a maximum number of iterations)
      • Model cannot use node feature information

Iterative Classification

  • relational classifier does not use node attributes
  • iterative classifier: uses both node attributes and neighbor labels
  • train two classifiers: φ1(f_v) predicts the label from the node features f_v alone; φ2(f_v, z_v) predicts the label from f_v plus a summary z_v of the labels in v's neighborhood N_v
    • z_v could be:
      • histogram of the number/fraction of each label in N_v
      • most common label in N_v
      • number of different labels in N_v
    • z_v can be based on the neighbors' labels or on the neighbors' features
    • labels can be incomplete at the start
      • could use -1 to denote "unlabeled"
      • relying on predicted labels risks feeding wrong information into the classifiers; we do not want to lock in wrong information
  • Only use φ1 to initialize the labels, then use φ2 to iterate (to update z_v and the labels)

Correct & Smooth

  1. Train a base predictor (linear model or MLP)
  2. Use the base predictor to predict soft labels of all nodes.
  3. Post-process the predictions using graph structure to obtain the final predictions of all nodes.(correct + smooth)
  • steps 1 & 2 = a standard graph-agnostic supervised pipeline; step 3 is where the graph comes in
  • Correct step:
    • Diffuse training errors along edges, similar to the PageRank flow formulation
    • the diffusion matrix plays the role of a random-walk transition matrix
    • normalized diffusion matrix: Ã = D^(-1/2) A D^(-1/2)
      • A: adjacency matrix + self-loops
      • D = diag(d1, d2, …, dn), the degree matrix of A
    • correct step: E^(t+1) ← (1 − α) · E^(t) + α · Ã E^(t), where E^(0) holds the training errors (residuals) on labeled nodes and 0 elsewhere
  • Smooth Step
    • smooth step: Z^(t+1) ← (1 − α) · Z^(t) + α · Ã Z^(t), applied to the corrected soft predictions
    • smooth the predictions of the base predictor by mixing them with the neighbors' predictions
  • Example: (worked example on the slides)
  • Properties of the diffusion matrix Ã = D^(-1/2) A D^(-1/2):
    • all eigenvalues are in [-1, 1]
    • λ = 1 is an eigenvalue, with eigenvector D^(1/2) 1 (the scaled all-ones vector)
    • so repeated multiplication by Ã does not blow up - see the diffusion sketch below
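
A hedged numpy sketch of the diffusion used in both the correct and smooth steps, with a made-up 4-node graph:

```python
# Hedged sketch of Correct & Smooth style diffusion:
# A_tilde = D^{-1/2} (A + I) D^{-1/2}, then E <- (1 - alpha) * E + alpha * A_tilde @ E.
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_sl = A + np.eye(len(A))                          # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_sl.sum(axis=1)))
A_tilde = D_inv_sqrt @ A_sl @ D_inv_sqrt           # eigenvalues lie in [-1, 1]

def diffuse(E, alpha=0.8, steps=10):
    """Diffuse a node-level matrix E (training errors or soft predictions)."""
    for _ in range(steps):
        E = (1 - alpha) * E + alpha * (A_tilde @ E)
    return E

errors = np.array([[0.2], [0.0], [-0.1], [0.0]])   # residuals on training nodes, 0 elsewhere
print(diffuse(errors))
```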

Lec6 GNN

Graph Neural Networks (Deep Graph Encoders)

ENC(v) = a multi-layer GNN; it can be combined with the node similarity functions from Lec3

Basics of Deep Learning

ML = optimization

  • Supervised learning (today's setup): input x, label y
  • target: find parameters Θ that minimize the loss L(y, f(x))
  • Loss function example: L2 loss
    • we don't care about the value of L itself, only about the optimal parameters Θ
  • Loss function Example: Cross Entropy(CE)
    • the lower the loss, the closer the prediction is to one-hot
  • Gradient vector/ Gradient Descent
    • Gradient Descent is slow: need to calculate the entire dataset
  • SGD: Stochastic Gradient Descent
    • Minibatch
    • Concepts: batch size; Iteration(on mini batch); Epoch(on the entire dataset)
    • SGD is unbiased estimator of full gradient
  • Back Propagation
    • predefined building block functions
    • for each such building block g we also know its derivative g′
    • chain rule: ∂f/∂x = (∂f/∂g) · (∂g/∂x)
    • Hidden Layer
    • Forward propagation
  • Non-linearity
    • ReLU
    • Sigmoid
  • MLP(Multi-layer Perceptron)
    • the bias vector can be folded into W: add an extra input dimension that is always 1

Deep learning for graphs

  • A naive approach: feed the adjacency matrix (together with node features) into an MLP

    • limitation:
      • too many parameters
      • does not work for graphs of different sizes
      • sensitive to node ordering
  • Convolutional Networks

  • For graphs: permutation invariance

    • different node orderings ("order plans") must give the same result f(A, X)

    • target: learn functions f that are permutation invariant

    • X: the node feature matrix

    • node-level functions should be permutation equivariant: permuting the input rows permutes the output rows in the same way

    • a GNN consists of multiple permutation equivariant / invariant functions

    • Equivariance and invariance differ slightly: invariant = the output is unchanged under permutation (graph level); equivariant = the output permutes together with the input (node level)

Graph Convolutional Networks

  • Output: Node embeddings/ subgraph embeddings/ graph embeddings

  • A node takes information from its neighbors

  • Node's computation graph:

    • every node defines a different computation graph
  • Deep Model: Many Layers

    • a node can appear in multiple places

Neighborhood Aggregation

  • Basic approach: average the messages from neighbors and apply a neural network: h_v^(l+1) = σ( W_l Σ_{u∈N(v)} h_u^(l) / |N(v)| + B_l h_v^(l) ), with h_v^(0) = x_v

  • W_l and B_l are layer-specific (they depend on the layer number l)

  • Loss function:

    • Parameters: the weight matrices W_l and B_l
    • parameter sharing: the same W_l and B_l are shared by all nodes within a layer
    • rewrite into matrix formulation: H^(l+1) = σ( D^(-1) A H^(l) W_lᵀ + H^(l) B_lᵀ ), where H^(l) stacks the node embeddings as rows (see the sketch at the end of this section)
    • A is sparse, so this can use efficient sparse matrix multiplication
  • Unsupervised Training

    • use network similarity to provide loss function / as the supervision
    • Similar nodes have similar embeddings: Random walk, matrix factorization, node proximity
  • Supervised Training

    • Use a loss function such as cross-entropy on the node labels
    • Steps:
      • define a neighborhood aggregation function
      • Define a loss function on the embeddings
      • Train on a set of nodes, i.e., a batch of compute graphs
      • Generate embeddings for nodes as needed
    • shared parameters
  • Inductive Capability

    • e.g., train on the graph of organism A and generate embeddings for newly collected data about organism B
    • train on a subset of nodes, then generate node embeddings for nodes never trained on
      • works because the parameters W_l, B_l are shared across nodes
      • Create computational graph of a new node
      • forward
  • Difference between CNN and GNN

    • CNN is not permutation invariant/equivariant: reordering the pixels changes the output
    • a CNN can be seen as a GNN on a grid with a fixed neighbor ordering
    • CNN learns a different weight W_u for each neighbor pixel u (the filter depends on relative position); a GNN shares the same weight across all neighbors
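
A rough numpy sketch of the matrix-form GCN-style update H^(l+1) = ReLU(D^(-1) A H W^T + H B^T) referenced above; the shapes and the toy graph are placeholders:

```python
# Hedged sketch of one mean-aggregation GNN layer in matrix form.
import numpy as np

def gcn_layer(A, H, W, B):
    """A: (n, n) adjacency, H: (n, d_in) embeddings, W/B: (d_out, d_in) weights."""
    D_inv = np.diag(1.0 / A.sum(axis=1))                  # inverse degree matrix
    neighbor_avg = D_inv @ A @ H                          # average of neighbor embeddings
    return np.maximum(0, neighbor_avg @ W.T + H @ B.T)    # ReLU non-linearity

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
H = rng.normal(size=(3, 4))                               # layer-l node embeddings
W, B = rng.normal(size=(2, 4)), rng.normal(size=(2, 4))
print(gcn_layer(A, H, W, B).shape)                        # (3, 2): layer-(l+1) embeddings
```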

Lec7 General Perspective of GNN

Building blocks of GNN:

  • message
  • aggregation
  • Layer connectivity
  • Graph augmentation
  • learning objective

A single layer of GNN

message + aggregate

  • message:

    • each node computes a message from its previous-layer embedding: m_u^(l) = MSG^(l)(h_u^(l-1)), e.g. a linear transform m_u^(l) = W^(l) h_u^(l-1)
  • Aggregation:

    • h_v^(l) = AGG^(l)({ m_u^(l) : u ∈ N(v) }), e.g. sum, mean, or max
  • aggregate() runs only over the neighbors, so the node's own information from h_v^(l-1) can get lost

  • fix: also compute a message from h_v^(l-1) itself and include it (e.g. by concatenation or summation), then apply a non-linearity

  • GCN: classical GNN layer

    • h_v^(l) = σ( Σ_{u∈N(v)} W^(l) h_u^(l-1) / |N(v)| )
  • GraphSAGE:

    • h_v^(l) = σ( W^(l) · CONCAT( h_v^(l-1), AGG({ h_u^(l-1), ∀u ∈ N(v) }) ) )
    • Two stage aggregation
      • aggregate neighbors
      • aggregate over node itself
    • What can AGG be? (the paper proposes several kinds); it has to be differentiable
      • Mean
      • Pool
      • LSTM
    • ℓ2 normalization: optionally applied to the embeddings at every layer
      • h_v^(l) ← h_v^(l) / ‖h_v^(l)‖_2, so all embeddings have unit length
  • GAT

    • In GCN, the weighting factor α_vu = 1/|N(v)| is fixed by the degree

    • GAT makes α_vu learnable (attention weights) - see the attention sketch below

      • compute attention coefficients:

        e_vu = a( W^(l) h_u^(l-1), W^(l) h_v^(l-1) )

      • Normalize the e_vu across the neighborhood using softmax to get α_vu

      • What is a(·)?

        • e.g., a single-layer NN over the concatenated transformed embeddings
        • e_vu = Linear( CONCAT( W^(l) h_u^(l-1), W^(l) h_v^(l-1) ) )
        • the parameters of a are learned jointly with the other weights
      • Multi-head attention: run several attention heads and aggregate their outputs (concatenation or sum)

  • Batch Normalization

  • Dropout:

    • applied to the linear layer inside the message function
  • Activation function / non-linearity
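
A hedged numpy sketch of single-head GAT-style attention for one node, taking a(·,·) to be a single linear layer over the concatenated transformed embeddings (all weights are random placeholders):

```python
# Hedged sketch of GAT attention coefficients for one node's neighborhood.
import numpy as np

def gat_weights(h_v, h_neighbors, W, a):
    """h_v: (d_in,), h_neighbors: (k, d_in), W: (d_out, d_in), a: (2*d_out,)."""
    Wh_v = W @ h_v
    e = []
    for h_u in h_neighbors:
        Wh_u = W @ h_u
        e.append(np.concatenate([Wh_v, Wh_u]) @ a)      # raw attention score e_{vu}
    e = np.array(e)
    e = np.where(e > 0, e, 0.2 * e)                     # LeakyReLU, as in the paper
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()                         # softmax over the neighborhood
    return alpha                                        # weights used to aggregate W @ h_u

rng = np.random.default_rng(0)
h_v, h_nb = rng.normal(size=4), rng.normal(size=(3, 4))
W, a = rng.normal(size=(2, 4)), rng.normal(size=4)
print(gat_weights(h_v, h_nb, W, a))                     # sums to 1
```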

Stack GNN layers

  • over-smoothing
    • receptive fields (sets of K-hop neighbors) overlap quickly as layers are added, so node embeddings converge to the same values
    • adding more GNN layers does not always help
  • How to make a shallow GNN more expressive:
    • make the message/aggregation computation itself deeper (e.g., MLPs instead of single linear layers)
    • Add layers that do not pass messages (pre-/post-processing MLP layers)
    • Add skip connections
      • mix the previous embedding into the new one, e.g. h_v^(l) = F(h_v^(l-1)) + h_v^(l-1), so earlier-layer information is preserved

Graph Manipulation Approaches

  • Graph feature manipulation

    • feature augmentation
      • assign constant node feature
      • assign one-hot node feature
      • Add traditional node feature/embedding
  • Graph Structure manipulation

    • too sparse: add virtual nodes/edges

      • add virtual edges: use A + A² instead of A (connects 2-hop neighbors; common for bipartite graphs such as author-paper)
      • add a virtual node: connect it to every node in the graph, so any two nodes are at most 2 hops apart
    • too dense: sample neighbors when message passing

      • how does this differ from dropout?
        • dropout randomly zeroes activations and is used only at training time; neighborhood sampling is applied at both training and test time
      • neighborhood sampling need not be uniformly random
        • e.g., use random walks to score neighbors and keep the most important ones
        • sort the neighbors by visit count and pick the top k
    • too large: sample a subgraph to compute embeddings

Lec8 Prediction

Prediction with GNNs

  • input graph => GNN => node embeddings => prediction head

    • different task levels require different prediction heads
  • node level:

    • directly make prediction using node embeddings
  • edge-level prediction

    • Head_edge could be:

      • concatenation + linear

      • dot product (only for 1-way prediction)

      • generalized dot product for k-way prediction

        ŷ_uv^(k) = (h_u^(L))ᵀ W^(k) h_v^(L), with k trainable matrices W^(1) … W^(k); concatenate the k scores (similar to multi-head attention)

  • Graph-level prediction: Head_graph makes a prediction from all the node embeddings of the graph

    • similar to AGG() in a GNN layer
    • Head_graph could be:
      • global mean pooling
      • global max pooling
      • global sum pooling
    • global pooling over a large graph loses information
  • Hierarchical Global Pooling

    • DiffPool
      • GNN A: compute embeddings
      • GNN B: compute the cluster that a node belongs to

GNN training

  • supervised / unsupervised
  • classification / regression
    • classification loss: cross Entropy
    • Regression loss: MSE loss
  • Metrics for Binary Classification:
    • Accuracy
    • Precision
    • Recall
    • F1-score
  • ROC curve/ROC AUC

Setting-up GNN prediction Tasks

  • dataset split:
    • Training set
    • validation set
    • test set
  • Solutions:
    • Transductive setting: the input graph can be observed in all dataset splits
      • only the node labels are split
      • compute the loss only on the nodes in the current split
    • Inductive setting: break the edges between splits to get multiple independent graphs
  • Link Prediction task:
    • We need to create the labels and dataset splits on our own
    • hide some edges from the GNN
    • Assign 2 types of edges in the original graph
      • message edges, used for message passing
      • supervision edges, held out and not given to the GNN as input
        • calculate the loss on the supervision edges
    • Transductive link-prediction split
      • 4 types of edges: training message / training supervision / validation / test edges
        • growing the edge set as (1) => (2) => (3) can be viewed as edges appearing over time, which is very natural

Lec9 GNN Expressiveness

  • GCN(mean-pool): Element-wise mean pooling + linear + ReLU non-linearity

    • a GNN captures local neighborhood structure through the computational graph
    • GNN only see node features in computational graph
    • Cannot distinguish nodes that have:
      • the same computational graph, and
      • the same node features
  • GraphSAGE(max-pool)

  • injective function: most expressive GNN model should map subtrees to the node embeddings injectively

    • the most expressive GNN use injective function
    • GNN can fully distinguish different subtree structures if every step of its neighbor aggregation is injective
    • expressiveness depends on the neighborhood aggregation function
      • GCN (element-wise mean-pool): fails on multisets with the same proportions of colors
      • GraphSAGE (element-wise max-pool): fails on multisets with the same set of distinct colors
  • Design injective function using neural network

    • any injective multiset function can be expressed as Φ( Σ_{x∈S} f(x) ) for some non-linear functions Φ and f
    • use a multi-layer perceptron to approximate Φ and f (universal approximation)
    • GIN(Graph Isomorphism Network):
      • GIN uses a neural network to model the injective HASH function of WL kernel
      • the key is to use element-wise sum pooling, not max/mean - see the GIN layer sketch below
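
A minimal numpy sketch of a GIN update, h_v ← MLP((1 + ε)·h_v + Σ_{u∈N(v)} h_u), with a random two-layer MLP standing in for the learned one:

```python
# Hedged sketch of one GIN layer: sum pooling keeps the update injective.
import numpy as np

def gin_layer(A, H, W1, W2, eps=0.0):
    """A: (n, n) adjacency, H: (n, d) embeddings; W1/W2: MLP weights."""
    agg = (1 + eps) * H + A @ H                   # element-wise SUM over neighbors + self term
    hidden = np.maximum(0, agg @ W1)              # MLP layer 1 (ReLU)
    return hidden @ W2                            # MLP layer 2

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
H = rng.normal(size=(3, 4))
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 4))
print(gin_layer(A, H, W1, W2).shape)
```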

Lec10 KG Completion

  • KG note: a relation embedding r = 0 is not a problem by itself, but it forces h = t, so head and tail become indistinguishable (this is why symmetric relations are hard for translation models)

Heterogeneous Graph

  • multiple edges types and multiple node types

  • extend GCN to handle heterogeneous graph

Relational GCN

  • aggregate separately for each relation type

  • color = different relation type

  • RGCN update: h_v^(l+1) = σ( Σ_r Σ_{u∈N_v^r} (1/c_{v,r}) W_r^(l) h_u^(l) + W_0^(l) h_v^(l) ), with c_{v,r} = |N_v^r| (see the layer sketch after this section)

    • sum over all relation types, and over all neighbors of each type
    • Message: each neighbor of a given relation contributes (1/c_{v,r}) W_r^(l) h_u^(l); the self-loop contributes W_0^(l) h_v^(l)
    • Aggregation: sum the messages from neighbors and the self-loop, then apply the activation
  • Scalability: (L layers of the computational graph)

    • Each relation has L matrices: W_r^(1), …, W_r^(L), each of size d^(l+1) × d^(l)
    • A lot of parameters -> overfitting issue
    • Two methods to regularize the weights (to avoid too many parameters and overfitting):
      • use block-diagonal matrices
      • Basis / Dictionary learning
  • Block Diagonal Matrices

    • make the weights sparse

    • use B low-dimensional blocks; the number of parameters per matrix drops from d^(l+1) × d^(l) to B × (d^(l+1)/B) × (d^(l)/B)

  • Basis Learning:

    • Share weights across different relations
    • W_r = Σ_{b=1}^{B} a_{rb} · V_b, where the a_{rb} are relation-specific and the basis matrices V_b are shared across all relations
      • each relation only needs to learn its coefficients {a_{rb}}
      • i.e. only B scalars per relation
    • intuitively, B groups of similar relations; within a group, relations share the same basis matrices
  • Eg, node classification with RGCN

    • use the final-layer representation h_v^(L) to output the probability of each class
  • E.g.: link prediction with RGCN - predict which relation (if any) holds between two entities

    • for a candidate triple (E, r3, A), the input is the final-layer embeddings of E and A
    • use a relation-specific score function, e.g. f_{r3}(h_E, h_A) = h_Eᵀ W_{r3} h_A
    • use negative sampling
    • Training supervision edges / training message edges
      • split the existing edges into message edges (given to the GNN) and supervision edges (scored by the loss)
    • negative edges: edges that do not actually exist
      • i.e. do not belong to the existing edges
      • existing edges = training message edges + training supervision edges
      • Maximize the score of the training supervision edges
      • Minimize the score of the negative edges
    • validation edges: not visible during training
    • training message edges & training supervision edges: together they are all the existing (training) edges
    • Evaluation:
      • rank the score of each validation/test edge against the scores of negative edges obtained by perturbing its tail (or head)
    • Metrics:
      • Hits@k
      • Reciprocal Rank
    • link prediction becomes a ranking problem
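
A rough numpy sketch of the RGCN layer referenced above, with two toy relation types; the weights and the graph are placeholders (no basis or block-diagonal regularization shown):

```python
# Hedged sketch of one RGCN layer: per-relation normalized sums plus a self-loop transform.
import numpy as np

def rgcn_layer(rel_adjs, H, W_rel, W_self):
    """rel_adjs: list of (n, n) adjacency matrices, one per relation; H: (n, d_in)."""
    out = H @ W_self.T                                  # self-loop message W_0 h_v
    for A_r, W_r in zip(rel_adjs, W_rel):
        deg = A_r.sum(axis=1, keepdims=True)
        deg[deg == 0] = 1.0                             # avoid division by zero
        out += (A_r @ (H @ W_r.T)) / deg                # (1/c_{v,r}) * sum_u W_r h_u
    return np.maximum(0, out)                           # ReLU

rng = np.random.default_rng(0)
A_r1 = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=float)   # relation 1 edges
A_r2 = np.array([[0, 0, 1], [0, 0, 0], [1, 0, 0]], dtype=float)   # relation 2 edges
H = rng.normal(size=(3, 4))
W_rel = [rng.normal(size=(2, 4)) for _ in range(2)]
W_self = rng.normal(size=(2, 4))
print(rgcn_layer([A_r1, A_r2], H, W_rel, W_self).shape)
```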

Knowledge Graphs Completion

  • common knowledge graph databases are missing many true edges -> KG completion
  • Task: For a given (head, relation), we predict missing tails
  • Edges in KG are triples (h, r, t)
  • Model entities and relations in a d-dimensional embedding/vector space ℝ^d.
  • Given a true triple (h, r, t), the goal is that the embedding of (h, r) should be close to the embedding of t
  • Method: shallow embeddings (we are not training a GNN here)
    • GNNs are not used together with these shallow embeddings

TransE

  • translation via vector sum: h + r ≈ t, score f_r(h, t) = −‖h + r − t‖
  • the score of a true triple should be higher than the score of random negative (corrupted) triples; train with a contrastive margin loss - see the sketch below
Connectivity patterns:
  • Symmetry relation
  • Inverse relation
  • Composition/ Transitive Relations
  • 1-to-N relations
  • TransE cannot model symmetric relations & 1-to-N relations [‼️ know the reasons: a symmetric relation would force r = 0 and hence h = t; a 1-to-N relation would force all the tails to share the same embedding]
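
A tiny sketch of the TransE score and a contrastive margin loss on one positive triple and one corrupted negative (random embeddings, hypothetical margin γ = 1):

```python
# Hedged sketch of TransE scoring plus a margin-based contrastive loss.
import numpy as np

def transe_score(h, r, t):
    return -np.linalg.norm(h + r - t)                 # higher is better: want h + r close to t

def margin_loss(pos, neg, gamma=1.0):
    """Push the positive score above the negative score by at least the margin gamma."""
    return max(0.0, gamma - pos + neg)

rng = np.random.default_rng(0)
h, r, t = rng.normal(size=8), rng.normal(size=8), rng.normal(size=8)
t_corrupt = rng.normal(size=8)                        # negative: randomly replaced tail
print(margin_loss(transe_score(h, r, t), transe_score(h, r, t_corrupt)))
```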

TransR

  • design a new space for each relation and do translation in relation-specific space
  • M_r: projection matrix for relation r
  • score for (h, r, t): f_r(h, t) = −‖M_r h + r − M_r t‖
  • TransR cannot model composition relations [‼️ reason below]
    • each relation uses a different space, so translations in different relation spaces cannot be composed / mapped back

DistMult

  • bilinear modeling

  • the score can be seen as a dot product between (h ∘ r) and t

  • Score function: f_r(h, t) = Σ_i h_i · r_i · t_i = <h, r, t>

  • cannot model antisymmetric relations & inverse relations & composition relations

ComplEx

  • embeds entities and relations in Complex vector space
  • conjugate: the score uses the complex conjugate of the tail, f_r(h, t) = Re( Σ_i h_i · r_i · conj(t_i) ) - see the score sketch below
  • ComplEx can model: antisymmetric, symmetric, inverse, 1-N
  • can not model: composition
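
A short sketch of the two score functions, mainly to show why DistMult is symmetric in h and t while the conjugate in ComplEx breaks that symmetry (random embeddings):

```python
# Hedged sketch of the DistMult and ComplEx score functions.
import numpy as np

def distmult_score(h, r, t):
    return np.sum(h * r * t)                          # bilinear <h, r, t>; symmetric in h and t

def complex_score(h, r, t):
    return np.real(np.sum(h * r * np.conj(t)))        # conjugate on t breaks the symmetry

rng = np.random.default_rng(0)
h, r, t = rng.normal(size=8), rng.normal(size=8), rng.normal(size=8)
hc, rc, tc = (rng.normal(size=8) + 1j * rng.normal(size=8) for _ in range(3))
print(distmult_score(h, r, t), distmult_score(t, r, h))      # identical: cannot model antisymmetry
print(complex_score(hc, rc, tc), complex_score(tc, rc, hc))  # generally different
```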

Summary

  • (slide: table comparing the expressiveness of TransE / TransR / DistMult / ComplEx)

  • These methods can not model hierarchical relations

  • The methods differ in how they model the geometry of a relation and how it moves entities around in the embedding space

  • in practice, these shallow embeddings are usually not combined with a neural network

Lec11 Reasoning over KGs

  • multi-hop query
  • difficulty: the KG is incomplete and unknown, missing relations;
    • a fully completed KG would be a dense graph
    • so we implicitly impute and account for the missing edges of the incomplete KG
      • Two ways:
        • Embedding query into a single point (like TransE)
        • Embed query into a hyper-rectangle(box)

Predictive Query

  • Types:
    • one-hop-query:
      • find the tail
      • (e:Fulvestrant, (r:Causes))
    • Path query: (e:Fulvestrant, (r:Causes, r:Assoc))
      • image-20211026134054419
      • Query plan of path queries is a chain
      • Traverse the graph
      • output: a set of nodes
    • Conjunctive Query: ((e:BreastCancer, (r:TreatedBy)), (e:Migraine, (r:CausedBy))
      • start from 2 anchor entities and intersect the resulting sets
    • KG completion <-> one-hop query
  • Difficulty: the KG is incomplete and unknown
    • completing the KG first gives a dense graph, and traversing it is too expensive
    • traversal problem -> prediction task
  • Task: answer path-based queries over an incomplete knowledge graph
    • generalization of the link prediction
  • Hypergraph: a (hyper)edge can connect more than two nodes

Answering predictive queries

  • Map the graph & the query into the embedding space
    • the answers are the nodes close to the query embedding
  • Idea 1: Traversing the KG in vector space
    • generalize TransE: q = h + r (r: learned relation embedding)
    • score each potential tail t by its distance to q
    • Multi-hop: q = v_a + r_1 + … + r_n, chaining the relation translations
    • TransE cannot model 1-to-N relations, so this TransE-style query embedding cannot handle 1-to-N either
    • TransE can handle compositional relations, which is why the translations can be chained
  • Can we answer logical conjunction (AND) operations?
    • requires set intersection
    • (slide: example of a conjunctive query plan)
    • the gray nodes in the picture are sets of entities
  • Idea2: Query2Box

Query2Box

query -> box

  • intersection of 2 boxes is a box
    • Things to learn:
      • entity embeddings: boxes with zero volume (points)
      • Relation embeddings: a center shift and an offset increase (each relation maps a box to a box)
      • the intersection operator f
    • optimize the node positions, the boxes, and the operators jointly, so that answer nodes fall inside the query box
  • for every relation embedding r -> projection Operator P
    • cen(q') = cen(q) + cen(r)
    • off(q') = off(q) + off(r)
    • a Box represents a set of nodes
    • the projection moves (re-centers) and resizes the box using the relation embedding
  • Geometric Intersection Operator 𝓘 -> learned
    • the new center is a weighted combination of the input boxes' centers (weights from a learned attention over the box representations)
    • the new offset shrinks the box: element-wise min of the input offsets scaled by a learned factor in (0, 1)
      • a small neural network is used to extract the representation of each input box
  • score function:
    • based on the L1 distance from the entity embedding v to the query box q
    • if the entity is inside the box, only a down-weighted "inside" distance counts
    • if it is outside, the distance to the box boundary is added in full
    • d_box(q, v) = d_outside(q, v) + α · d_inside(q, v), with 0 < α < 1
    • f_q(v) = −d_box(q, v)
    • outside the box: a lot of penalty; inside the box: only a little penalty (see the distance sketch at the end of this section)
  • intersections can happen anywhere in the query plan (including at an early stage)
  • How to implement AND-OR query?
    • Union is problematic -> allowing union over arbitrary queries requires an embedding dimension that grows with the number of entities
    • image-20211026143051229
    • For 3 points, 2 dimensions are enough; for 4 points they are not.
      • If we want "v2 or v4" but not v3, we cannot carve that out in 2 dimensions; we need another dimension to separate v3.
  • take all unions out and only do the union at the last step
    • union at the last step (disjunctive normal form)
      • first embed and intersect the conjunctive parts, then take the union of the resulting boxes
    • i.e. a disjunction (union) of conjunctive queries
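
A hedged sketch of the box distance d_box(q, v) = d_out + α·d_in described above; the center/offset values are made up:

```python
# Hedged sketch of the Query2Box distance; the score is f = -d_box.
import numpy as np

def box_distance(center, offset, v, alpha=0.2):
    """center, offset, v: (d,) arrays; 0 < alpha < 1 downweights the in-box distance."""
    lower, upper = center - offset, center + offset
    d_out = np.linalg.norm(np.maximum(v - upper, 0) + np.maximum(lower - v, 0), ord=1)
    clamped = np.minimum(upper, np.maximum(lower, v))          # v projected onto the box
    d_in = np.linalg.norm(center - clamped, ord=1)
    return d_out + alpha * d_in

center, offset = np.array([0.0, 0.0]), np.array([1.0, 1.0])
print(box_distance(center, offset, np.array([0.5, 0.5])))   # inside: small penalty
print(box_distance(center, offset, np.array([3.0, 0.0])))   # outside: large penalty
```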

How to Train

  • Trainable parameters:
    • Relation: 2 vectors:
      • one to move the box into embedding space
      • one to resize
    • Entity embeddings
    • Intersection operator
  • training set = queries + answers + negative answers (non-answers)
    • maximize the score f_q(v) for answers v ∈ [[q]] and minimize the score f_q(v′) for negative answers v′
    • loss: −log σ(f_q(v)) − log(1 − σ(f_q(v′)))
  • to generate training queries, start with the answer entity and go backwards
    • i.e. execute the query plan backwards
  • Steps to instantiate a query q:
    • start with a query template (an abstraction of the query)
    • generate a query by instantiating every variable with a concrete entity and relation from the KG
      • Start by instantiating the answer node: randomly pick one entity from the KG as the root node
      • Randomly sample one relation associated with the current entity, and follow it backwards

Example

  • "string instrument" --(projection)--> a set of different string instruments --(projection)--> the instrumentalists who play string instruments

  • FP / false positive: an entity inside the box that does not play a string instrument

    • FN / false negative: an entity that plays a string instrument but falls outside the box

Lec12 Motif

Fast Neural Subgraph Matching and Counting

Subgraphs and Motifs

  • node-induced subgraph
  • edge-induced subgraph
  • Graph isomorphism:
    • G1 map to. G2, bijection
  • Subgraph isomorphism
  • Graph-level frequency:
    • how many subsets of nodes of the target graph induce a subgraph isomorphic to the query graph
  • Node-level subgraph frequency definition
    • Node-anchored subgraph
  • motifs:
    • Pattern
    • Recurring: frequency
    • Significant: random
  • A graphlet describes what happens around a given node
    • a graphlet is rooted / anchored at a node
    • Motifs do not have a center/root
  • Subgraph frequency
    • number of unique subsets
    • Node-level Subgraph Frequency:
      • Anchor map to different nodes
  • Motif Significance
    • Subgraphs that occur in a real network much more often than in a random network
    • Generating random graph - NULL model
      • ER random graph
        • create n nodes; for each pair of nodes (u, v), flip a biased coin with bias p to decide whether to add the edge
      • Keep same degree of the nodes, generate random graph
        • create each node with as many spokes (half-edges) as its degree
        • randomly pair up the spokes to form edges
      • Generate random graph: switching
        • select a pair of edges and exchange endpoint
        • each node keeps its degree
        • computationally expensive
    • Use statistical measures to evaluate how significant is each motif
      • use the Z-score
        • negative value: under-representation
        • positive value: over-representation
      • Z_i = (N_i^real − mean(N_i^rand)) / std(N_i^rand), where N_i counts occurrences of motif i
      • Network significance profile(SP): normalized Z-scores

Subgraph Representations/Matching

  • Task: is a query graph a subgraph in the target graph

  • Do it in embedding space

    • consider a binary prediction
    • work with node-anchored definitions
  • decompose input big graph into small neighborhoods

  • Steps:

    • using node-anchored definitions, with anchor
    • obtain the k-hop neighborhood around the anchor
  • G_Q is a subgraph of G_T iff the embedding of G_Q is less than or equal to the embedding of G_T in all coordinates - order embedding space

  • partial ordering

    • embeddings are non-negative in all dimensions
  • The subgraph isomorphism relationship can be encoded in order embedding space

    • Transitivity:
    • Anti-symmetry
    • Closure under intersection
  • Loss function: max-margin loss

    E(G_q, G_t) = Σ_i max(0, z_q[i] − z_t[i])², the amount by which the order constraint is violated

  • Train on positive and negative examples

    • learn from positive and negative samples
    • positive sample (G_q is a subgraph of G_t): minimize E(G_q, G_t)
    • negative sample: minimize max(0, α − E(G_q, G_t)), i.e. push the violation above a margin α
      • prevents the model from learning the degenerate strategy of moving embeddings further and further apart forever
  • Generate positive and negative examples

    • use BFS sampling
  • Subgraph Predictions on New Graphs

    • Gq: query, Gt: target
    • repeat the check for all possible anchors: traverse every candidate anchor node in G_t

Finding Frequent Subgraphs

  • Representation learning

  • SPMiner: identify frequent motifs

    • Decompose input graph into neighborhoods
    • Embed neighborhoods into an order embedding space
    • Grow a motif (motif walk) by iteratively choosing a neighbor in G_T and adding it to the motif
  • Super-graph region:

    • the estimated frequency of a motif = the number of neighborhood embeddings that lie in the super-graph region of the motif embedding (i.e. dominate it in every coordinate)
  • Motif walk: grow greedily so that many neighborhood embeddings stay in the super-graph region

Lec13 Recommender System

Recommender system as a bipartite graph

  • Given: past user-item interactions
  • Task: predict new items each user will interact with in the future
    • link prediction task
    • recommend K items
      • K is very small
  • Evaluation Metric: Recall@K

Embedding based model

  • for training, Recall@K is not differentiable
    • so use one of two surrogate (differentiable) loss functions:
    • binary loss
    • Bayesian Personalized Ranking loss:
  • Positive edges and negative edges
    • Binary loss pushes the scores of positive edges higher than those of negative edges
  • Binary Loss Issue:
    • it considers all positive/negative edges globally
    • it is not personalized: it does not treat each user separately
    • but the metric (Recall@K) is personalized (computed per user)
  • BPR(Bayesian Personalized Ranking loss)
    • define positive/negative edges rooted at each user u*
    • for a fixed user u*, the scores of positive items should be higher than the scores of negative items
    • average the per-user loss over all users
    • Loss(u*) = (1 / (|E_pos(u*)|·|E_neg(u*)|)) Σ_{pos} Σ_{neg} −log σ( f(u*, v_pos) − f(u*, v_neg) ) - see the sketch at the end of this section
  • Collaborative filtering - Why embedding models work
    • use similar users (collaborators) to filter/recommend items
    • Embedding-based models can capture similarity of users.
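
A minimal sketch of the per-user BPR loss with dot-product scores; the embeddings are random placeholders, and the full training loss averages this quantity over the users in a mini-batch:

```python
# Hedged sketch of the BPR loss for one user.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_loss_user(u_emb, pos_items, neg_items):
    """u_emb: (d,); pos_items: (P, d); neg_items: (N, d). Score = dot product."""
    pos_scores = pos_items @ u_emb                     # (P,)
    neg_scores = neg_items @ u_emb                     # (N,)
    diffs = pos_scores[:, None] - neg_scores[None, :]  # all (positive, negative) pairs
    return -np.log(sigmoid(diffs)).mean()

rng = np.random.default_rng(0)
u = rng.normal(size=8)
print(bpr_loss_user(u, rng.normal(size=(3, 8)), rng.normal(size=(5, 8))))
```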

Neural Graph Collaborative Filtering

  • no user/item features

  • based on shallow encoders

    • score(u, v) = z_uᵀ z_v on shallow user/item embeddings
    • the graph structure is only implicitly captured by the training objective
    • if (u, v) is an edge, the score f(u, v) should be high
    • only first-order graph structure (direct edges) is captured

NGCF

  • explicitly incorporates high-order graph structure when generating user/item embeddings
  • an ordinary GNN assumes user/item features and only learns the GNN parameters
    • here there are no node features, so a GNN cannot be used directly
  • NGCF jointly learns both user/item embeddings and GNN parameters
  • Steps:
    • prepare shallow embedding
    • use a GNN to propagate the shallow embeddings over the graph
      • different GNN parameters for user and for items

LightGCN

  • NGCF learns two kinds of parameters
    • but the shallow learnable embeddings are already quite expressive, so the GNN part can be simplified
  • Overview of LightGCN:
    • Adjacency matrix
    • Use matrix formulation of GCN
    • Simplification of GCN - removing non-linearity
  • Simplifying GCN
    • Diffusing node embeddings
    • GCN layer: E^(k+1) = ReLU( Ã E^(k) W^(k) )
      • Diffusion matrix Ã = D^(-1/2) A D^(-1/2), with A the bipartite user/item adjacency matrix
    • Simplifying GCN:
      • remove the ReLU non-linearity
      • then E^(K) = Ã^K E, with the learnable weight matrices dropped/absorbed
      • each matrix multiplication diffuses current embeddings to their one-hop neighbors
  • Multi-Scale Diffusion
    • E_final = α_0 E^(0) + α_1 E^(1) + … + α_K E^(K)
    • the α_k are hyper-parameters
    • LightGCN simply takes the embeddings of every layer and averages them (α_k = 1/(K+1)) - see the sketch at the end of this section
  • Similar users share common neighbors and are expected to have similar future preferences
  • not able to deal with new products - all the items should be available when training
    • this is the cold-start problem of recommender systems
  • Difference from GCN:
    • GCN add self-loop in neighborhood definition
    • LightGCN: self loop is not added in the neighborhood definition
  • Difference from MF:
    • MF uses shallow user/item embeddings for scoring
    • LightGCN uses diffused user/item embeddings for scoring
  • LightGCN: learnable parameters are all in the shallow input node embeddings (E)
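
A rough numpy sketch of LightGCN diffusion and layer averaging on a made-up 2-user / 2-item bipartite graph; E0 stands for the learnable shallow embeddings:

```python
# Hedged sketch of LightGCN: E^(k+1) = A_tilde @ E^(k), final embedding = average over layers.
import numpy as np

def lightgcn_embeddings(A, E0, K=3):
    """A: (n, n) symmetric bipartite adjacency (users + items); E0: (n, d) shallow embeddings."""
    deg = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1)))
    A_tilde = D_inv_sqrt @ A @ D_inv_sqrt
    E, layers = E0, [E0]
    for _ in range(K):
        E = A_tilde @ E                   # diffuse to one-hop neighbors (no ReLU, no W)
        layers.append(E)
    return np.mean(layers, axis=0)        # multi-scale average, alpha_k = 1/(K+1)

rng = np.random.default_rng(0)
A = np.array([[0, 0, 1, 1],              # users are rows/cols 0-1, items are 2-3
              [0, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)
E0 = rng.normal(size=(4, 8))
print(lightgcn_embeddings(A, E0).shape)  # score(u, v) = dot product of the final embeddings
```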

PinSAGE

  • does not learn shallow embeddings for users/items
    • learning them would not be inductive; PinSAGE is inductive (it uses node features)
  • picture-to-picture (pin-to-pin) recommendation
  • edges: pictures pinned to the same board
  • key idea:
    • share the same set of negative samples across all users in a mini-batch
    • Curriculum learning:
      • hard negative example: a "thank you" card vs. a "happy birthday" card
      • to make extremely fine-grained predictions, add hard negatives
        • at the n-th epoch, add n−1 hard negative items
      • generating hard negatives: take items ranked 200th-5000th for the user by another recommender system
        • item nodes that are close to, but not connected to, the user node
      • hard negatives are shared within a mini-batch
    • Negative sampling: sample negatives at different distances (difficulty levels)

Lec14 Community Detection

Flow of Information

  • Two perspectives on friendships:

    • an edge inside a triangle tends to be a strong tie (structurally embedded)
      • long-range edges allow you to gather information from different parts of the network and, e.g., get a job
    • Weak ties: long-range edges
    • Strong relationships sit within clusters; weak relationships connect different clusters
    • clusters: a friend's friend easily becomes a friend
  • Triadic closure = high clustering coefficient

  • Edge overlap: O_ij = |N(i) ∩ N(j) \ {i, j}| / |N(i) ∪ N(j) \ {i, j}|

Network Communities

  • How to discover clusters
  • Modularity Q: measure of how well a network is partitioned into communities
    • Need a null model
  • Null model: Configuration Model
    • same degree distribution, uniformly random connections
    • a graph with m edges has 2m half-edges (edge endpoints)
    • the result is a multigraph
    • expected number of edges between i and j: k_i · k_j / 2m (each of i's k_i half-edges lands on one of j's k_j half-edges with probability k_j / 2m)
    • Q(G, S) = (1/2m) Σ_{s∈S} Σ_{i∈s} Σ_{j∈s} ( A_ij − k_i k_j / 2m )
  • Modularity Q = 0: no community structure beyond random
    • Modularity applies to both weighted and unweighted networks
    • k_i, k_j are the degrees of the nodes
    • Q → −1: anti-community structure (we call ourselves a community, but our common friends are in another group)
    • Q → 1: well-structured communities
  • We can identify communities by maximizing modularity (see the sketch below)
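
A small numpy sketch of the modularity formula above, evaluated on a made-up graph of two triangles joined by a single edge:

```python
# Hedged sketch: Q = (1/2m) * sum over pairs in the same community of (A_ij - k_i k_j / 2m).
import numpy as np

def modularity(A, communities):
    """A: (n, n) adjacency; communities: length-n array of community ids."""
    k = A.sum(axis=1)                     # node degrees
    two_m = A.sum()                       # 2m = sum of degrees
    same = communities[:, None] == communities[None, :]
    return ((A - np.outer(k, k) / two_m) * same).sum() / two_m

A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
print(modularity(A, np.array([0, 0, 0, 1, 1, 1])))   # the two triangles: Q > 0 (about 0.36)
print(modularity(A, np.array([0, 1, 0, 1, 0, 1])))   # a mismatched partition: negative Q here
```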

Louvain Algorithm: community detection

  • greedy algorithm
    • Phase1: Modularity is optimized by allowing only local changes to node-communities memberships
    • Phase2: group into super-nodes, consider as a weighted graph
      • weight = number of edges in the original graph
  • produces a hierarchical structure; the algorithm is heuristic
  • Output of Louvain algorithm depends on the order the nodes are considered
  • Calculating ΔQ when moving node i into community C:
    • ΔQ(i → C) = [ (Σ_in + k_{i,in}) / 2m − ( (Σ_tot + k_i) / 2m )² ] − [ Σ_in / 2m − (Σ_tot / 2m)² − (k_i / 2m)² ]
    • Σ_in: sum of edge weights inside C; Σ_tot: sum of edge weights incident to nodes in C; k_{i,in}: sum of edge weights between i and C
  • Within a community, every edge is counted twice (once from each endpoint)
    • an edge between communities is counted once per community (in total twice: one half attributed to each endpoint's community)
  • The number of communities cannot be fixed ahead of time - it is determined by the algorithm

Detect Overlapping Communities

  • the model is a bipartite graph

    • community nodes
    • Original nodes
    • relationship: memberships M
  • AGM: generative model:

    • generate a network from the bipartite community-affiliation graph
  • Target: from an unlabeled graph, generate bipartite graph

    • Given the membership-strength matrix F, generate a probabilistic adjacency matrix
    • Maximum likelihood estimation
      • Given the real graph G, find the parameters F that maximize the likelihood of G
    • relax AGM to real-valued membership strengths F_uA ≥ 0:
      • P(u, v) = 1 − exp(−F_u F_vᵀ)
    • nodes u and v can be connected via multiple communities
    • F has size (#nodes × #communities)
    • use the log-likelihood to avoid numerically tiny values
  • NOCD model:

    • image-20211111134725276
    • Use the log-likelihood as the objective: L(F) = Σ_{(u,v)∈E} log(1 − exp(−F_u F_vᵀ)) − Σ_{(u,v)∉E} F_u F_vᵀ
    • Neural Overlapping Community Detection(NOCD)
    • use a GNN to output F from the adjacency matrix and node features; learn the GNN weights
    • Real-world graphs are often extremely sparse.
      • the model would spend most of its time on edges that don't exist (the second term)
      • Use negative sampling to approximate the second term (the non-edges)
        • inside a given batch
    • Decide the number of communities:
      • look at objective function, change number of communities and see the change of L(F)

Lec15 Deep Generative Models for Graphs

Graph generation

Graph Generation: Deep Graph Decoders

  • we only have access to samples from p_data(G)
    • learn p_model(G; θ), then sample more graphs from p_model
  • Goals:
    • fit the graph model: estimate θ
      • via maximum likelihood
    • generate new samples from it
  • Auto-regressive models:
    • chain rule: p(x) = Π_t p(x_t | x_1, …, x_{t−1})
      • each generation step is conditioned on the results of the previous steps

GraphRNN:

  • RNN:

    • hidden state + input => new hidden state + output

    • a plain RNN used for sequence generation is deterministic

  • generate graph sequentially: sequentially adding nodes and edges

  • A graph + a node ordering = A sequence of sequences

    • the model learns how to "print" the adjacency matrix, column by column
    • when adding node x, ask which of the previous nodes x connects to
  • Node-level RNN & edge level RNN

  • We want our model stochastic, not deterministic

    • the output at each step is the probability of an edge existing
    • flip a coin with that bias to decide whether the edge exists
  • Training:

    • the RNN gets supervision at every step (teacher forcing: feed in the ground-truth edge and compute the loss against it)
    • the edge-level RNN only predicts whether the new node connects to each of the previous nodes
  • Test:

    • use GraphRNN's own predictions
  • Node ordering - BFS

    • Benefits: 1. reduce possible node orderings; 2. reduce steps for edge generation
  • How to compare two graph statistics: Earth Mover Distance (EMD)

  • How to compare two sets of graph statistics

    • Maximum Mean Discrepancy (MMD) based on EMD
  • GCPN: combines graph representation + RL

    • a GNN captures the hidden state of the partially generated graph explicitly

Exam Prep OH

  • Miscellaneous Problems
