Fundamentals of Learning on Graph Models

Kübra Taşcı Kardaş · Today 16:53

Machine learning has historically thrived on regular structures. Images are grids of pixels; text is a sequence of tokens. But a significant portion of real-world data does not conform to either shape. Molecules, social networks, citation graphs, road maps, knowledge bases: all of these are graphs.

Understanding why standard deep learning architectures fail on such data, and what Graph Neural Networks do differently, is the starting point for anyone working in this space.

What is a Graph?

A graph G = (V, E) is defined by a set of nodes V and a set of edges E connecting pairs of nodes. Edges can be undirected, as in a Facebook friendship where the relationship is mutual, or directed, as in a Twitter follow where the connection is asymmetric. Edges can also carry weights: the travel time between two intersections, or the ticket price between two cities.
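
The three edge flavors described above can be sketched with plain Python containers. This is only an illustration: the node names, the weights, and the helper functions are hypothetical and not taken from any particular library.

```python
# Hypothetical toy graph: names and numbers are illustrative only.
nodes = {"a", "b", "c"}

# Directed edges: (source, target). "a follows b" does not imply "b follows a".
directed_edges = {("a", "b"), ("b", "c")}

# Undirected edges: store each pair as a frozenset so order is irrelevant.
undirected_edges = {frozenset(("a", "b")), frozenset(("b", "c"))}

# Weighted edges: map each (source, target) pair to a weight, e.g. travel time.
weighted_edges = {("a", "b"): 4.5, ("b", "c"): 12.0}


def has_directed_edge(u, v):
    """True only if the edge exists in exactly this direction."""
    return (u, v) in directed_edges


def has_undirected_edge(u, v):
    """True regardless of the order in which the endpoints are given."""
    return frozenset((u, v)) in undirected_edges
```

With this layout, has_directed_edge("b", "a") is False while has_undirected_edge("b", "a") is True, which is the asymmetry the follow and friendship examples describe.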

Graphs can be homogeneous, where all nodes and edges are of the same type, or heterogeneous, where multiple types coexist. A citation network like Cora is homogeneous: every node is a paper and every edge is a citation. An e-commerce platform is heterogeneous: it has User, Product, and Brand nodes, connected by edges of types such as purchased, rated, or produces. Heterogeneous graphs require specialized architectures to handle their added complexity.

Why Standard Architectures Fall Short

The core difficulties are structural. A Multi-Layer Perceptron (MLP) expects a fixed-size input vector, but graphs vary in the number of nodes and edges. A Convolutional Neural Network (CNN) exploits the spatial regularity of pixel grids, applying fixed-size filters across a uniform neighborhood structure. This is a property graphs do not share, since a node's degree can range from one to several thousand with no notion of spatial direction. An RNN processes ordered sequences, but graphs have no canonical node ordering and no canonical traversal path.

Beyond these architectural mismatches, any model operating on graphs must satisfy a fundamental constraint: permutation invariance for graph-level tasks, and permutation equivariance for node-level tasks. For a graph-level output, the result must not depend on the arbitrary order in which nodes happen to be stored. For node-level outputs, reordering the nodes must reorder the per-node predictions in the same way and change nothing else. Standard MLPs and CNNs do not satisfy this property.
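
A small sketch can make the two properties concrete. The feature values, the weights, and the function names below are assumptions chosen for illustration; flattened_linear stands in for what the first layer of an MLP does to a flattened input.

```python
# Toy node features, one scalar per node (illustrative values).
features = [1.0, 2.0, 3.0]
permuted = [3.0, 1.0, 2.0]  # same nodes, stored in a different order


def sum_readout(xs):
    # Order-independent: a permutation-invariant graph-level summary.
    return sum(xs)


def flattened_linear(xs, weights=(1.0, 2.0, 3.0)):
    # Each position gets its own weight, so the result depends on
    # the order in which the nodes are stored.
    return sum(w * x for w, x in zip(weights, xs))


def per_node_double(xs):
    # The same function applied to every node: permutation equivariant.
    return [2.0 * x for x in xs]
```

Here sum_readout gives the same value for both orderings, flattened_linear does not, and per_node_double returns the same per-node results, just reordered along with the input.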

Representing Graphs Numerically

A Graph Neural Network (GNN) requires two primary inputs.

The adjacency matrix A is an N × N matrix where A[i][j] = 1 if nodes i and j are connected and 0 otherwise. For undirected graphs, it is symmetric. In practice, because most real-world graphs are sparse, storing a dense N × N matrix is wasteful. PyTorch Geometric and similar frameworks instead use a sparse COO format: an edge_index tensor of shape [2, E], where E is the number of edges and each column encodes a source-target pair. This reduces memory complexity from O(N²) to O(E), which is essential at scale.
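
As a rough sketch of the relationship between the two formats, the following plain-Python function converts a dense adjacency matrix into a COO-style pair of lists. The function name and the toy matrix are assumptions; in practice a framework tensor would hold the same data.

```python
# Dense adjacency matrix of a small undirected graph with 4 nodes
# and edges 0-1, 1-2, 2-3 (toy data).
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]


def dense_to_edge_index(adj):
    """Return [sources, targets]: one entry per nonzero adj[i][j]."""
    sources, targets = [], []
    for i, row in enumerate(adj):
        for j, value in enumerate(row):
            if value != 0:
                sources.append(i)
                targets.append(j)
    return [sources, targets]


edge_index = dense_to_edge_index(A)
# Each undirected edge appears twice, once per direction.
```

The dense matrix stores 16 entries for this graph, while edge_index stores 6 source-target pairs. The gap widens quickly as N grows and the graph stays sparse.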

The node feature matrix X has shape N × F, where each row is the feature vector of a node. In a citation network, this could be a word embedding of the paper's abstract. In a social network, it might encode age, location, or activity patterns. Together, A encodes who is connected to whom, and X encodes what each node is like. These two matrices serve as the joint input to nearly every GNN architecture. Edges and the graph as a whole can also carry features (bond types in a molecular graph, for instance), which are stored in separate tensors.

Three Levels of Graph Machine Learning Tasks

Graph tasks are organized into three categories, each requiring a different output structure from the model.

Node-level tasks involve predicting a label or property for each node. Classifying a research paper by topic, flagging a social media account as malicious, or predicting the biological function of a protein all fall into this category.

Edge-level tasks involve predicting properties of connections between pairs of nodes. Link prediction, estimating the probability that an edge exists between two nodes, is the most common form. Applications include product recommendation in e-commerce, friend suggestion on social platforms, and forecasting high-traffic routes in transportation networks. The standard setup treats existing edges as positive examples and a sample of absent edges as negative examples, learning to score node pairs by their likelihood of connection.
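
The positive and negative example setup can be sketched as follows. This is a minimal illustration assuming an undirected graph without self-loops; the function name, the fixed seed, and the toy edge list are hypothetical.

```python
import random


def sample_negative_edges(num_nodes, positive_edges, num_samples, seed=0):
    """Sample node pairs that are NOT in positive_edges (undirected, no self-loops)."""
    rng = random.Random(seed)
    positives = {frozenset(e) for e in positive_edges}
    max_negatives = num_nodes * (num_nodes - 1) // 2 - len(positives)
    if num_samples > max_negatives:
        raise ValueError("not enough absent edges to sample from")
    negatives = set()
    while len(negatives) < num_samples:
        u, v = rng.sample(range(num_nodes), 2)  # two distinct nodes
        pair = frozenset((u, v))
        if pair not in positives:
            negatives.add(pair)
    return [tuple(sorted(p)) for p in negatives]


positive_edges = [(0, 1), (1, 2), (2, 3)]
negative_edges = sample_negative_edges(
    num_nodes=5, positive_edges=positive_edges, num_samples=3
)

# Training pairs: label 1 for existing edges, 0 for sampled absent edges.
examples = [(e, 1) for e in positive_edges] + [(e, 0) for e in negative_edges]
```

A model would then be trained to assign high scores to the pairs labeled 1 and low scores to the pairs labeled 0.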

Graph-level tasks involve classifying or regressing over entire graphs. Predicting whether a molecule is toxic, estimating its solubility, or assessing whether an online community exhibits extremist communication patterns are all graph-level problems.
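
Graph-level prediction needs one fixed-size vector per graph, whatever the number of nodes. One common way to get it, assumed here purely for illustration, is to average the node embeddings (a mean readout). The function name and the toy embeddings are hypothetical.

```python
def mean_readout(node_embeddings):
    """Average node embeddings column-wise into one fixed-size graph vector."""
    num_nodes = len(node_embeddings)
    dim = len(node_embeddings[0])
    return [sum(h[d] for h in node_embeddings) / num_nodes for d in range(dim)]


# Two graphs of different sizes, each node with a 2-dimensional embedding.
small_graph = [[1.0, 0.0], [3.0, 2.0]]
large_graph = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [6.0, 2.0]]

small_vector = mean_readout(small_graph)
large_vector = mean_readout(large_graph)
```

Both graphs end up as 2-dimensional vectors that a standard classifier or regressor could consume, and the result does not depend on node order.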

The Message Passing Mechanism

The central idea behind GNNs is neighborhood aggregation, sometimes called message passing. Rather than processing a node in isolation, a GNN constructs a richer representation for it by summarizing the features of its immediate neighbors. The intuition is simple: you can learn a great deal about a node from the company it keeps.

A single GNN layer is broken into two steps. In the AGGREGATE step, feature vectors from all neighboring nodes are collected and combined into a single summary vector. This function must be permutation invariant, meaning the result should not change regardless of the order in which neighbors are processed. Common choices are sum, mean, and max. In the UPDATE step, the aggregated message is combined with the node's current representation and passed through a learnable linear transformation and a non-linearity to produce a new embedding.
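
The two steps can be sketched in plain Python. This is one possible layer under stated assumptions, not a specific published architecture: it uses mean aggregation, separate weight matrices for the node itself and for its neighbors, and ReLU as the non-linearity. The weights are fixed by hand here, whereas in a real model they would be learned.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def gnn_layer(h, neighbors, W_self, W_neigh):
    new_h = {}
    for node, h_node in h.items():
        # AGGREGATE: element-wise mean over neighbors (order-independent).
        neigh = [h[n] for n in neighbors[node]]
        if neigh:
            agg = [sum(col) / len(neigh) for col in zip(*neigh)]
        else:
            agg = [0.0] * len(h_node)
        # UPDATE: combine own state with the message, then apply ReLU.
        combined = [
            a + b for a, b in zip(matvec(W_self, h_node), matvec(W_neigh, agg))
        ]
        new_h[node] = relu(combined)
    return new_h


h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [2.0, 2.0]}
neighbors = {0: [1, 2], 1: [0], 2: [0]}
W_self = [[1.0, 0.0], [0.0, 1.0]]   # identity, for readability
W_neigh = [[0.5, 0.0], [0.0, 0.5]]  # halves the neighbor summary
h_next = gnn_layer(h, neighbors, W_self, W_neigh)
```

Every node reads from the old embeddings in h and writes to new_h, so all nodes are updated from the same snapshot of the previous layer.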

Stacking multiple layers extends the receptive field: a two-layer GNN incorporates information from two-hop neighbors, a three-layer GNN from three-hop neighbors, and so on. However, stacking too many layers introduces a well-known failure mode called over-smoothing, where node representations converge to nearly identical vectors as information diffuses uniformly across the graph. Most practical GNN architectures are deliberately shallow, two to four layers, for this reason.
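
Over-smoothing can be observed even without learnable weights. The sketch below repeatedly replaces each node's value with the mean over itself and its neighbors on a four-node path graph. The graph, the initial values, and the choice to include the node itself in the mean are assumptions for illustration.

```python
def mean_aggregate_step(values, neighbors):
    """Replace each node's value with the mean over itself and its neighbors."""
    return [
        (values[i] + sum(values[j] for j in neighbors[i])) / (1 + len(neighbors[i]))
        for i in range(len(values))
    ]


# Path graph 0-1-2-3 with very different initial scalar features.
neighbors = [[1], [0, 2], [1, 3], [2]]
values = [0.0, 0.0, 0.0, 10.0]

spreads = []  # max - min across nodes, recorded before each step
for _ in range(30):
    spreads.append(max(values) - min(values))
    values = mean_aggregate_step(values, neighbors)
spreads.append(max(values) - min(values))
```

The spread between the largest and smallest node value starts at 10 and shrinks toward zero as steps accumulate, which is the same collapse that deep stacks of aggregation layers tend to produce in embedding space.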

When choosing an aggregation function, mean is a reliable general-purpose default. Sum can be more expressive when node degree carries semantic meaning, though it may require normalization for training stability. Max is appropriate when the task depends on identifying the single most salient neighbor feature rather than a summary of all of them.
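
The trade-offs between the three aggregators show up on very small inputs. In this sketch the function name and the neighbor values are hypothetical, and each neighbor is reduced to a single scalar for readability.

```python
def aggregate(neighbor_values, how):
    if how == "sum":
        return sum(neighbor_values)
    if how == "mean":
        return sum(neighbor_values) / len(neighbor_values)
    if how == "max":
        return max(neighbor_values)
    raise ValueError(f"unknown aggregator: {how}")


few = [2.0, 2.0]              # node with two neighbors
many = [2.0, 2.0, 2.0, 2.0]   # node with four identical neighbors
outlier = [0.0, 0.0, 9.0]     # one salient neighbor among quiet ones
```

Mean returns 2.0 for both few and many, so it cannot tell the two degrees apart, while sum returns 4.0 and 8.0. On outlier, max returns 9.0 and keeps the salient value that mean dilutes to 3.0.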

Three Foundational Architectures

Graph Convolutional Networks

GCN defines a specific, fixed aggregation scheme: a degree-normalized sum of neighbor features. Self-loops are added to the graph (Â = A + I) so that a node's own features are included in its update. The normalization prevents high-degree nodes from producing disproportionately large feature vectors. The only learnable parameters are the weight matrices of the linear transformations. The aggregation weights themselves are not learned. This makes GCN simple, computationally efficient, and a strong baseline.
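
The fixed aggregation weights can be computed directly from the adjacency matrix. The sketch below assumes the usual symmetric form of the normalization, D̂^{-1/2} Â D̂^{-1/2}, where D̂ is the degree matrix of Â. The function name and the star graph are illustrative.

```python
import math


def gcn_normalized_adjacency(A):
    """Return D_hat^{-1/2} (A + I) D_hat^{-1/2} as a dense list-of-lists matrix."""
    n = len(A)
    # Add self-loops: A_hat = A + I.
    A_hat = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    # Degree of each node in A_hat (row sums).
    deg = [sum(row) for row in A_hat]
    # Symmetric normalization: entry (i, j) is divided by sqrt(deg_i * deg_j).
    return [
        [A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]


# Star graph: node 0 is connected to nodes 1, 2, 3.
A = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
A_norm = gcn_normalized_adjacency(A)
```

The hub node 0 has the largest degree, so its self-weight (0.25) is smaller than a leaf's self-weight (0.5). Multiplying A_norm by the feature matrix and then by a learned weight matrix gives the layer's pre-activation output.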

Its limitations are equally important to understand. GCN in its original formulation is transductive: it is trained on one fixed graph, with aggregation weights derived from that graph's full adjacency structure, and is not designed to generalize to new nodes added after the fact without retraining. It is also isotropic: neighbors are weighted only by degree, not by how relevant their features are. And like other GNNs, it is susceptible to over-smoothing when stacked deeply.

GraphSAGE

GraphSAGE addresses both of GCN's primary constraints. First, instead of aggregating over the full neighborhood, it samples a fixed number of neighbors at each layer. This makes the computation tractable for very large graphs and enables mini-batch training. Second, the aggregation function is a configurable component rather than a fixed rule, and some variants carry their own learnable parameters. Options include mean, max-pooling, or an LSTM over a randomly permuted set of neighbors.
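
Neighbor sampling can be sketched in a few lines. This is a simplified illustration rather than the reference implementation: it samples uniformly without replacement and keeps all neighbors when a node has fewer than the requested number. The function name, the toy graph, and the seed are assumptions.

```python
import random


def sample_neighbors(neighbors, node, num_samples, rng):
    """Return at most num_samples neighbors of node, chosen uniformly."""
    candidates = neighbors[node]
    if len(candidates) <= num_samples:
        return list(candidates)
    return rng.sample(candidates, num_samples)


# Star graph: node 0 is a hub connected to five leaf nodes.
neighbors = {0: [1, 2, 3, 4, 5]}
neighbors.update({leaf: [0] for leaf in range(1, 6)})

rng = random.Random(0)
sampled_hub = sample_neighbors(neighbors, 0, num_samples=3, rng=rng)
sampled_leaf = sample_neighbors(neighbors, 1, num_samples=3, rng=rng)
```

The hub contributes only three neighbors per layer regardless of its true degree, which bounds the cost of computing one node's embedding.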

Most importantly, GraphSAGE is inductive. Rather than memorizing embeddings for specific nodes, it learns a general function that maps a node's local neighborhood structure and features to an embedding vector. This function can be applied to any node, including nodes never seen during training, making GraphSAGE well-suited for dynamic graphs and production systems where the graph evolves over time.

Graph Attention Networks

GCN assigns aggregation weights based solely on node degree, implying that all neighbors are equally informative. In many graphs, this assumption is wrong. In a citation network, a reference from a foundational paper should carry more influence than one from an obscure workshop contribution.

GAT addresses this by introducing a learnable attention mechanism. For each pair of connected nodes, an attention coefficient is computed by applying a shared linear transformation to both embeddings, concatenating the results, projecting the concatenation to a scalar score with a learnable attention vector, and passing that score through a LeakyReLU. These raw scores are then normalized across each node's neighborhood using a softmax, producing coefficients that sum to one. The aggregation step is a weighted sum using these coefficients rather than a uniform average.
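
The scoring and normalization steps can be sketched for a single node. The embeddings are assumed to be already linearly transformed, the attention vector a is fixed by hand rather than learned, and the LeakyReLU slope of 0.2 is a commonly used value. All names are illustrative.

```python
import math


def leaky_relu(x, negative_slope=0.2):
    return x if x > 0 else negative_slope * x


def attention_coefficients(h_i, neighbor_hs, a):
    """Softmax-normalized attention of node i over its neighbors."""
    scores = []
    for h_j in neighbor_hs:
        concat = h_i + h_j  # list concatenation: [h_i || h_j]
        raw = sum(w * x for w, x in zip(a, concat))
        scores.append(leaky_relu(raw))
    # Softmax over the neighborhood, shifted by the max for stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


h_i = [1.0, 0.0]
neighbor_hs = [[1.0, 1.0], [0.0, 0.0], [-1.0, 2.0]]
a = [0.5, 0.5, 1.0, -1.0]  # hypothetical fixed values; learned in practice
alphas = attention_coefficients(h_i, neighbor_hs, a)

# Aggregation: attention-weighted sum of the neighbor embeddings.
new_h_i = [sum(al * h[d] for al, h in zip(alphas, neighbor_hs)) for d in range(2)]
```

The first two neighbors receive equal raw scores and therefore equal coefficients, while the third scores lower and contributes less to the weighted sum.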

GAT also supports multi-head attention: multiple independent attention mechanisms are run in parallel and their outputs are concatenated or averaged. For example, with heads=4, out_channels=16, and concat=True, the output dimension passed to the next layer is 4 × 16 = 64. This adds expressivity and stabilizes training, at the cost of higher computational demand.
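
The dimension arithmetic for the two ways of combining heads can be written out as a small sketch. The helper names are hypothetical; the parameter names mirror the heads, out_channels, and concat settings mentioned above.

```python
def gat_output_dim(heads, out_channels, concat):
    """Width of a multi-head attention layer's output."""
    return heads * out_channels if concat else out_channels


def combine_heads(head_outputs, concat):
    """Merge per-head output vectors by concatenation or element-wise averaging."""
    if concat:
        return [x for head in head_outputs for x in head]
    return [sum(col) / len(head_outputs) for col in zip(*head_outputs)]
```

Concatenation multiplies the width by the number of heads, which the next layer's input size has to match. Averaging keeps the width at out_channels, which is why it is often preferred for a final prediction layer.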

Choosing the Right Architecture

The choice among these architectures is largely determined by the structure of the problem.

GCN is the appropriate starting point for small, static, in-memory graphs. It is fast to implement, straightforward to tune, and competitive on standard benchmarks such as Cora and PubMed.

GraphSAGE is the right choice when the graph is too large for full-batch training, when new nodes will be added after training, or when the system needs to generalize across entirely new graphs. Its inductive nature and sampling-based training are designed precisely for these scenarios.

GAT is worth the additional complexity when neighbor importance is highly variable and central to the task. Protein interaction networks, knowledge graphs, and any domain where relationships are not uniform in relevance are strong candidates.

Figure 1: Induction, Transduction and Deduction [1]

Implementation Notes

A few recurring issues are worth addressing directly for anyone moving from theory to practice.

On training correctness: if model weights are not updating, the first place to check is whether loss.backward() and optimizer.step() are both being called, and whether gradients are flowing through the computation graph as expected.

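On evaluation: model.eval() and torch.no_grad() should be used together during inference, but they do different jobs. model.eval() switches layers such as Dropout to inference behavior. Without it, Dropout remains active in training mode, introducing randomness into evaluation and causing inconsistent accuracy measurements. torch.no_grad() disables gradient tracking, which saves memory and computation but does not by itself change what Dropout does.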

On the mask pattern for transductive learning: in single-graph datasets such as Cora, the full graph participates in the forward pass. This allows neighborhood information to flow across all nodes. However, the loss is computed only over nodes where train_mask is True. Removing this mask pulls nodes outside the training set into the loss. For unlabeled nodes, the model would compute gradients against placeholder label values, which is invalid. For validation or test nodes, it constitutes data leakage.
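
The mask pattern can be illustrated without any framework. In this sketch the predicted probabilities stand in for the output of a forward pass over the whole graph. The function name, the values, and the mask are made up for the example.

```python
import math


def masked_nll(log_probs, labels, mask):
    """Mean negative log-likelihood over nodes where mask is True."""
    losses = [-log_probs[i][labels[i]] for i in range(len(labels)) if mask[i]]
    return sum(losses) / len(losses)


# Predicted class probabilities for 4 nodes and 2 classes (toy values).
probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.3, 0.7]]
log_probs = [[math.log(p) for p in row] for row in probs]
labels = [0, 1, 0, 0]
train_mask = [True, True, False, False]

train_loss = masked_nll(log_probs, labels, train_mask)
```

All four nodes have predictions, but only nodes 0 and 1 contribute to train_loss. Changing the labels of nodes 2 and 3 leaves the loss untouched, which is exactly what keeps validation and test labels out of training.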

On multi-label classification: when a node can belong to multiple categories simultaneously, such as a paper covering both machine learning and computer vision, the correct configuration is independent Sigmoid activations with Binary Cross-Entropy loss, not Softmax. Softmax constrains probabilities to sum to one, which is appropriate for mutually exclusive classes. Sigmoid treats each label as an independent binary decision.
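
The difference between the two output configurations is visible on a single example. The logits and targets below are hypothetical: a paper that belongs to the first two of three topics.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(probs, targets):
    """Mean BCE over labels; each label is an independent yes/no decision."""
    return -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p) for p, t in zip(probs, targets)
    ) / len(targets)


logits = [3.0, 3.0, -3.0]  # hypothetical scores for three topics
targets = [1, 1, 0]        # the paper belongs to the first two topics

softmax_probs = softmax(logits)
sigmoid_probs = [sigmoid(z) for z in logits]
loss = binary_cross_entropy(sigmoid_probs, targets)
```

Softmax has to split its probability mass between the two correct topics, so neither can exceed 0.5. The sigmoid outputs can both be above 0.9 at once, and the Binary Cross-Entropy loss is correspondingly small.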

Conclusion

Graph neural networks remain an active and rapidly evolving research area. Mitigating over-smoothing through residual connections, handling heterogeneous graphs with type-aware message passing, and scaling to billion-node graphs through hierarchical sampling are among the open problems attracting the most attention. The architectures covered here (GCN, GraphSAGE, and GAT) represent the foundation.

Resources

[1]
