Linearization of a directed acyclic graph
Introduction
The purpose of linearization (aka topological sorting) is to clarify the order of a set of items based on their relationships. For example, a schedule of events can be produced based on their chronology, a course curriculum prepared based on prerequisites, the installation of packages managed based on dependencies, or a child class's method resolved based on its ancestry mapping.
All directed acyclic graphs, or DAGs (directed graphs with no cycles), can be linearized. Conversely, graphs that contain cycles cannot be linearized.
Linearization steps:
1. Traverse the graph with a Depth First Search (DFS).
2. Order the nodes by decreasing exit time.
3. A node reached again before it has been completely processed implies a cycle.
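For concreteness, here is a minimal, self-contained Python sketch of these three steps (the names toposort and State are illustrative, not from any standard library). It assumes edges point from an item to the items that must come after it; the two approaches below adapt the same idea to different edge directions.

from enum import Enum

class State(Enum):
    UNDISCOVERED = 0
    DISCOVERED = 1   # on the current DFS path, not yet fully processed
    PROCESSED = 2    # completely processed (exit time recorded)

def toposort(graph):
    """graph: dict mapping each item to the items that must come after it."""
    state = {node: State.UNDISCOVERED for node in graph}
    exit_order = []  # nodes appended in order of increasing exit time

    def dfs(node):
        state[node] = State.DISCOVERED
        for nxt in graph[node]:
            if state[nxt] is State.DISCOVERED:   # step 3: back edge means a cycle
                raise ValueError("Cycle detected: linearization impossible.")
            if state[nxt] is State.UNDISCOVERED:
                dfs(nxt)                         # step 1: DFS
        state[node] = State.PROCESSED
        exit_order.append(node)                  # record exit time

    for node in graph:
        if state[node] is State.UNDISCOVERED:
            dfs(node)

    return list(reversed(exit_order))            # step 2: decreasing exit time

# 'A' must come before 'B' and 'C'; 'B' and 'C' before 'D'.
print(toposort({'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}))
# prints ['A', 'C', 'B', 'D']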
Linearization problem solving strategy
Recognizing a linearization problem
Identifying a linearization problem is an important first step toward efficient resolution. The description usually points to items needing to be arranged, processed, or discovered in a specific sequence (e.g. a schedule, itinerary, recipe, etc.). Note, however, that linearization differs from ordinary sorting. In the latter, items usually already have a specified attribute with a given magnitude upon which a sorting algorithm can be applied (e.g. size, length, weight, etc.). Linearization problems instead involve temporal, sequential, or hierarchical relationships between items, where one item must be processed or accessed in some way before another.
Seeing the graph
Linearization is a graph problem. The nodes are the items and the edges are the hierarchical relationships between them. The path to a solution thus involves a proper representation of these relationships in a data structure. Which nodes are going to be the parents and which will be the children? Although the relationship can be represented either way (i.e. either node in a relationship can be the parent or the child), it's important to understand the implications of the representation flowing one way or the other, and how that direction can affect the solution.
Approach 1: X depends on Y
In this representation of the relationship, a node points to its dependencies. That is, a node points to the nodes it depends on, which must be processed before it.
X depends on [Y, ...]
A => [B, C, D]
B => [C]
C => [E]
D => [F]
E => [F]
F => []
Using some basic data structures, the relationships could be represented as follows:
values = ['A', 'B', 'C', 'D', 'E', 'F']
edges = [
('A','B'), ('A','C'), ('A','D'),
('B','C'), ('C','E'), ('D','F'),
('E','F')
]
And building the graph could go along these lines.
make_graph(values, edges) {
    graph = {}
    for name in values:
        node = Node(name=name, status=nil, children=[])
        graph[name] = node
    for parent, child in edges:
        p = graph[parent]
        c = graph[child]
        push(p.children, c)
    return graph
}
graph = make_graph(values, edges)
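As a point of reference, a runnable Python version of this construction might look like the following sketch; the Node class mirrors the pseudocode's fields (name, status, children) but is otherwise an assumption.

class Node:
    def __init__(self, name):
        self.name = name
        self.status = None    # later set to DISCOVERED or PROCESSED
        self.children = []    # in this representation: the node's dependencies

def make_graph(values, edges):
    # One node per value, keyed by name.
    graph = {name: Node(name) for name in values}
    # Each edge (parent, child) means "parent depends on child".
    for parent, child in edges:
        graph[parent].children.append(graph[child])
    return graph

values = ['A', 'B', 'C', 'D', 'E', 'F']
edges = [
    ('A', 'B'), ('A', 'C'), ('A', 'D'),
    ('B', 'C'), ('C', 'E'), ('D', 'F'),
    ('E', 'F'),
]
graph = make_graph(values, edges)
print([c.name for c in graph['A'].children])  # ['B', 'C', 'D']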
To linearize the graph we iterate through its nodes, applying a DFS traversal to the undiscovered ones. Once all of a node's children have been completely processed, all of its dependencies are accounted for and linearized, and the node itself becomes a viable candidate for linearization. A key insight here is that a node with no children has no dependencies in this representation; it is inherently linearized and marks the end of the DFS traversal for that branch. This is the base case that stops the recursion. We also use a DFS property to detect cycles: reaching a node that has been discovered but not yet completely processed means we have followed a back edge, so the graph contains a cycle.
linearization(graph) {
    order = []
    for node in graph {
        if node.status is PROCESSED:
            continue  // already linearized by an earlier traversal
        dfs(node, order)
    }
    return order
}

dfs(node, order) {
    node.status = DISCOVERED
    for c in node.children {
        if c.status is PROCESSED:
            log("Cross-edge. Skipping.")
            continue
        if c.status is DISCOVERED:
            error("Cycle detected in graph. Impossible to linearize.")
            exit
        dfs(c, order)
    }
    push(order, node)  // all of this node's dependencies are already in order
    node.status = PROCESSED
}
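Continuing the Python sketch from the previous section (reusing Node, make_graph, values, and edges), the DFS linearization with cycle detection could look roughly like this:

DISCOVERED, PROCESSED = "DISCOVERED", "PROCESSED"

def linearize(graph):
    order = []
    for node in graph.values():
        if node.status != PROCESSED:
            dfs(node, order)
    return order

def dfs(node, order):
    node.status = DISCOVERED
    for c in node.children:
        if c.status == PROCESSED:
            continue                    # cross/forward edge: already linearized
        if c.status == DISCOVERED:
            raise ValueError("Cycle detected in graph. Impossible to linearize.")
        dfs(c, order)
    order.append(node)                  # every dependency of node is already in order
    node.status = PROCESSED

graph = make_graph(values, edges)
print([n.name for n in linearize(graph)])  # ['F', 'E', 'C', 'B', 'D', 'A']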
Approach 2: X has dependent Y
This is the reverse of the previous relationship. Here the graph is organized so that a node points to the nodes that immediately depend on it; that is, a node must be completely processed before its dependents.
X has dependent [Y, ...]
A => []
B => [A]
C => [A, B]
D => [A]
E => [C]
F => [D, E]
values = ['A', 'B', 'C', 'D', 'E', 'F']
edges = [
('B','A'), ('C','A'), ('C','B'),
('D','A'), ('E','C'), ('F','D'),
('F','E')
]
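Note that this edge list is just the Approach 1 list with each pair flipped, which a one-liner makes explicit (assuming edges still holds the Approach 1 pairs):

edges = [(dependency, dependent) for dependent, dependency in edges]  # "X depends on Y" becomes "Y has dependent X"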
As in the previous approach, we need to determine which nodes have no dependencies. We could use a DFS here as well, but instead we'll keep a count of each node's dependencies and decrement it as those dependencies are processed and removed from the board. The following implementation of this algorithm is iterative.
graph = make_graph(values, edges)
for node in graph:
    node.dependencies = 0   // how many dependencies are still unprocessed
for node in graph:
    for c in node.children:
        c.dependencies += 1
Now that we have a count of every node's dependencies, we collect all nodes that have no dependencies and place them in a queue (or a stack). There must be at least one node with no dependencies; otherwise the graph has a cycle and linearization is not possible.
q = Queue()
for node in graph:
    if node.dependencies == 0:
        push(q, node)
if empty(q):
    error("Every node has a dependency! Cycle detected.")
Now we process the queue, adding each node to the linearized list and removing a dependency count from each of its children (its dependents). A child whose count reaches zero joins the queue in turn. We also count down from the total number of nodes; any nodes left over when the queue is empty indicate a cycle.
linearized = []
node_count = size(graph)
while not empty(q) {
    node = pop(q)
    for c in node.children {
        c.dependencies -= 1
        if c.dependencies == 0:
            push(q, c)
        if c.dependencies < 0:
            error("Cycle detected: linearization impossible.")
            exit
    }
    push(linearized, node)
    node_count -= 1
}
if node_count > 0:
    error("Some nodes remain unprocessed! Cycle detected.")
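Putting Approach 2 together as runnable Python gives something like the sketch below; this is essentially Kahn's algorithm, and the name linearize_by_indegree is illustrative.

from collections import deque

def linearize_by_indegree(values, edges):
    """edges are (x, y) pairs meaning "x has dependent y", i.e. x must come before y."""
    children = {v: [] for v in values}        # node -> its immediate dependents
    dependencies = {v: 0 for v in values}     # node -> number of unprocessed dependencies
    for parent, child in edges:
        children[parent].append(child)
        dependencies[child] += 1

    # Seed the queue with every node that depends on nothing.
    q = deque(v for v in values if dependencies[v] == 0)
    if not q:
        raise ValueError("Every node has a dependency! Cycle detected.")

    linearized = []
    while q:
        node = q.popleft()
        linearized.append(node)
        for c in children[node]:
            dependencies[c] -= 1              # one dependency of c is now processed
            if dependencies[c] == 0:
                q.append(c)

    if len(linearized) < len(values):
        raise ValueError("Some nodes remain unprocessed! Cycle detected.")
    return linearized

values = ['A', 'B', 'C', 'D', 'E', 'F']
edges = [
    ('B', 'A'), ('C', 'A'), ('C', 'B'),
    ('D', 'A'), ('E', 'C'), ('F', 'D'),
    ('F', 'E'),
]
print(linearize_by_indegree(values, edges))  # ['F', 'D', 'E', 'C', 'B', 'A']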