Depth-First Search

26 Jun, 2019

This article is an overview of Depth-First Search (DFS), a very common graph discovery algorithm. We'll also discuss some iterative adaptations.

Traversal

Here's a bare minimum description of the traversal.

dfs(node) {
    foreach c in node.children :
        dfs(c)
}

The above presumes that each node in the graph leads only to either undiscovered nodes or the visited node is itself terminal, i.e. it leads nowhere. In other words, the simple traversal above assumes that there are no cycles or self-linking.

When a node is visited, it can be processed, before its children or after.

dfs(node) {
    before_children_processing(node)      // optional
    foreach c in node.children :
        dfs(c)
    after_children_processing(node)       // optional
}

If you need to account for the possibility that a node can link to a previously visited node, then you must keep track of the discovery status of nodes. Let's say that we set the initial discovery status of all nodes to nil, here's an example that accounts for visited nodes.

discovery(graph) {
    initialize(graph) // presumably sets all node.status to nil
    foreach node in graph | if node.status==nil:
        dfs(node)
}

dfs(node) {
    node.status = DISCOVERED
    before_children_processing(node)      // optional
    foreach c in node.children {
        if c.status==DISCOVERED:
            continue                      // skip to next node
        dfs(c)
    }
    after_children_processing(node)       // optional
}

Processing

In practice, the processing of a node depends on the business requirements and usually one of "before-children" or "after-children" strategy is chosen, rarely both. Of the two, "before-children" tends to be simpler and should probably be favored, if possible. Its iterative adaptation is also relatively easier to reason about (as we'll see later). "After-children" is usually preferred only in situations where current processing really depends on the outcome of having previously completed the processing of descendant nodes.

Node discovery and traversal cycles

TODO: Describe cycles

In more rigorous or complex situations, the discovery state of nodes may be used to detect cycles or repeat visits of nodes during traversal. The three commonly used states are:

nil (aka undiscovered, not-visited, unknown, white, initialized),
DISCOVERED (aka partially-visited, known, gray, started),
and PROCESSED (aka fully-visited, completed, black, done).

    discovery(graph) {
        initialize(graph) // presumably sets all node.status to nil
        foreach node in graph | if node.status==nil:
            dfs(node)
    }

    dfs(node) {
        before_children_processing(node)         // optional
        node.status = DISCOVERED
        foreach c in node.children {
            if c.status == DISCOVERED:
                handle_cycle(node, c)            // e.g. stop traversal, log error, skip
            else if c.status == PROCESSED:
                handle_multiple_parents(node, c) // e.g. skip, log
            else if node.status == nil:
                dfs(c)
        }
        after_children_processing(node)          // optional
        node.status = PROCESSED
    }

In the above, the status is set only on the parent, and tested only on the children. Other approaches may choose to do things in a somewhat different arrangement. I find the above cleaner and practical. When the function is entered the node is DISCOVERED; and upon exiting, the node is PROCESSED. For its children, the different status checks then have specific implications:

if the child's status is DISCOVERED, it means that it's being revisited while in the midst of its own discovery. That is, it has been previously passed to the dfs() routine and one of its descendants is somehow linking back to it, before its initial discovery could be concluded and could mark it as PROCESSED. This is a cycle (aka a back-edge). In many practical problems it's analogous to an infinite loop and implies that some provisions must be made for that situation (e.g. skipping the node, or ending the discovery altogether with an error).
if the child's status is PROCESSED, it means that it was previously discovered and completely processed. Being revisited as a child of the current node implies that either the two share an earlier ancestor or parent, or the child was selected for discovery before the current node, directly from the graph (see the discovery() function above). Although this is not considered a cycle, it still usually requires special handling (e.g. skipping processing, logging the discovery, etc).
otherwise, if the node is still undiscovered (i.e. its discovery status is nil), the dfs() function can safely be called on it.

Iterative DFS

The various iterative adaptations to DFS make use of stacks to mimic the traversal properties of their recursive counterparts. As a reminder, a stack is a last-in/first-out (LIFO) structure, meaning that the last item to be inserted will be first popped out.

When processing a node before its children

Iterative DFS adaptations are simpler when children nodes are expected to be processed after their parent.

If there's no need to keep track of discovery status (because a node can only link to undiscovered children or be terminal), it can be as simple as:

iterative_dfs(starting_node) {
    node_stack = Stack()
    push(node_stack, starting_node)
    while not empty(node_stack) {
        node = pop(node_stack)
        process(node)
        for c in node.children:
            push(node_stack, c)
    }
}

If processing a node depends on the result of its parent's processing, then that can also be tracked on a stack. For example

iterative_dfs(starting_node) {
    node_stack = Stack()
    result_stack = Stack()
    push(node_stack, starting_node)
    push(result_stack, nil)
    while not empty(node_stack) {
        node = pop(node_stack)
        parent_result = pop(result_stack)
        result = process(node, parent_result)
        for c in node.children:
            push(node_stack, c)
            push(result_stack, result)
    }
}

Or for a more succinct approach, if your programming language allows such idioms:

iterative_dfs(starting_node) {
    stack = Stack()
    push(stack, {starting_node, nil})
    while not empty(stack) {
        {node, parent_result} = pop(stack)
        result = process(node, parent_result)
        for c in node.children:
            push(stack, {c, result})
    }
}

If you need to account for the possibilities that a node may be visited multiple times, track the discovery status as seen previously.

iterative_dfs(starting_node) {
    stack = Stack()
    push(stack, starting_node)
    while not empty(stack) {
        node = pop(stack)
        process(node)
        node.status = DISCOVERED 
        for c in node.children {
            if c.status == DISCOVERED:
                // handle revisit
                log("Ignoring: Node {c} has already been visited")
                continue
            // if c.status == nil:
            push(stack, c)
        }
    }
}

Processing a node after its children and cycle detection

The previous iterative adaptations successfully emulate most cases of "before children" processing. Upon closer inspection, however, they're not an exact analog to their recursive counterparts. They're only equiped to track multiple visits of a node, and specifically in the situation where it's processed after its parent. They're not very helpful if you need to detect cycles in the graph traversal, as that requires a closer, more faithful, emulation of the recursive algorithm.

Observe for instance that in the recursive version where we track both DISCOVERED and PROCESSED states, a node is always marked PROCESSED before its parent. This is crucial to distinguish between simple repeat visits and cycles.

In the iterative adaptations we've seen so far, there's no simple way to achieve this. We need a more complete version that accounts for the discovery status and for the complete processing of nodes before their parent. It goes as follows:

discovery(graph) {
    foreach node in graph:
        if node.status==nil:
            iterative_dfs(node)
}

iterative_dfs(starting_node) {
    stack = Stack()
    push(stack, starting_node)
    while not empty(stack) {
        node = pop(stack)
        if node.status==nil:
            node.status = DISCOVERED            // change node's discovery status
            push(stack, node)                   // then place it back on the stack
            foreach child in node.children:     // iterate through children as seen previously
                if child.status==DISCOVERED:
                    exit("Error: cycle detected")
                else if child.status==PROCESSED:
                    log("Node {child} is being revisited")
                else if child.status==nil:
                    push(stack, child)
        else if node.status==DISCOVERED:
            process(parent)
            node.status = PROCESSED
        else if node.status==PROCESSED:
            pass                                // do nothing                   
    }
}

To solidify our understanding and intuition of this traversal, we'll go through a more descriptive approach, where we track the node on a stack, along with the intended next action. There are three possible actions nil, BEGIN_DISCOVERY, and END_PROCESSING. In the recursive version of DFS, they respectively correspond to skipping a call to the dfs() function when a node is already processed, entrance into the function where the node is first discovered, and exit from the function where the node is processed.

The latter two actions thus trigger a specific change of status on the node, which must be consistent with its current status. Its children nodes can then be tested to detect cycles and revisits, as seen previously.

iterative_dfs(starting_node):
    stack = Stack()
    push(stack, {starting_node, BEGIN_DISCOVERY})
    while stack is not empty:
        {node, next_action} = pop(stack)
        if next_action is BEGIN_DISCOVERY:
            node.status = DISCOVERED
            push(stack, {node, END_PROCESSING})
            for child in node.children:
                if child.status is DISCOVERED:
                    exit("Error: cycle detected")
                else if child.status is PROCESSED:
                    log("Node {child} is being revisited")
                else if child.status is nil:
                    push(stack, {child, BEGIN_DISCOVERY})
        else if next_action is END_PROCESSING:
            after_children_processing(node)
            node.status = PROCESSED
        else if next_action is nil:
            do nothing

The next_action to apply has a 1:1 relationship with the current status of the node, and therefore can simply be inferred from it. If next_action is:

BEGIN_DISCOVERY, it implies that the node is currently undiscovered. That is, node.status should currently be nil and changed to DISCOVERED.
END_PROCESSING, it implies that node.status should currently be DISCOVERED and changed to PROCESSED.
nil, it implies that node.status should currently be PROCESSED and no longer changed.

Substituting tests on next_action with simply their node.status equivalent, we get:

iterative_dfs(starting_node):
    stack = Stack()
    push(stack, starting_node)
    while stack is not empty:
        node = pop(stack)
        if node.status is nil:
            node.status = DISCOVERED
            push(stack, node)
            for child in node.children:
                if child.status is DISCOVERED:
                    exit('Error: cycle detected')
                else if child.status is PROCESSED:
                    log("Node {child} is being revisited")
                else if child.status is nil:
                    push(stack, child)
        else if node.status is DISCOVERED:
            after_children_processing(node)
            node.status = PROCESSED
        else if node.status is PROCESSED:
            do nothing

In the above, when a node is popped from the stack at the onset of a loop, its discovery status is verified.

If it's just about to be DISCOVERED, its status is thus updated and the node is placed back on the stack. Its children are then iterated through and added to the stack when appropriate. This arrangement causes children nodes to be popped from the stack before the parent (during later iterations of the while loop) and thus also processed first.

If the popped node is currently DISCOVERED, it means that all its children were popped from the stack and processed. It can also be processed.

If the popped node is already PROCESSED, it implies a simple revisiting of a fully discovered node, which can be ignored.

A slightly different but equivalent approach is to peek on the stack (rather than popping) and only later, pop PROCESSED nodes.

iterative_dfs(starting_node):
    stack = Stack()
    push(stack, starting_node)
    while stack is not empty:
        node = peek(stack)              // just peeking, not popping.
        if node.status is nil:
            node.status = DISCOVERED
            for child in node.children:
                if child.status is DISCOVERED:
                    exit('Error: cycle detected')
                else if child.status is PROCESSED:
                    log("Node {child} is being revisited")
                else if child.status is nil:
                    push(stack, child)
        else if node.status is DISCOVERED:
            after_children_processing(node)
            node.status = PROCESSED
        else if node.status is PROCESSED:
            pop(stack)                  // popping