Depth-First Search
This article is an overview of Depth-First Search (DFS), a very common graph discovery algorithm. We'll also discuss some iterative adaptations.
Traversal
Here's a bare minimum description of the traversal.
dfs(node) {
foreach c in node.children :
dfs(c)
}
The above presumes that each node in the graph leads only to either undiscovered nodes or the visited node is itself terminal, i.e. it leads nowhere. In other words, the simple traversal above assumes that there are no cycles or self-linking.
When a node is visited, it can be processed, before its children or after.
dfs(node) {
before_children_processing(node) // optional
foreach c in node.children :
dfs(c)
after_children_processing(node) // optional
}
If you need to account for the possibility that a node can link to a previously visited node, then you must keep track of the discovery status of nodes. Let's say that we set the initial discovery status of all nodes to nil
, here's an example that accounts for visited nodes.
discovery(graph) {
initialize(graph) // presumably sets all node.status to nil
foreach node in graph | if node.status==nil:
dfs(node)
}
dfs(node) {
node.status = DISCOVERED
before_children_processing(node) // optional
foreach c in node.children {
if c.status==DISCOVERED:
continue // skip to next node
dfs(c)
}
after_children_processing(node) // optional
}
Processing
In practice, the processing of a node depends on the business requirements and usually one of "before-children" or "after-children" strategy is chosen, rarely both. Of the two, "before-children" tends to be simpler and should probably be favored, if possible. Its iterative adaptation is also relatively easier to reason about (as we'll see later). "After-children" is usually preferred only in situations where current processing really depends on the outcome of having previously completed the processing of descendant nodes.
Node discovery and traversal cycles
TODO: Describe cycles
In more rigorous or complex situations, the discovery state of nodes may be used to detect cycles or repeat visits of nodes during traversal. The three commonly used states are:
nil
(aka undiscovered, not-visited, unknown, white, initialized),DISCOVERED
(aka partially-visited, known, gray, started),- and
PROCESSED
(aka fully-visited, completed, black, done).
discovery(graph) {
initialize(graph) // presumably sets all node.status to nil
foreach node in graph | if node.status==nil:
dfs(node)
}
dfs(node) {
before_children_processing(node) // optional
node.status = DISCOVERED
foreach c in node.children {
if c.status == DISCOVERED:
handle_cycle(node, c) // e.g. stop traversal, log error, skip
else if c.status == PROCESSED:
handle_multiple_parents(node, c) // e.g. skip, log
else if node.status == nil:
dfs(c)
}
after_children_processing(node) // optional
node.status = PROCESSED
}
In the above, the status is set only on the parent, and tested only on the children. Other approaches may choose to do things in a somewhat different arrangement. I find the above cleaner and practical. When the function is entered the node is DISCOVERED
; and upon exiting, the node is PROCESSED
. For its children, the different status checks then have specific implications:
- if the child's status is
DISCOVERED
, it means that it's being revisited while in the midst of its own discovery. That is, it has been previously passed to thedfs()
routine and one of its descendants is somehow linking back to it, before its initial discovery could be concluded and could mark it asPROCESSED
. This is a cycle (aka a back-edge). In many practical problems it's analogous to an infinite loop and implies that some provisions must be made for that situation (e.g. skipping the node, or ending the discovery altogether with an error). - if the child's status is
PROCESSED
, it means that it was previously discovered and completely processed. Being revisited as a child of the current node implies that either the two share an earlier ancestor or parent, or the child was selected for discovery before the current node, directly from the graph (see thediscovery()
function above). Although this is not considered a cycle, it still usually requires special handling (e.g. skipping processing, logging the discovery, etc). - otherwise, if the node is still undiscovered (i.e. its discovery status is
nil
), thedfs()
function can safely be called on it.
Iterative DFS
The various iterative adaptations to DFS make use of stacks to mimic the traversal properties of their recursive counterparts. As a reminder, a stack is a last-in/first-out (LIFO) structure, meaning that the last item to be inserted will be first popped out.
When processing a node before its children
Iterative DFS adaptations are simpler when children nodes are expected to be processed after their parent.
If there's no need to keep track of discovery status (because a node can only link to undiscovered children or be terminal), it can be as simple as:
iterative_dfs(starting_node) {
node_stack = Stack()
push(node_stack, starting_node)
while not empty(node_stack) {
node = pop(node_stack)
process(node)
for c in node.children:
push(node_stack, c)
}
}
If processing a node depends on the result of its parent's processing, then that can also be tracked on a stack. For example
iterative_dfs(starting_node) {
node_stack = Stack()
result_stack = Stack()
push(node_stack, starting_node)
push(result_stack, nil)
while not empty(node_stack) {
node = pop(node_stack)
parent_result = pop(result_stack)
result = process(node, parent_result)
for c in node.children:
push(node_stack, c)
push(result_stack, result)
}
}
Or for a more succinct approach, if your programming language allows such idioms:
iterative_dfs(starting_node) {
stack = Stack()
push(stack, {starting_node, nil})
while not empty(stack) {
{node, parent_result} = pop(stack)
result = process(node, parent_result)
for c in node.children:
push(stack, {c, result})
}
}
If you need to account for the possibilities that a node may be visited multiple times, track the discovery status as seen previously.
iterative_dfs(starting_node) {
stack = Stack()
push(stack, starting_node)
while not empty(stack) {
node = pop(stack)
process(node)
node.status = DISCOVERED
for c in node.children {
if c.status == DISCOVERED:
// handle revisit
log("Ignoring: Node {c} has already been visited")
continue
// if c.status == nil:
push(stack, c)
}
}
}
Processing a node after its children and cycle detection
The previous iterative adaptations successfully emulate most cases of "before children" processing. Upon closer inspection, however, they're not an exact analog to their recursive counterparts. They're only equiped to track multiple visits of a node, and specifically in the situation where it's processed after its parent. They're not very helpful if you need to detect cycles in the graph traversal, as that requires a closer, more faithful, emulation of the recursive algorithm.
Observe for instance that in the recursive version where we track both DISCOVERED
and PROCESSED
states, a node is always marked PROCESSED
before its parent. This is crucial to distinguish between simple repeat visits and cycles.
In the iterative adaptations we've seen so far, there's no simple way to achieve this. We need a more complete version that accounts for the discovery status and for the complete processing of nodes before their parent. It goes as follows:
discovery(graph) {
foreach node in graph:
if node.status==nil:
iterative_dfs(node)
}
iterative_dfs(starting_node) {
stack = Stack()
push(stack, starting_node)
while not empty(stack) {
node = pop(stack)
if node.status==nil:
node.status = DISCOVERED // change node's discovery status
push(stack, node) // then place it back on the stack
foreach child in node.children: // iterate through children as seen previously
if child.status==DISCOVERED:
exit("Error: cycle detected")
else if child.status==PROCESSED:
log("Node {child} is being revisited")
else if child.status==nil:
push(stack, child)
else if node.status==DISCOVERED:
process(parent)
node.status = PROCESSED
else if node.status==PROCESSED:
pass // do nothing
}
}
To solidify our understanding and intuition of this traversal, we'll go through a more descriptive approach, where we track the node on a stack, along with the intended next action. There are three possible actions nil
, BEGIN_DISCOVERY
, and END_PROCESSING
. In the recursive version of DFS, they respectively correspond to skipping a call to the dfs()
function when a node is already processed, entrance into the function where the node is first discovered, and exit from the function where the node is processed.
The latter two actions thus trigger a specific change of status on the node, which must be consistent with its current status. Its children nodes can then be tested to detect cycles and revisits, as seen previously.
iterative_dfs(starting_node):
stack = Stack()
push(stack, {starting_node, BEGIN_DISCOVERY})
while stack is not empty:
{node, next_action} = pop(stack)
if next_action is BEGIN_DISCOVERY:
node.status = DISCOVERED
push(stack, {node, END_PROCESSING})
for child in node.children:
if child.status is DISCOVERED:
exit("Error: cycle detected")
else if child.status is PROCESSED:
log("Node {child} is being revisited")
else if child.status is nil:
push(stack, {child, BEGIN_DISCOVERY})
else if next_action is END_PROCESSING:
after_children_processing(node)
node.status = PROCESSED
else if next_action is nil:
do nothing
The next_action
to apply has a 1:1 relationship with the current status of the node, and therefore can simply be inferred from it. If next_action
is:
BEGIN_DISCOVERY
, it implies that the node is currently undiscovered. That is,node.status
should currently benil
and changed toDISCOVERED
.END_PROCESSING
, it implies thatnode.status
should currently beDISCOVERED
and changed toPROCESSED
.nil
, it implies thatnode.status
should currently bePROCESSED
and no longer changed.
Substituting tests on next_action
with simply their node.status
equivalent, we get:
iterative_dfs(starting_node):
stack = Stack()
push(stack, starting_node)
while stack is not empty:
node = pop(stack)
if node.status is nil:
node.status = DISCOVERED
push(stack, node)
for child in node.children:
if child.status is DISCOVERED:
exit('Error: cycle detected')
else if child.status is PROCESSED:
log("Node {child} is being revisited")
else if child.status is nil:
push(stack, child)
else if node.status is DISCOVERED:
after_children_processing(node)
node.status = PROCESSED
else if node.status is PROCESSED:
do nothing
In the above, when a node is popped from the stack at the onset of a loop, its discovery status is verified.
If it's just about to be DISCOVERED
, its status is thus updated and the node is placed back on the stack. Its children are then iterated through and added to the stack when appropriate. This arrangement causes children nodes to be popped from the stack before the parent (during later iterations of the while
loop) and thus also processed first.
If the popped node is currently DISCOVERED
, it means that all its children were popped from the stack and processed. It can also be processed.
If the popped node is already PROCESSED
, it implies a simple revisiting of a fully discovered node, which can be ignored.
A slightly different but equivalent approach is to peek
on the stack (rather than popping) and only later, pop PROCESSED
nodes.
iterative_dfs(starting_node):
stack = Stack()
push(stack, starting_node)
while stack is not empty:
node = peek(stack) // just peeking, not popping.
if node.status is nil:
node.status = DISCOVERED
for child in node.children:
if child.status is DISCOVERED:
exit('Error: cycle detected')
else if child.status is PROCESSED:
log("Node {child} is being revisited")
else if child.status is nil:
push(stack, child)
else if node.status is DISCOVERED:
after_children_processing(node)
node.status = PROCESSED
else if node.status is PROCESSED:
pop(stack) // popping