In the previous part we extended the database structure and now it supports sub-directories. The algorithm that walks through the directory structure is based on recursive calls. Many times recursion results a nice program structure, but it also has its drawbacks. What if we want to process the data – like files from a directory structure – in a lazy way, one by one? What if we want to write an iterator around the algorithm? One problem with recursion is that the execution stack stores the state of the algorithm, and we cannot return to the caller method without destroying this state.

Today we will solve the directory problem in a different way, which will make it possible to implement an iterator around it. Once we’ve solved a problem one way, it is quite hard to come up a different way. To help getting around this mental paralysis let’s see how we would solve a similar problem in real life.

We want to collect file names, so let’s say that in the real life example we want to collect “things” because we need to use them on a daily basis. In the database we have a start directory, in the real life example we have a single address to start. When we visit this address, we get a bunch of “things”, and also we can get a list of addresses where we can find even more “things”.

What would we do when we needed more and more “things”? The real life algorithms would be something like the following:

  1. At the beginning we have the single address
  2. We wait until we need to use a “thing”
    (when we need a thing:)
  3. do we have “things” in stock? (in our pocket, a box, etc) if yes, continue at 8.
  4. do we have any address to get more things? if no, it is over
  5. go to one of the addresses we have, here we may get “things” and new addresses
  6. Put the “things” into our pocket, box, etc
  7. Write the new addresses to our address list (and cross over the current address)
  8. go to 2.
  9. use one “thing”
  10. back to 1.

This algorithm for the real life example is so obvious, we didn’t even need to think for a second. Still, when I ask the directory traversal problem on an interview, almost everybody comes up with the recursive solution. Probably I would have came up with the recursive version if I had been asked. Anyway, it is very easy to translate the real life algorithm to directory traversal. Instead of a pocket (box, etc) we will use a queue to store file names (“things”), and the directory list (“address list”) will also be a queue:

  1. Put the root directory name into a queue
  2. Wait till a file name will be requested (by using the iterator)
  3. Does the file name queue have any file names in it? If yes, go to 8.
  4. If the directory name queue empty, the algorithm ends
  5. get a directory name from the queue
  6. read all file names from the directory and push them into the file name queue
  7. read all sub-directory names and push the directory names into the directory queue
  8. go to 2.
  9. return a file name from the file name queue
  10. go to 1.

Enumerators Again

We have already created enumerators, the Parser classes became enumerators after the second refactor. When we implemented the enumerator classes for the parsers, we implemented IDisposable, also the IEnumerable derived class (the FileBaseDataset class) used a Factory class to create the enumerators according to some parameters.

In most cases we need much simpler enumerators. We do not need to implement IDisposable, also the Enumerable class would just return a single enumerator type.

C# generated enumerators

For those simple cases, when the enumerable class creates the same enumerator type, and no Dispose() implementation is needed, the C# compiler can help us creating enumerators. Why would we need any help? Because the structure of the enumerators is twisted, doesn’t really fit to the structure we usually use for traversing items. When we are traversing items, we usually use some sort of loop, and inside the loop we take the next item and process it.
When we want to implement an enumerator, we need to chop this loop the way that we can return the item from the loop as a result of the IEnumerator::MoveNext() call. And when the MoveNext() gets called again, we need to find a way to continue our loop at the place where we stopped last time. Sometimes it is easier, sometimes it’s harder. But how can the C# compiler help us?

Let’s see the algorithm we described above, when it is not encoded into an IEnumerator::MoveNext(). Let’s just print the file names to the console. In this case we can use a normal loop:

public void DumpFiles(string root, string fileType)
{       
    Queue<string> fileQueue = new Queue<string>(); // "pocket" for the things
    Queue<string> dirQueue = new Queue<string>();  // the "address list"
 
    dirQueue.Enqueue(root);     // write the first address to the address list
 
    // we work till we have "things" in our poclet, or have address we haven't visited
    while (dirQueue.Count > 0 || fileQueue.Count > 0)
    {
        // do we have "things" in our pocket?
        if (fileQueue.Count > 0)
        {
            // pick one and print to console
            Console.WriteLine(fileQueue.Dequeue());
        }
        else
        {
            // otherwise we need to visit an address
            var dirToProcess = dirQueue.Dequeue();
 
            // put all the "things" into our pocket
            Array.ForEach(
                FileSystem.GetFiles(dirToProcess, fileType),
                fileQueue.Enqueue);
 
            // and write the new addresses to the address list
            Array.ForEach(
                FileSystem.GetDirectories(dirToProcess),
                dirQueue.Enqueue);
        }
    }
}

It was easy to implement the algorithm this way. Now let’s think a minute about what we would need to do if we wanted to implement it in an IEnumerator::MoveNext(). Where we have the Console.WriteLine(), we would need to return after we stored the current file name in the Enumerator.Current property. And when the method gets called again, we needed to find the way back to the middle of the loop, also we needed to restore the state of the loop, if it had any. Maybe it is not a big deal in this particular example, but even easier to use the C# compiler to figure this out for us:

internal static class FileBasedDatabase
{           
    internal static IEnumerable<string> Files(string root, string fileType)
    {       
        Queue<string> fileQueue = new Queue<string>();
        Queue<string> dirQueue = new Queue<string>();
 
        dirQueue.Enqueue(root);
 
        while (dirQueue.Count > 0 || fileQueue.Count > 0)
        {
            if (fileQueue.Count > 0)
            {
                yield return fileQueue.Dequeue();
            }
            else
            {
                var dirToProcess = dirQueue.Dequeue();
 
                Array.ForEach(
                    FileSystem.GetFiles(dirToProcess, fileType),
                    fileQueue.Enqueue);
 
                Array.ForEach(
                    FileSystem.GetDirectories(dirToProcess),
                    dirQueue.Enqueue);
            }
        }
    }
}

What is happening? We changed the Console.WriteLine() to a yield return. This is how we tell the C# compiler that that is the point we want to break the loop at, that is the value we want to provide as the current item, and that is the point where we want to come back to when MoveNext() gets called again.

The C# compiler will restructure the code to support everything that we expect from an iterator. But it does even more. The iterator design pattern has an Aggregate (Enumerable) and an Iterator (Enumerator). We don’t need to implement any of those. The C# compiler generates every class needed by the .NET version of the iterator pattern. It creates an IEnumerable implementation (the aggregator), this class will be returned when we call FileBasedDatabase.Files(). This class will have a GetEnumerator() method (needed by IEnumerable) which returns an other generated class, an IEnumerator implementation (the iterator). This class holds our original code in a restructured way. It is easier to understand all of this from the following diagram:

Misleading to debug

The Files() method we wrote above will be restructured, and this can lead to some confusion when we want to debug the code. What happens when I press F5 with the break point in Files()?

DebugGenEnum

The execution will not be broken, even if we call Files() and there is a break point right in the first line. Why not? We can understand this if we look at the figure that shows how the code are moved into the generated classes. We call Files() but now it contains a generated code only. The code with the break point is now in the MoveNext() method of a generated class. If we want to hit the break point, we need to call MoveNext() first. This code will stop at the break point:

var files = FileBasedDatabase.Files(@"c:\temp", ".txt");
var enumerator = files.GetEnumerator();
enumerator.MoveNext();