Tech Interview – Improve the Algorithm

Adding new files to our database increases the processing time significantly, because the algorithm we use is not efficient. Let's recap how it works. After we read the database in, we have a list of Person instances, and every Person instance holds a list of the names of the people that person likes:

```csharp
public class Person
{
    public string Name { get; set; }
    public List<string> LikedPersons { get; set; }
}
```

If I want to know how many people like me, I have to go through the whole database, searching for my name in the LikedPersons lists:

```csharp
var persons = ReadAllPersonsFromDatabase();
var count = 0;

foreach (var candidate in persons)
{
    if (candidate.LikedPersons.Contains(myName))
    {
        count++;
    }
}
```

This works well if I only want to know how many people like me. If I want to know this number for every person, a simple idea is to run the same loop for each of them. In that case the code looks like the following:

```csharp
foreach (var person in persons)
{
    var count = 0;

    foreach (var candidate in persons)
    {
        if (candidate.LikedPersons.Contains(person.Name))
        {
            count++;
        }
    }

    // store count for the given person
}
```

This is the algorithm our LikeStatistics application uses. It consists of two loops, both iterating through the persons list, so it is an O(n*n) algorithm: when we double the number of people, it becomes four times slower. On top of that, there is an expensive operation in the middle of the loops, a linear search of the LikedPersons list. The length of the LikedPersons list will not grow with the size of the database (I will not like more people just because users from Alaska or Japan come on board), so we can treat searching through that list as constant time.

Now the question is whether it is possible to do better than O(n*n). In interviews, one suggested way to decrease the processing time is to process the data in parallel. Does that really help? Let's say we have a 4-core system and parallel processing cut the time to 25% of the original, so we are happy. Then the number of people doubles. Because we use an O(n*n) algorithm, the processing time becomes not twice but four times longer, and we are back at the original processing time. We buy more processors; with 16 cores the time goes down to 25% again. But business goes well, the number of people doubles, and we are at the original processing time once more. To get down to 25% we would now need 64 cores, and after the next doubling, 256. Clearly this is not a long-term solution: the number of processors needed grows much faster than the number of people we process. We need to find a better algorithm.
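To make the arithmetic above concrete, here is a tiny sketch with illustrative numbers only, assuming perfect linear speedup from extra cores, of how the core count needed to hold the processing time at 25% explodes as the user base doubles:

```csharp
using System;

// With an O(n*n) algorithm, each doubling of the user base multiplies the
// work by 4, so the cores needed to keep the time at 25% of the single-core
// baseline are also multiplied by 4.
long CoresNeeded(int doublings)
{
    long cores = 4;                    // 4 cores give us the initial 25% time
    for (int i = 0; i < doublings; i++)
        cores *= 4;                    // every doubling quadruples the work
    return cores;
}

for (int d = 0; d <= 3; d++)
    Console.WriteLine($"after {d} doubling(s): {CoresNeeded(d)} cores");
// 4, 16, 64, 256 cores: the progression from the text
```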

Let's try a real-life example; maybe it reveals a good way to count what we want. Let's say a high school wants to know who the nicest person is. Everybody can vote, and they can cast more than one vote: they can name every nice kid they know on the ballot paper.

How would we count the votes? According to the original algorithm, we would take the school directory, take the first name from it, and go through all the votes looking for that name. Then we would start over with the next person from the directory. Is this really what we would do? Of course not!

What we would actually do is take the voting papers one by one, and for every name we find on them maintain a tally sheet. The real-life algorithm would be like this:

1. Take a voting paper from the box; if there are no more, we are finished
2. If no names are left on the paper, go back to 1); otherwise take a name
3. If we have seen this name before, find its tally sheet and add a new mark
4. If we have not seen this name, start a new tally sheet (or a new line on a sheet) for the person and give it a mark
5. Go back to 2)

This algorithm is really simple, and it goes through the votes only once! Let’s code it then.

As a first step, let's write code that visits every person once and reads every name from their LikedPersons list only once:

```csharp
foreach (var person in persons)
{
    foreach (var vote in person.LikedPersons)
    {
        ...
    }
}
```

It is a double loop again, but not every loop like this means an O(n*n) algorithm. The inner loop goes through a different list, whose length doesn't (really) depend on the number of people in the database. So as we increase the number of people, the outer loop takes longer, but the inner loop does not: the whole thing is only O(n).

Now we need to figure out how to register the votes efficiently. In real life we would have sheets with the names of the people we have already seen. When I see a name on a voting paper, I start searching my tally sheets. In the simplest case, the already seen names are in random order and I have to look at them one by one. This slows down the processing, and we drift back towards O(n*n) again. To do better, I could keep the sheets in alphabetical order, which speeds up the search.

In a program, there is an even faster way to find the already seen names. A very common interview question is how to find an item in a collection in constant time, so it is worth knowing the answer. If we want to find strings or other objects fast in a collection, a hash-based data structure might be our solution. In .NET, this is the Dictionary (or HashSet) class.

When we read the LikedPersons lists and see a name, say "John Doe", we check the dictionary to see whether that name is already there, and get the associated "tally sheet". We already have a data structure representing the tally sheet, the PersonNameWithLikeCount:

```csharp
var counters = new Dictionary<string, PersonNameWithLikeCount>();

foreach (var person in persons)
{
    foreach (var vote in person.LikedPersons)
    {
        PersonNameWithLikeCount personWithCounter;

        // have we seen this name before?
        if (!counters.TryGetValue(vote, out personWithCounter))
        {
            // no, it is new: open a new sheet for them and tally it
            personWithCounter =
                new PersonNameWithLikeCount()
                {
                    Name = vote,
                    LikeCount = 1
                };

            counters[vote] = personWithCounter;
        }
        else
        {
            // otherwise just add a new mark
            personWithCounter.LikeCount++;
        }
    }
}
```

This algorithm makes the processing much faster.
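As a quick sanity check, the following self-contained sketch (synthetic names; the database is simplified to a name-to-liked-names dictionary for brevity) runs both the O(n*n) scan and the one-pass tally and confirms they produce the same counts:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// made-up data: who likes whom
var persons = new Dictionary<string, List<string>>
{
    ["Ann"] = new List<string> { "Bob", "Cid" },
    ["Bob"] = new List<string> { "Ann" },
    ["Cid"] = new List<string> { "Ann", "Bob" },
};

// O(n*n): for every person, scan every liked-list for their name
var slow = new Dictionary<string, int>();
foreach (var name in persons.Keys)
    slow[name] = persons.Values.Count(liked => liked.Contains(name));

// O(n): a single pass over all the votes, tallying in a dictionary
var fast = new Dictionary<string, int>();
foreach (var liked in persons.Values)
    foreach (var vote in liked)
        fast[vote] = fast.TryGetValue(vote, out var count) ? count + 1 : 1;

Console.WriteLine(fast["Ann"]);                 // 2
Console.WriteLine(slow["Ann"] == fast["Ann"]);  // True
```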

With this part we have finished the refactoring interview. Good luck with a real one!

Tech Interview – Non-recursive Tree Traversal

In the previous part we extended the database structure, and it now supports sub-directories. The algorithm that walks through the directory structure is based on recursive calls. Recursion often results in a nice program structure, but it also has its drawbacks. What if we want to process the data, like the files of a directory structure, in a lazy way, one by one? What if we want to write an iterator around the algorithm? One problem with recursion is that the execution stack stores the state of the algorithm, and we cannot return to the caller method without destroying that state.

Today we will solve the directory problem in a different way, one that makes it possible to implement an iterator around it. Once we've solved a problem one way, it is quite hard to come up with a different one. To get around this mental block, let's look at how we would solve a similar problem in real life.

We want to collect file names, so let's say that in the real-life example we want to collect "things" that we use on a daily basis. In the database we have a start directory; in the real-life example we have a single address to start from. When we visit this address, we get a bunch of "things", and we may also get a list of addresses where we can find even more "things".

What would we do when we needed more and more "things"? The real-life algorithm would be something like the following:

1. At the beginning we have the single address
2. We wait until we need to use a "thing"
3. Do we have "things" in stock (in our pocket, a box, etc.)? If yes, continue at 9.
4. Do we have any address left where we can get more things? If not, it is over
5. Go to one of the addresses we have; here we may get "things" and new addresses
6. Put the "things" into our pocket, box, etc.
7. Add the new addresses to our address list
8. Go back to 3.
9. Use one "thing"
10. Go back to 2.

This algorithm for the real-life example is so obvious that we didn't even need to think for a second. Still, when I ask the directory traversal problem in an interview, almost everybody comes up with the recursive solution. Probably I would have come up with the recursive version too if I had been asked. Anyway, it is very easy to translate the real-life algorithm to directory traversal. Instead of a pocket (box, etc.) we will use a queue to store the file names (the "things"), and the directory list (the "address list") will also be a queue:

1. Put the root directory name into the directory queue
2. Wait until a file name is requested (through the iterator)
3. Does the file name queue have any file names in it? If yes, go to 9.
4. If the directory name queue is empty, the algorithm ends
5. Get a directory name from the queue
6. Read all file names from that directory and push them into the file name queue
7. Read all sub-directory names and push them into the directory queue
8. Go back to 3.
9. Return a file name from the file name queue
10. Go back to 2.

Enumerators Again

We have already created enumerators: the Parser classes became enumerators after the second refactoring. When we implemented the enumerator classes for the parsers, we implemented IDisposable, and the IEnumerable-derived class (the FileBaseDataset class) used a Factory class to create the enumerators according to some parameters.

In most cases we need much simpler enumerators: we do not need to implement IDisposable, and the enumerable class returns just a single enumerator type.

C# generated enumerators

For those simple cases, when the enumerable class always creates the same enumerator type and no Dispose() implementation is needed, the C# compiler can help us create the enumerators. Why would we need any help? Because the structure of an enumerator is twisted; it doesn't really fit the structure we usually use for traversing items. When we traverse items, we usually use some sort of loop, and inside the loop we take the next item and process it.
When we want to implement an enumerator, we need to chop this loop up so that we can return the item from the loop as the result of the IEnumerator.MoveNext() call. And when MoveNext() gets called again, we need to find a way to continue our loop at the place where we stopped last time. Sometimes this is easy, sometimes it's harder. But how can the C# compiler help us?

Let's see the algorithm we described above when it is not encoded into IEnumerator.MoveNext(): let's just print the file names to the console. In this case we can use a normal loop:

```csharp
public void DumpFiles(string root, string fileType)
{
    Queue<string> fileQueue = new Queue<string>(); // "pocket" for the things
    Queue<string> dirQueue = new Queue<string>();  // the "address list"

    // start with the single address we have
    dirQueue.Enqueue(root);

    // we work while we have "things" in our pocket or addresses we haven't visited
    while (dirQueue.Count > 0 || fileQueue.Count > 0)
    {
        // do we have "things" in our pocket?
        if (fileQueue.Count > 0)
        {
            // pick one and print it to the console
            Console.WriteLine(fileQueue.Dequeue());
        }
        else
        {
            // otherwise we need to visit an address
            var dirToProcess = dirQueue.Dequeue();

            // put all the "things" into our pocket
            Array.ForEach(
                FileSystem.GetFiles(dirToProcess, fileType),
                fileQueue.Enqueue);

            // and note the new addresses
            Array.ForEach(
                FileSystem.GetDirectories(dirToProcess),
                dirQueue.Enqueue);
        }
    }
}
```

It was easy to implement the algorithm this way. Now let's think for a minute about what we would need to do to implement it in IEnumerator.MoveNext(). Where we have the Console.WriteLine(), we would need to return after storing the current file name in the enumerator's Current property. And when the method gets called again, we would need to find our way back to the middle of the loop and restore the state of the loop, if it had any. Maybe that is not a big deal in this particular example, but it is even easier to let the C# compiler figure it out for us:

```csharp
internal static class FileBasedDatabase
{
    internal static IEnumerable<string> Files(string root, string fileType)
    {
        Queue<string> fileQueue = new Queue<string>();
        Queue<string> dirQueue = new Queue<string>();

        dirQueue.Enqueue(root);

        while (dirQueue.Count > 0 || fileQueue.Count > 0)
        {
            if (fileQueue.Count > 0)
            {
                yield return fileQueue.Dequeue();
            }
            else
            {
                var dirToProcess = dirQueue.Dequeue();

                Array.ForEach(
                    FileSystem.GetFiles(dirToProcess, fileType),
                    fileQueue.Enqueue);

                Array.ForEach(
                    FileSystem.GetDirectories(dirToProcess),
                    dirQueue.Enqueue);
            }
        }
    }
}
```

What is happening? We changed the Console.WriteLine() to a yield return. This is how we tell the C# compiler that this is the point where we want to break the loop, this is the value we want to provide as the current item, and this is the point where we want to come back when MoveNext() gets called again.
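This break-and-resume behaviour is easiest to see on a minimal, file-system-free sketch (not tied to the file database), using a log list so the ordering of events is visible:

```csharp
using System;
using System.Collections.Generic;

var log = new List<string>();

// a tiny iterator method: the body only runs while the consumer pulls items
IEnumerable<string> Items()
{
    log.Add("producing a");
    yield return "a";
    log.Add("producing b");
    yield return "b";
}

var items = Items();            // nothing runs yet, we only got the aggregate
log.Add("starting the loop");
foreach (var item in items)     // each MoveNext() resumes the body at the yield
    log.Add("got " + item);

Console.WriteLine(string.Join(" | ", log));
// starting the loop | producing a | got a | producing b | got b
```

Note that "starting the loop" is logged before "producing a": calling Items() only creates the generated enumerable, and the method body starts running at the first MoveNext() call.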

The C# compiler will restructure the code to support everything we expect from an iterator. But it does even more. The iterator design pattern has an Aggregate (Enumerable) and an Iterator (Enumerator), and we don't need to implement either of them. The C# compiler generates every class needed by the .NET version of the iterator pattern. It creates an IEnumerable implementation (the aggregate); this class is returned when we call FileBasedDatabase.Files(). This class has a GetEnumerator() method (required by IEnumerable) which returns another generated class, an IEnumerator implementation (the iterator). This class holds our original code in a restructured form. It is easier to understand all of this from the following example:

```csharp
var files = FileBasedDatabase.Files(@"c:\temp", ".txt");
```