Adding new files to our database increases the processing time significantly, because the algorithm we use is not efficient. Let's recap how it works. After we read the database in, we have a list of Person instances, and every Person instance maintains a list of the names of the other people this person likes:

public class Person
{
    public string Name { get; set; }
    public List<string> LikedPersons { get; set; }
}

If I want to know how many people like me, I need to go through the whole database, searching for my name in the LikedPersons lists:

var persons = ReadAllPersonsFromDatabase();
var count = 0;

foreach (var candidate in persons)
{
    if (candidate.LikedPersons.Contains(myName))
    {
        count++;
    }
}

It works well if I only want to know how many people like me. If I want to know this number for every person, a simple idea is to run the same loop for each of them. In that case the code looks like the following:

foreach (var person in persons)
{
    var count = 0;
 
    foreach (var candidate in persons)
    {
        if (candidate.LikedPersons.Contains(person.Name))
        {
            count++;
        }
    }

    // store count for a given person
}

This is the algorithm our LikeStatistics application uses. It is two nested loops, both iterating through the persons list, so it is an O(n*n) algorithm: when we double the number of people, the processing becomes four times slower. For example, with 10,000 people the inner check runs 10,000 * 10,000 = 100,000,000 times, but with 20,000 people it runs 400,000,000 times. On top of that, there is a potentially expensive operation in the middle of the loops, a linear search on the LikedPersons list. However, the length of the LikedPersons list does not grow with the size of the database (I will not like more people just because Alaska or Japan gets added to it), so we can treat searching through that list as constant time.

Now the question is whether it is possible to do better than O(n*n). In interviews, one often-suggested way to decrease the processing time is to process the data in parallel. Does it really help? Let's say that we have a 4-core system and parallel processing brought the processing time down to 25% of the original, so we are happy. Then the number of people doubles. Because we use an O(n*n) algorithm, the processing time becomes not twice but four times longer, and we are back at the original processing time. We buy more processors, we have 16 cores now, and the time goes down to 25% again. But business goes well, the number of people doubles, and we are at the original processing time once more. To get down to 25% we would now need 64 cores, and after the next doubling 256. Clearly this is not a long-term solution: the number of processors needed grows much faster than the number of people we need to process. We need to find a better algorithm.
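To make this concrete, here is a small sketch (not part of the LikeStatistics application; the starting population of 10,000 is made up) that computes how many cores we would need after each doubling of the population to stay at 25% of the original processing time, assuming the work parallelizes perfectly:

int people = 10000;                       // hypothetical starting population
long baseWork = (long)people * people;    // O(n*n) work at the starting size

for (int doubling = 0; doubling <= 4; doubling++)
{
    long work = (long)people * people;          // O(n*n) work for this population
    long coresNeeded = 4 * (work / baseWork);   // 4 cores gave us 25% at the start
    Console.WriteLine($"{people,7} people -> {coresNeeded,4} cores needed");
    people *= 2;
}

For 10,000 to 160,000 people this prints 4, 16, 64, 256 and 1024 cores, which is exactly the hopeless progression described above.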

Counting votes

Let's try a real-life example; maybe it reveals a good way to count what we want. Let's say that a high school wants to know who the nicest person is. Everybody can vote, and they can cast more than one vote: they can name every nice kid they know on their ballot paper.

How would we count the votes? According to the original algorithm, we would take the school directory, take the first name from it, and go through all the ballots looking for that name. Then we would start over with the next person from the directory. Is this really what we would do? Of course not!

What we would do instead is take the ballot papers one by one and maintain a tally sheet for every name we find on them. The real-life algorithm would go like this:

  1. take a ballot paper from the box; if there are no more, we are finished
  2. if there is no name left on the ballot paper, go back to 1); otherwise take the next name
  3. if we have seen this name before, find their tally sheet and add a new mark
  4. if we have not seen this name before, start a new tally sheet (or a new line on the tally sheet) for the person, and give them a mark
  5. go back to 2)

This algorithm is really simple, and it goes through the votes only once! Let’s code it then.

As a first step, let's write code that visits every person once and every name in their LikedPersons list only once:

foreach (var person in persons)
{
    foreach (var vote in person.LikedPersons)
    {
        ...
    }
}

It is a double loop again, but not every loop like this means an O(n*n) algorithm. The inner loop goes through a different list, whose length does not (really) depend on the number of people in the database. So as we increase the number of people, the outer loop takes longer, but the inner loop does not, and the whole thing is only O(n).

Now we need to figure out how to register the votes in an efficient way. In real life we would keep a sheet with the names of the people we have already seen. When I see a name on a ballot paper, I start searching my tally sheets. In the simplest case I keep the already-seen names in a random order and have to look at them one by one. This slows down the processing, and we are drifting back towards O(n*n) again. If I want to do better, I maintain alphabetical order, which speeds up the search.
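As a small illustration (the list below is hypothetical, not part of the LikeStatistics code), a sorted List<string> can be searched with BinarySearch, which takes O(log n) steps instead of a linear scan:

// a "tally sheet index" kept in alphabetical order
var seenNames = new List<string> { "Alice", "Bob", "John Doe", "Zoe" };

// BinarySearch is O(log n), but it only works while the list stays sorted
int index = seenNames.BinarySearch("John Doe");

if (index < 0)
{
    // a negative result encodes the insertion point, so we can keep the list sorted
    seenNames.Insert(~index, "John Doe");
}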

In a program there is an even faster way to find the already-seen names. How to find an item in a collection in constant time is a very common interview question, so it is worth knowing the answer. If we want to find strings or other objects quickly in a collection, a hash-based data structure might be our solution. In .NET, this is the Dictionary (or HashSet) class.

When we read the LikedPersons list and see a name like "John Doe", we need to check whether this name is already in the dictionary and get the associated "tally sheet". We already have a data structure representing the tally sheet, the PersonNameWithLikeCount class.
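The class comes from the earlier parts of this series; if you don't have them at hand, this minimal sketch, reconstructed from how the class is used below, is enough to follow along (the original may contain more members):

public class PersonNameWithLikeCount
{
    public string Name { get; set; }     // who receives the likes
    public int LikeCount { get; set; }   // how many people like them
}

With this in place, the counting loop looks like the following: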

var counters = new Dictionary<string, PersonNameWithLikeCount>();
 
foreach (var person in persons)
{
    foreach (var vote in person.LikedPersons)
    {
        PersonNameWithLikeCount personWithCounter;
 
        // have we seen this name before?
        if (!counters.TryGetValue(vote, out personWithCounter))
        {
            // No, it is new, open a new sheet for them and tally it
            personWithCounter =
                new PersonNameWithLikeCount()
                {
                    Name = vote,
                    LikeCount = 1
                };
 
            counters[vote] = personWithCounter;
        }
        else
        {
            // otherwise just add a new mark
            personWithCounter.LikeCount++;
        }
    } 
}
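
To read the results back we can simply enumerate the dictionary; for example (a usage sketch that assumes the counters dictionary built above):

// list everybody with their like counts
foreach (var entry in counters.Values)
{
    Console.WriteLine($"{entry.Name} is liked by {entry.LikeCount} people");
}

// or look up a single person's count in (amortized) constant time
PersonNameWithLikeCount myCounter;
if (counters.TryGetValue(myName, out myCounter))
{
    Console.WriteLine($"{myCounter.LikeCount} people like me");
}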

This algorithm makes the processing much faster: instead of going through the whole persons list for every single person, we go through the data only once, so the work grows linearly with the number of people.
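If you want to measure the difference on your own data, a minimal timing sketch could look like this (CountLikesQuadratic and CountLikesWithDictionary are hypothetical helper names wrapping the two versions of the loop shown above):

var persons = ReadAllPersonsFromDatabase();
var stopwatch = System.Diagnostics.Stopwatch.StartNew();

CountLikesQuadratic(persons);            // the original nested-loop version
Console.WriteLine($"Nested loops: {stopwatch.ElapsedMilliseconds} ms");

stopwatch.Restart();
CountLikesWithDictionary(persons);       // the dictionary-based version
Console.WriteLine($"Dictionary:   {stopwatch.ElapsedMilliseconds} ms");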

With this, we have finished the refactoring interview. Good luck in a real one!