LateX

Wednesday, July 27, 2011

Algorithms on strings - XVIII Polish CS Olympiad - Difference (4/10)

This was quite a fun question. Although solution turned out to be simple it was a little bit unconventional and it took me a couple of minutes to work out the solution.
You can find the english version of the text as well as the checker here:


So first natural approach to this problem is to analyse all subsequences separately. There are $\Theta(n^2)$ of them. For each sequence we just count for each letter how many times it occurs in this sequence and then find a minimum and maximum. As sequence can have pessimistically about $\Theta(n)$ length, the whole algorithm complexity is $\Theta(n^3)$. This is way is sufficient for short sequence but way to slow for $1000000$ element sequence.

The Maximum Sum contiguous subsequence problem

You should probably know this problem before you proceed. If you already know it you can skip this section. The problem is the following:
You are given the sequence of integers (can be negative). Find such a contiguous subsequence, that sum of it's elements is maximum possible.
For example if we consider sequence
2 1 -5 4 -2 4 -1 -2 -3 5 
The answer is subseqence 4 -2 4 having sum 6.

Now assuming we have $n$ elements this problem can be solved in $\Theta(n)$. At this point you might want to try to work it out on your own.

As this problem is quite well described in the internet I won't give exhaustive description. Here's how you can approach it. As the target algorithm is $\Theta(n)$ it's sort of a clue that we scan the sequence only once. OK, let's try left to right maybe this will work out. The simplest idea that comes to my head is to calculate sum of the prefixes and see what happens.

Seq.21-54-24-1-2-35
Sum23-2       
[Best so far: 5]

And look what happened now! We have negative prefix sum. This means that we don't want to add this prefix (2 1 -5) to our final result (there's no benefit at all!). But what about (1 -5) or (-5) ? Both of them sums to negative number and therefore are worthless. So let's totally forget about first three numbers. We will indicate that in our calculations by replacing -2 with 0 in our table.

Seq.21-54-24-1-2-45
Sum23042653-1 
[Best so far: 6]

Ok now it happened again. In case somebody's lost we are currently considering subsequence (4 -2 4 -1 -2 -4). Now again let's notice that however many terms you strip from the beginning of this sequence we will never exceed -1. Let's forget about all of them then analogous to above.

Seq.21-54-24-1-2-45
Sum2304265305
[Best averally: 6]

Voille! The result is 6.
But hold on! How did it happened? Each time we were able to forget about whole prefix totally. What if it was say 1 -3 -3 4. It sums to -1 so we would get rid of all of it. But we can strip first three terms and get 4 alone, which would be quite useful in our future calculations. What now? Is the algorithm wrong? Did we just choose lucky example above?
No! Think about it! Here's the trick! We would never be considering such a prefix in our algorithm! After reading 1 and then -3 we would get rid of them (because they sum to negative value) and passed on 0. Then we read -3, which we also get rid of and pass on 0. Then we read 4 and pass it on. Hey, it works!


This way we ended up with linear time algorithm. We just do one pass and always get rid of currently considered terms once we get a negative sum. Also we don't have to think about the terms explicitly. Just change the variable holding sum so far to 0 once it's negative. Simple, isn't it?

// sequence contains 1-based n-element sequence. 
int res = 0, best = 0; 
for (int i = 1; i <= n; ++i) {
  res += sequence[i];
  if(res > best) best = res;   
  if(res < 0) res = 0;
}
For geeks: More formally we can prove this algorithm is correct inductively on the sequence length. The hypothesis is that after ith iteration of loop variable res contains maximum sum that subsequence (may be empty) finishing at i-th index can give. This indeed true. Assuming it's true for i-1. Now assume that to get better result for i we need to use other sequence that the one from i-1. Then it would be better for i-1 as well if we stripped the last term. Contradicion with the fact that we had optimal answer for i-1. Also we didn't take into account that in particular we can just have an empty sequence. It gives 0 so it's worth doing only if otherwise we would get negative number.

Back to reality now - the problem is quite complicated so let's solve simpler version first. Let's assume that magic fairy told us that in the subsequence which is the answer the letter that occurs the most is 'b' and the letter that occurs the least is 'a'. Problem is to calculate such a subsequence that difference between number of occurrences of 'b' and 'a' (if there are more 'a' then it would be negative in this case) is maximum possible. Now - can we do that?

I seriously recommend that you try to work it out on your own given the solution to the problem in yellow box before continuing. It's important at least to try, you don't have to succeed!
Here's the solution. Let's call the letter that occurs the biggest number of times the big letter and the one that occurs the least number of times the small letter . Hence, each big letter is sort of like +1, right? Each small letter is like a -1. What about rest? Completely unimportant so let it be 0. Now we just have to run the algorithm for maximum sum subsequence, right? Not quite, but close. If we read the question in detail we will notice the sentence "We are assuming that the least frequent letter has to occur at least once in the resulting fragment". So we can just check if in current sequence there is such a letter. The code would look sort of like this.

int calculate_maximum_difference(char small_letter, char big_letter) {   
  int res = 0, best = 0;   
  bool has_small = false;   
  for(int i = 0; i < n; ++i) {
      if (sequence[i] == small_letter) {
        --res;
        has_small = true;
        if(res < 0) {
          has_small = false;
          res = 0;
        } 
      } else if (sequence[i] == big_letter) {
        ++res;
      }
      if(best < res && has_small) best = res;
  }
  return best;
}
OK, so this code have to work! No? Of course not. Consider example:
baaaa
where b is small and a is big. Now we read one b then get rid of it, then read 4 a's but make no use out of them because we do not have a small letter. So sometimes it might be worth keeping the a letter even if it subtracts one from final result. But on the other hand in this example:
baaaabaaaaa
We definitely don't want to consider first b. So when to do what? I don't care! Let's just test each case separately. Our function will have additional boolean parameter discard, that will determine whether this time we are discarding the small letter or not. OK great, let's change this to general problem solution! We just consider each pair of letters. The worst that could happen if we consider wrong pair is that we will get result worse than optimal, but we don't really care because at some point we will encounter the optimal result. The full code of this solution is presented below:

#include <cstdio> 
#include <vector> 
#include <algorithm> 
using namespace std;  

int n; 
char sequence[1000006]; // always add some more just in case :)  

int calculate_maximum_difference(char small_letter, 
                                 char big_letter, 
                                 bool discard) {   
  int res = 0, best = 0;   
  bool has_small = false;   
  for(int i = 0; i < n; ++i) {
    if (sequence[i] == small_letter) {
      --res; has_small = true;
      if(discard) {
        if(res<0) { res = 0; has_small = false; }
      } else {
        if(res<-1) { res = -1; }
      }
    } else if (sequence[i] == big_letter) {
      ++res;
    }
    if(best < res && has_small) best = res;
  }
  return best;
}


int main() {
  scanf("%d", &n);
  scanf(" %s", sequence); // one of the fastest ways to read a string
  int best = 0;
  for (char i = 'a'; i <= 'z'; ++i) {
    for(char  j = 'a'; j <= 'z'; ++j) {
      if(i==j) continue;
      best = max(max(calculate_maximum_difference(i, j, true),
                 calculate_maximum_difference(i, j, false)), 
             best);
    }
  }
  printf("%d\n", best);
}
This is working solution. The only disadvantage of this solution is that it's to slow to satisfy the task requirements (it get's 60/100 points in tester). Let's think about a complexity. Let's denote the alphabet size by $\Omega$. For each unordered pair we do single scan through entire sequence. This is $\Theta(\Omega^2 n)$. It's quite a lot. Looking at the actual numbers ($26 * 26 * 1 000 000 = 676 000 000$). Another optimization is quite straightforward. Guessing the appropriate pair of letters sound difficult and non-trivial, so let's think what we can actually do to make calculate_maximum_difference run faster. So what's the actual thing? Oh, come on it's easy. Thinking... Thinking... ... Got it! We can remove every letter from considered string that is nor small, neither big. For example if a is big and b is small then string:
abcsdsfbabgfabgab 
is equivalent to:
abbababab 
But how to do it efficient? We can just store for each letter it's indexes in the input string. Then we can create strings looking just at the two necessary letters rather than whole string. So after another bit of thinking we end up with the code:

#include <cstdio> 
#include <vector> 
#include <algorithm> 
#include <string> 
using namespace std;  

int n; 
char sequence[1000006]; // always add some more just in case :) 
vector<int> letter_positions[256]; // letter_positions['a'] is 
                                         // a vector of all the indexes 
                                         // at which letter 'a' occurs 
                                         // in big string.  
int calculate_maximum_difference(const string& text, 
                                 char small_letter, 
                                 char big_letter, 
                                 bool discard) {   
  int res = 0, best = 0;   
  bool has_small = false;   
  for(int i = 0; i < text.size(); ++i) {
      if (text[i] == small_letter) {
        --res; has_small = true;
        if(discard) {
          if(res<0) { res = 0; has_small = false; }
        } else {
          if(res<-1) { res = -1; }
        }
      } else {
        ++res;
      }
      if(best < res && has_small) best = res;
  }
  return best;
}


int main() {
  scanf("%d", &n);
  scanf(" %s", sequence); // one of the fastest ways to read a string
  for(int i=0; i < n; ++i) 
    letter_positions[sequence[i]].push_back(i);
  int best = 0;
  for (char i = 'a'; i <= 'z'; ++i) {
    for(char  j = 'a'; j <= 'z'; ++j) {
      if(i==j) continue;
      string text; // we will store a text that only contains i and j 
                   // letters preserving the original order
      vector<int>::iterator small = letter_positions[i].begin(), 
                                  big = letter_positions[j].begin();
      while(small != letter_positions[i].end() || 
            big != letter_positions[j].end()) {
        if(big == letter_positions[j].end() ||
           (small != letter_positions[i].end() && *small < *big)) {
          text.push_back(i); ++small;
        } else { 
          text.push_back(j); ++big;
        }
      }
      best = max(max(calculate_maximum_difference(text, i, j, true),
                 calculate_maximum_difference(text, i, j, false)), 
             best);
    }
  }
  printf("%d\n", best);
}
This time when we send it, we wait, wait, wait and yes!!!!! It gets a 100/100. But why? Was the optimization that good? What is the complexity now? Hmm... hard to tell. But intuitively, the biggest amount of work will be done if we do substantial amount of work for each pair of letter. How to construct such a case? Let's have equally many letters of each kind. So we have $\Theta(n/ \Omega)$ letters of each kind. For each of $\Omega^2$ pairs we do approximately $\Theta(n/ \Omega)$ operations. So the final complexity is $\Theta(\Omega n)$. That's about $26000000$ operations, which is not that bad, assuming we keep our code tidy. So yeah, it works! Yeey!

Bugs in my mind: (in this section i describe the problems I had while solving this problem):
  • First of all it was hard to came up with the right ideas. I kept thinking about standard approaches to such problems: binary search length of the substring, crawl along it, do some other standard techniques. Only after about 20 minutes it stroke me that we actually can do sth for each pair and consider only relevant letters ant there for decrease complexity by a factor of $\Omega$. Then rest followed.
  • I also spend quite some time on getting the discarding/non-discarding to work. I was trying some more general approaches and each of them failed. Now it seems trivial, but it took me about an half an hour of tries and looking at the tests that I kept getting wrong answer on to actually fix that.
  • The last very, very stupid error I made that was about 15 minute to debug is that I created vector instead of vector initially.

1 comment: