Find a duplicate, Space Edition™.
We have an array of integers, where:

1. the integers are in the range 1..n
2. the array has a length of n+1
It follows that our array has at least one integer which appears at least twice. But it may have several duplicates, and each duplicate may appear more than twice.
Write a function which finds an integer that appears more than once in our array. (If there are multiple duplicates, you only need to find one of them.)
We're going to run this function on our new, super-hip MacBook Pro With Retina Display™. Thing is, the damn thing came with the RAM soldered right to the motherboard, so we can't upgrade our RAM. So we need to optimize for space!
We can do this in O(1) space.
We can do this in less than O(n^2) time while keeping O(1) space.
We can do this in O(n\lg{n}) time and O(1) space.
We can do this without destroying the input.
Most O(n\lg{n}) algorithms repeatedly double something or cut something in half. How can we rule out half of the numbers each time we iterate through the array?
This one's a classic! We just do one walk through the array, using a set to keep track of which items we've seen!
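The one-walk-with-a-set idea might look like this (a Python sketch; the function name is our own):

```python
def find_repeat_with_set(numbers):
    """Return a duplicate using one pass and a set of items we've seen."""
    seen = set()
    for number in numbers:
        if number in seen:
            return number
        seen.add(number)
    raise ValueError('no duplicate!')
```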
Bam. O(n) time and ... O(n) space ...
Right, we're supposed to optimize for space. O(n) is actually kinda high space-wise. Hm. We can probably get O(1) ...
We can "brute force" this by taking each number in the range 1..n and, for each, walking through the array to see if it appears twice.
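That brute force might be sketched like so (Python; illustrative names):

```python
def find_repeat_brute_force(numbers):
    """Check each candidate in 1..n against the whole array."""
    n = len(numbers) - 1
    for candidate in range(1, n + 1):
        # count how many times this candidate appears
        appearances = 0
        for number in numbers:
            if number == candidate:
                appearances += 1
        if appearances >= 2:
            return candidate
    raise ValueError('no duplicate!')
```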
This is O(1) space and O(n^2) time.
That space complexity can't be beat, but the time cost seems a bit high. Can we do better?
One way to beat O(n^2) time is to get O(n\lg{n}) time. Sorting takes O(n\lg{n}) time. And if we sorted the array, any duplicates would be right next to each other!
But if we start off by sorting our array we'll need to take O(n) space to store the sorted array...
...unless we sort the input array in place!
Okay, so this'll work:
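A sketch of that in-place sorting approach (Python; note one wrinkle: Python's built-in sort is Timsort, which can use up to O(n) extra space in the worst case, so a truly in-place sort like heapsort would be needed to fully meet the space goal):

```python
def find_repeat_destructively(numbers):
    """Sort the input in place, then scan for adjacent equal values."""
    numbers.sort()  # sorts in place, destroying the input's original order
    for i in range(len(numbers) - 1):
        if numbers[i] == numbers[i + 1]:
            return numbers[i]
    raise ValueError('no duplicate!')
```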
This'll keep us at O(1) space and bring us down to O(n\lg{n}) time.
But destroying the input is kind of a drag—it might cause problems elsewhere in our code. Can we maintain this time and space cost without destroying the input?
Let's take a step back. How can we break this problem down into subproblems?
If we're going to do O(n\lg{n}) time, we'll probably be iteratively doubling something or iteratively cutting something in half. That's how we usually get a "\lg{n}". So what if we could cut the problem in half somehow?
Well, binary search works by cutting the problem in half after figuring out which half of our input array holds the answer.
But in a binary search, the reason we can confidently say which half has the answer is because the array is sorted. For this problem, when we cut our unsorted array in half we can't really make any strong statements about which elements are in the left half and which are in the right half.
What if we could cut the problem in half a different way, other than cutting the array in half?
With this problem, we're looking for a needle (a repeated number) in a haystack (array). What if instead of cutting the haystack in half, we cut the set of possibilities for the needle in half?
The full range of possibilities for our needle is 1..n. How could we test whether the actual needle is in the first half of that range (1..\frac{n}{2}) or the second half (\frac{n}{2}+1..n)?
A quick note about how we're defining our ranges: when we take \frac{n}{2} we're doing integer division, so we throw away the remainder. To see what's going on, we should look at what happens when n is even and when n is odd. For example, if n = 6 the two ranges are 1..3 and 4..6, while if n = 7 they're 1..3 and 4..7.
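Those two cases can be checked concretely (a quick Python sketch; the function name is our own):

```python
def split_range(n):
    """Split the range 1..n into a lower and upper half using integer division."""
    midpoint = n // 2  # integer division throws away the remainder
    return (1, midpoint), (midpoint + 1, n)

# even n: split_range(6) gives (1, 3) and (4, 6)
# odd n:  split_range(7) gives (1, 3) and (4, 7)
```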
So we can notice a few properties about our ranges:

1. The two ranges are mutually exclusive, and together they cover the full range 1..n.
2. When n is even, each range holds \frac{n}{2} possibilities.
3. When n is odd, the lower range holds \frac{n}{2} possibilities and the upper range holds \frac{n}{2}+1.
So, how do we know if the needle is in 1..\frac{n}{2} or \frac{n}{2}+1..n?
Think about the original problem statement. We know that we have at least one repeat because there are n+1 items and they are all in the range 1..n, which contains only n distinct integers.
This notion of "we have more items than we have possibilities, so we must have at least one repeat" is pretty powerful. It's sometimes called the pigeonhole principle. Can we exploit the pigeonhole principle to see which half of our range contains a repeat?
Imagine that we separated the input array into two subarrays—one containing the items in the range 1..\frac{n}{2} and the other containing the items in the range \frac{n}{2}+1..n.
Each subarray has a number of elements as well as a number of possible distinct integers (that is, the length of the range of possible integers it holds).
Given what we know about the number of elements vs the number of possible distinct integers in the original input array, what can we say about the number of elements vs the number of distinct possible integers in these subarrays?
The sum of the subarrays' numbers of elements is n+1 (the number of elements in the original input array) and the sum of the subarrays' numbers of possible distinct integers is n (the number of possible distinct integers in the original input array).
Since the sum of the subarrays' numbers of elements must be 1 greater than the sum of the subarrays' numbers of possible distinct integers, one of the subarrays must have at least one more element than it has possible distinct integers.
Not convinced? We can prove this by contradiction. Suppose neither array had more elements than it had possible distinct integers. In other words, both arrays have at most the same number of items as they have distinct possibilities. The sum of their numbers of items would then be at most the total number of possibilities across each of them, which is n. This is a contradiction—we know that our total number of items from the original input array is n+1, which is greater than n.
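The same argument in symbols (the variable names here are just for this sketch):

```latex
\begin{aligned}
& m_1 + m_2 = n + 1 \quad \text{(elements in each subarray)}, \qquad
  d_1 + d_2 = n \quad \text{(distinct possibilities in each subarray)}. \\
& \text{Suppose } m_1 \le d_1 \text{ and } m_2 \le d_2.
  \text{ Then } m_1 + m_2 \le d_1 + d_2 = n < n + 1. \text{ Contradiction.}
\end{aligned}
```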
Now that we know one of our subarrays has at least one more element than it has distinct possibilities, we know that subarray must have at least one duplicate, by the same pigeonhole argument we used on the original input array.
So once we know which half of the range has more items than distinct possibilities, we can apply this same approach recursively, cutting that range into two halves, and so on, until we have just one number left in our range. That number must be a duplicate.
Of course, we don't need to actually separate our array into subarrays. All we care about is how long each subarray would be. So we can simply do one walk through the input array, counting the number of items that would be in each subarray.
Can you formalize this in code?
Careful—if we do this recursively, we'll incur an O(\lg{n}) space cost in the call stack! Do it iteratively instead.
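Here's one way it might look as an iterative Python sketch (function and variable names are our own):

```python
def find_repeat(numbers):
    """Find a duplicate in a list of n+1 integers drawn from the range 1..n.

    Repeatedly halves the range of possible answers, keeping whichever half
    holds more items than it has distinct possibilities.
    """
    floor, ceiling = 1, len(numbers) - 1  # the range of possible answers: 1..n

    while floor < ceiling:
        midpoint = floor + (ceiling - floor) // 2
        lower_range_floor, lower_range_ceiling = floor, midpoint
        upper_range_floor, upper_range_ceiling = midpoint + 1, ceiling

        # count how many items fall in the lower half of the range
        items_in_lower_range = sum(
            1 for item in numbers
            if lower_range_floor <= item <= lower_range_ceiling
        )

        distinct_possible_integers_in_lower_range = (
            lower_range_ceiling - lower_range_floor + 1
        )

        if items_in_lower_range > distinct_possible_integers_in_lower_range:
            # pigeonhole: the lower half of the range must contain a duplicate
            floor, ceiling = lower_range_floor, lower_range_ceiling
        else:
            # otherwise the duplicate must be in the upper half
            floor, ceiling = upper_range_floor, upper_range_ceiling

    return floor
```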
Our approach is similar to a binary search, except we divide the range of possible answers in half at each step, rather than dividing the array in half.
O(1) space and O(n\lg{n}) time.
Tricky as this solution is, we can actually do even better, getting our runtime down to O(n) while keeping our space cost at O(1). The solution is NUTS; it's probably outside the scope of what most interviewers would expect. But for the curious...here it is!
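A sketch of that idea in Python: treat each value as a pointer from its (1-based) position to another position. Since values are in 1..n but positions run up to n+1, walking these pointers from position n+1 must eventually enter a cycle, and the position where the cycle begins is a duplicated value. (Names here are illustrative.)

```python
def find_duplicate_beast_mode(numbers):
    """O(n) time, O(1) space: treat values as pointers and find the cycle's entry."""
    n = len(numbers) - 1

    # Step 1: start at position n+1 and take n steps;
    # this lands on a position guaranteed to be inside a cycle
    position_in_cycle = n + 1
    for _ in range(n):
        position_in_cycle = numbers[position_in_cycle - 1]  # positions are 1-based

    # Step 2: measure the length of the cycle
    remembered_position = position_in_cycle
    current_position = numbers[position_in_cycle - 1]
    cycle_length = 1
    while current_position != remembered_position:
        current_position = numbers[current_position - 1]
        cycle_length += 1

    # Step 3: advance one pointer cycle_length steps ahead of the other;
    # walking them in lockstep, they meet at the first position in the cycle
    pointer_start = n + 1
    pointer_ahead = n + 1
    for _ in range(cycle_length):
        pointer_ahead = numbers[pointer_ahead - 1]
    while pointer_start != pointer_ahead:
        pointer_start = numbers[pointer_start - 1]
        pointer_ahead = numbers[pointer_ahead - 1]

    # the first position in the cycle equals the duplicated value
    return pointer_start
```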
This function always returns one duplicate, but there may be several duplicates. Write a function that returns all duplicates.
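If we relax the space constraint, one simple sketch for this bonus (a single pass with two sets, so O(n) space; doing it in O(1) space is a harder exercise):

```python
def find_all_duplicates(numbers):
    """Return the set of every value that appears more than once."""
    seen = set()
    duplicates = set()
    for number in numbers:
        if number in seen:
            duplicates.add(number)
        seen.add(number)
    return duplicates
```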
Our answer was a modified binary search. We got there by reasoning about the expected runtime: to beat the O(n^2) brute force we aimed for O(n\lg{n}), and \lg{n} runtimes usually come from repeatedly cutting something in half; here, that something was the range of possible answers.
Starting with a target runtime and working backward from there can be a powerful strategy for all kinds of coding interview questions.