Saturday, January 30, 2016

Median of integers stream

Question: Find the median from stream of unsorted integers at any given point of time. We will be receiving a continuous stream of numbers in some random order and we don’t know the stream length in advance. Write a function that finds the median of the already received numbers efficiently at any time. We will be asked to find the median multiple times. Just to recall, median is the middle element in an odd length sorted array, and the average of the middle elements in the even length case.
Answer:
Ans 1: Maintain sorted order of integers (make use of insertion sort technique). We can give the median as the middle element if the array length is odd, or the average of middle elements if the length is even. This can be done in O(1) time, but it takes O(N) time to keep the array sorted after inserting an element. We may need to shift the elements to the right, which will take O(N) time in worst case. Even though finding the position to insert the number takes O(logN) time, the overall insertion complexity is O(N) due to shifting.
Ans 2: let’s use a balanced binary search tree to avoid worst case behaviour of standard binary search trees. In a height balanced binary search tree (i.e. AVL tree) the balance factor is the difference between the heights of left and right subtrees. A node with balance factor 0, +1, or -1 is considered to be balanced. However, in our tree the balance factor won’t be height, it is the number of nodes in the left subtree minus the number of nodes in the right subtree. And only the nodes with balance factor of +1 or 0 are considered to be balanced. So, the number of nodes on the left subtree is either equal to or 1 more than the number of nodes on the right subtree, but not less. If we ensure this balance factor on every node in the tree, then the root of the tree is the median, if the number of elements is odd. In the even case, the median is the average of the root and its inorder successor, which is the leftmost descendent of its right subtree. So, complexity of insertion maintaining balance condition is O(logN) and find median operation is O(1) assuming we calculate the inorder successor of the root at every insertion if the number of nodes is even. Insertion and balancing is very similar to AVL trees. Instead of updating the heights, we update the number of nodes information. Balanced binary search trees seem to be the most optimal solution as of now, insertion is O(logN) and find median is O(1).
Ans 3: We can achieve the same complexity with a simpler and more elegant solution. We will use 2 heaps simultaneously, a max-heap and a min-heap with 2 requirements. The first requirement is that the max-heap contains the smallest half of the numbers and min-heap contains the largest half. So, the numbers in max-heap are always less than or equal to the numbers in min-heap. Let’s call this the order requirement. The second requirement is that, the number of elements in max-heap is either equal to or 1 more than the number of elements in the min-heap. So, if we received 2N elements (even) up to now, max-heap and min-heap will both contain N elements. Otherwise, if we have received 2N+1 elements (odd), max-heap will contain N+1 and min-heap N. Let’s call this the size requirement.
The max-heap for the values below the median, and a min-heap for the values above the median. When a new value arrives it is placed in the below heap (max-heap i.e. 1st half) if the value is less than or equal to the median, otherwise it is placed into the above heap (min-heap i.e. 2nd half). The heap sizes can be equal or the below heap (max-heap) has one extra. This constraint can easily be restored by shifting an element from one heap to the other. once we’re asked for the median, if the total number of received elements is odd, the median is the root of the max-heap. If it’s even, then the median is the average of the roots of the max-heap and min-heap. The median is available in constant time O(1) and insertion complexity is O(logN), which is the insertion complexity of a heap.

Sol#1

public static PriorityQueue<Integer> minHeap = new PriorityQueue<Integer>();
public static PriorityQueue<Integer> maxHeap = new PriorityQueue<Integer>(Collections.reverseOrder());
public static void addNum(int x) {
if(minHeap.size()==0 || x >= minHeap.peek()) {
minHeap.add(x);
} else {
maxHeap.add(x);
}
if(minHeap.size()>maxHeap.size()+1) {
maxHeap.add(minHeap.remove());
}
else if (maxHeap.size()>minHeap.size()) {
minHeap.add(maxHeap.remove());
}
}
public static Double findMedian() {
if(minHeap.size()>maxHeap.size()) {
return Double.valueOf(minHeap.peek());
}
else {
return (Double.valueOf(maxHeap.peek())+Double.valueOf(minHeap.peek()))/2;
}
}

public static void main(String[] args) {
// TODO Auto-generated method stub
addNum(1);
addNum(2);
System.out.println(findMedian()); // 1.5
addNum(3); 
System.out.println(findMedian()); //  2
}

Sol#2
 public PriorityQueue<Integer> minHeap = new PriorityQueue<Integer>();
 public PriorityQueue<Integer> maxHeap = new PriorityQueue<Integer>(Collections.reverseOrder());

    public void addNum(int num) {        
        // Add to appropriate heap
        if(minHeap.size()==0 ||  num > minHeap.peek()) {
            minHeap.add(num);
        } else {
                maxHeap.add(num);           
        }
        
        // Balance sizes of Min and Max Heaps
        if(minHeap.size()>maxHeap.size()+1) {
            maxHeap.add(minHeap.remove());
        } else if(maxHeap.size()>minHeap.size()){
            minHeap.add(maxHeap.remove());
        }
    }
    
    public double findMedian() {
        if(minHeap.size()>maxHeap.size()) {
            return Double.valueOf(minHeap.peek());
        }
        else {
            return (Double.valueOf(maxHeap.peek())+Double.valueOf(minHeap.peek()))/2;
        }
    }

No comments:

Post a Comment