Tuesday, December 8, 2020

Longest Substring Without Repeating Characters

Given a string s, find the length of the longest substring without repeating characters.

 Example 1:

Input: s = "abcabcbb"
Output: 3
Explanation: The answer is "abc", with the length of 3.

Example 2:

Input: s = "bbbbb"
Output: 1
Explanation: The answer is "b", with the length of 1.

Example 3:

Input: s = "pwwkew"
Output: 3
Explanation: The answer is "wke", with the length of 3.
Notice that the answer must be a substring, "pwke" is a subsequence and not a substring.

Example 4:

Input: s = ""
Output: 0
Java (Using HashMap)
class Solution {
    public int lengthOfLongestSubstring(String s) {        
        int n = s.length(), ans=0;
        Map<Character, Integer> map = new HashMap<>();
        for(int i=0, j=0; j<n; j++) {
            if(map.containsKey(s.charAt(j))) {
                i = Math.max(map.get(s.charAt(j)), i);
            }
            ans = Math.max(ans, j-i+1);
            map.put(s.charAt(j), j+1);
        }
        return ans;        
    }
}
  • Time complexity : O(n). Index j will iterate n times.

  • Space complexity (HashMap) : O(min(m, n))

Friday, October 9, 2020

Maximum Width of Binary Tree v2

 Given a binary tree, write a function to get the maximum width of the given tree. The maximum width of a tree is the maximum width among all levels.

The width of one level is defined as the length between the end-nodes (the leftmost and right most non-null nodes in the level, where the null nodes between the end-nodes are also counted into the length calculation.

It is guaranteed that the answer will in the range of 32-bit signed integer.

Example 1:

Input: 

           1
         /   \
        3     2
       / \     \  
      5   3     9 

Output: 4
Explanation: The maximum width existing in the third level with the length 4 (5,3,null,9).

Example 2:

Input: 

          1
         /  
        3    
       / \       
      5   3     

Output: 2
Explanation: The maximum width existing in the third level with the length 2 (5,3).

Example 3:

Input: 

          1
         / \
        3   2 
       /        
      5      

Output: 2
Explanation: The maximum width existing in the second level with the length 2 (3,2).

Example 4:

Input: 

          1
         / \
        3   2
       /     \  
      5       9 
     /         \
    6           7
Output: 8
Explanation:The maximum width existing in the fourth level with the length 8 (6,null,null,null,null,null,null,7).
Solution: 
public int widthOfBinaryTree(TreeNode root) {
        if (root == null) {
            return 0;
        }
        Queue<TreeNode> queue = new LinkedList<>();
        int max = 1, leftmost = 1, rightmost = 1;
        root.val = 1;
        queue.add(root);
        while (!queue.isEmpty()) {
            int size = queue.size();
            for (int i = 0; i < size; i++) {
                TreeNode cur = queue.remove();
                if (i == 0) {
                    leftmost = cur.val;
                }
                if (i == size - 1) {
                    rightmost = cur.val;
                }
                if (cur.left != null) {
                    cur.left.val = cur.val * 2;
                    queue.add(cur.left);
                }
                if (cur.right != null) {
                    cur.right.val = cur.val * 2 + 1;
                    queue.add(cur.right);
                }
            }
            max = Math.max(max, rightmost - leftmost + 1);

        }
        return max;
    }

Wednesday, September 30, 2020

Good reads

 

Good Reads

Here are some useful links for reading:

1. Dynamo - Highly Available Key-value Store

2. Kafka - A Distributed Messaging System for Log Processing

3. Consistent Hashing - Original paper

4. Paxos - Protocol for distributed consensus

5. Concurrency Controls - Optimistic methods for concurrency controls

6. Gossip protocol - For failure detection and more.

7. Chubby - Lock service for loosely-coupled distributed systems

8. ZooKeeper - Wait-free coordination for Internet-scale systems

9. MapReduce - Simplified Data Processing on Large Clusters

10. Hadoop - A Distributed File System

Saturday, September 26, 2020

Design a Key Value Store

 cover topics like basic key-value storage system, distributed key-value storage and scaling issues including sharding, all of which are possible to be covered in system design interviews.

 

Basic key-value storage

How would you design a simple key-value storage system in a single machine?

The most straightforward thing is to use a hash table to store key-value pairs, which is how most this kind of systems work nowadays. A hash table allows you to read/write a key-value pair in constant time and it’s extremely easy to use. Most languages have built-in support for this.

However, the downside is also obvious. Using a hash table usually means you need to store everything in memory, which may not be possible when the data set is big. There are two common solutions:

  • Compress your data. This should be the first thing to think about and often there are a bunch of stuff you can compress. For example, you can store reference instead of the actual data. You can also use float32 rather than float64. In addition, using different data representations like bit array (integer) or vectors can be effective as well.
  • Storing in a disk. When it’s impossible to fit everything in memory, you may store part of the data into disk. To further optimize this, you can think of the system as a cache system. Frequently visited data is kept in memory and the rest is on a disk.

 

Distributed key-value storage

The most interesting topic is definitely scaling the key-value storage into multiple machines. If you want to support big data, you will implement distributed system for sure. Let’s see how we can design a distributed key-value storage system.

If you have read Design a Cache System, you will notice that a lot of concepts here are exactly the same.

Since a single machine doesn’t have enough storage for all the data, the general idea here is to split the data into multiple machines by some rules and a coordinator machine can direct clients to the machine with requested resource. The question is how to split the data into multiple machines and more importantly, what is a good strategy to partition data?

 

Sharding

Suppose all the keys are URLs like http://gainlo.co and we have 26 machines. One approach is to divide all keys (URLs) based on the first character of the URL (after “www”) to these 26 machines. For example, http://gainlo.co will be stored at machine G and http://blog.gainlo.co will be stored at machine B. So what are disadvantages of this design?

Let’s ignore cases where URL contains ASCII characters. A good sharding algorithm should be able to balance traffic equally to all machines. In other words, each machine should receive equal requests ideally. Apparently, the above design doesn’t work well. First of all, the storage is not distributed equally. Probably there are much more URLs starting with “a” than “z”. Secondly, some URLs are much more popular like Facebook and Google.

In order to balance the traffic, you’d better make sure that keys are distributed randomly. Another solution is to use the hash of URL, which usually have much better performance. To design a good sharding algorithm, you should fully understand the application and can estimate the bottleneck of the system.

 

System availability

To evaluate a distributed system, one key metric is system availability. For instance, suppose one of our machines crashes for some reason (maybe hardware issue or program bugs), how does this affect our key-value storage system?

Apparently, if someone requests resources from this machine, we won’t be able to return the correct response. You may not consider this issue when building a side project. However, if you are serving millions of users with tons of servers, this happens quite often and you can’t afford to manually restart the server every time. This is why availability is essential in every distributed system nowadays. So how would you address this issue?

Of course you can write more robust code with test cases. However, your program will always have bugs. In addition, hardware issues are even harder to protect. The most common solution is replica. By setting machines with duplicate resources, we can significantly reduce the system downtime. If a single machine has 10% of chance to crash every month, then with a single backup machine, we reduce the probability to 1% when both are down.

 

Replica VS sharding

At first glance, replica is quite similar to sharding. So what’s the relation of these two? And how would you choose between replica and sharding when designing a distributed key-value store?

First of all, we need to be clear about the purpose of these two techniques. Sharding is basically used to splitting data to multiple machines since a single machine cannot store too much data. Replica is a way to protect the system from downtime. With that in mind, if a single machine can’t store all the data, replica won’t help.

 

Consistency

By introducing replicas, we can make the system more robust. However, one issue is about consistency. Let’s say for machine A1, we have replica A2. How do you make sure that A1 and A2 have the same data? For instance, when inserting a new entry, we need to update both machines. But it’s possible that the write operation fails in one of them. So over time, A1 and A2 might have quite a lot inconsistent data, which is a big problem.

There are a couple of solutions here. First approach is to keep a local copy in coordinator. Whenever updating a resource, the coordinator will keep the copy of updated version. So in case the update fails, the coordinator is able to re-do the operation.

Another approach is commit log. If you have been using Git, the concept of commit log should be quite familiar to you. Basically, for each node machine, it’ll keep the commit log for each operation, which is like the history of all updates. So when we want to update an entry in machine A, it will first store this request in commit log. And then a separate program will process all the commit logs in order (in a queue). Whenever an operation fails, we can easily recover as we can lookup the commit log.

The last approach I’d like to introduce is to resolve conflict in read. Suppose when the requested resource locates in A1, A2 and A3, the coordinator can ask from all three machines. If by any chance the data is different, the system can resolve the conflict on the fly.

It’s worth to note that all of these approaches are not mutually exclusive. You can definitely use multiple ones based on the application.

 

Read throughput

I’d also like to briefly mention read throughput in this post. Usually, key-value storage system should be able to support a large amount of read requests. So what approaches will you use to improve read throughput?

To improve read throughput, the common approach is always taking advantage of memory. If the data is stored in disk inside each node machine, we can move part of them in memory. A more general idea is to use cache. Since the post – design a cache system has an in-depth analysis of this topic, I won’t talk about it too much here.


Wednesday, May 20, 2020

Performance of Delta Vs Parquet file formats



spark.sql("set spark.databricks.delta.autoCompact.enabled = true")
spark.sql("set spark.databricks.delta.optimizeWrite.enabled = true")



OPTIMIZE the Databricks Delta table   
     
display(spark.sql("OPTIMIZE flights ZORDER BY (DayofWeek)"))

The query over the Databricks Delta table runs much faster after OPTIMIZE is run. 
How much faster the query runs can depend on the configuration of the cluster you are 
running on, however should be 5-10X faster compared to the standard table.

References:

https://docs.databricks.com/_static/notebooks/delta/optimize-scala.html

https://docs.databricks.com/delta/optimizations/index.html#compaction-bin-packing

https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html



Tuesday, February 4, 2020

Longest Consecutive SubArray

Given an array of unsorted integers, find the length of the longest sub-sequence (or subarray) such that elements in the subsequence are consecutive integers, the consecutive numbers can be in any order.
Examples:
Input: arr[] = {1, 9, 3, 10, 4, 20, 2}
Output: 4
 The subsequence 1, 3, 4, 2 is the longest subsequence of consecutive elements

Input: arr[] = {36, 41, 56, 35, 44, 33, 34, 92, 43, 32, 42}
Output: 5
The subsequence 36, 35, 33, 34, 32 is the longest subsequence
of consecutive elements.

Ans:

Simple Solution is to first sort the array and find the longest subarray with consecutive elements. Time complexity of this solution is O(nLogn). 

Linear Time O(n) Solution:

The idea is to use Hashing. We first insert all elements to a HashSet. Then check all the possible starts of consecutive subsequences. Below is the algorithm.
  • Create an empty HashSet.
  • Insert all array elements to HashSet.
  • Do following for every element arr[i]
  • Check if this element is the starting point of a subsequence. To check this, we simply look for arr[i] – 1 in the hashset, if not found, then this is the first element in a subsequence.
  • If this element is the first element, then count number of elements in the consecutive starting with this element. Iterate from arr[i] + 1 till the last element that can be found.
  • If the count is more than the previous longest subsequence found, then update this.
// Returns length of the longest consecutive subsequence
    Public static int findLongestConseqSubseq(int arr[],int n)
    {
        HashSet<Integer> S = new HashSet<Integer>();
        int longest = 0;
  
        // Hash all the array elements
        for (int i=0; i<n; ++i)
            S.add(arr[i]);
  
        // check each possible sequence from the start
        // then update optimal length
        for (int i=0; i<n; ++i)
        {
            // if current element is the starting
            // element of a sequence
            if (!S.contains(arr[i]-1))
            {
                // Then check for next elements in the
                // sequence
                int j = arr[i];
                while (S.contains(j))
                    j++;
  
                // update  optimal length if this length
                // is more
                if (longest <j-arr[i])
                    longest = j-arr[i];
            }
        }
        return longest;
    }