Spark, Hadoop, Hive and Programming: How to Design Twitter

Let’s get started with the problem – how to design Twitter.

System design questions are usually very general and thus not well defined. That’s why many people don’t know how to get started.

We will only design the core features of Twitter instead of everything.

So the whole product should allow people to follow each other and view others’ feeds. It’s as simple as that. (If any other feature is needed, the interviewer should clarify.) Anything else, such as registration, Moments, or security, is out of the scope of this discussion.

High-level solution

As we said before, don’t jump into all the details immediately; doing so will confuse both the interviewer and yourself.

The common strategy I would use here is to divide the whole system into several core components. There are quite a few ways to divide it; for example, you can divide by frontend/backend, offline/online logic, etc.

In this question, I would design solutions for the following two things: 1. Data modeling. 2. How to serve feeds.

Data modeling – If we want to use a relational database like MySQL, we can define a user object and a feed object. Two relations are also necessary: users can follow each other, and each feed has a user as its owner.
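To make the data model concrete, here is a minimal sketch of the two objects and two relations. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and every table and column name (users, feeds, follows, owner_id and so on) is my own assumption rather than anything Twitter actually uses.

```python
import sqlite3

# Hypothetical schema: the names below are illustrative assumptions.
SCHEMA = """
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE feeds (
    id         INTEGER PRIMARY KEY,
    owner_id   INTEGER NOT NULL REFERENCES users(id),  -- each feed has one owner
    body       TEXT NOT NULL,
    created_at INTEGER NOT NULL                        -- Unix timestamp
);
CREATE TABLE follows (
    follower_id INTEGER NOT NULL REFERENCES users(id),
    followee_id INTEGER NOT NULL REFERENCES users(id),
    PRIMARY KEY (follower_id, followee_id)             -- one row per follow edge
);
"""


def create_db():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def followees_of(conn, user_id):
    """Return the IDs of everyone `user_id` follows, in ascending order."""
    rows = conn.execute(
        "SELECT followee_id FROM follows WHERE follower_id = ? ORDER BY followee_id",
        (user_id,),
    )
    return [row[0] for row in rows]
```

The follow relation is stored as a directed edge, so "A follows B" does not imply "B follows A".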

Serve feeds – The most straightforward way is to fetch feeds from all the people you follow and render them in time order, newest first.
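The straightforward approach can be sketched in a few lines. This is only an in-memory illustration: the Feed class, the home_timeline function, and the two dictionaries standing in for database tables are all hypothetical names I made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feed:
    feed_id: int
    owner_id: int
    created_at: int  # Unix timestamp
    body: str


def home_timeline(user_id, follows, feeds_by_owner):
    """Collect feeds from everyone `user_id` follows, newest first.

    follows:        dict mapping user ID -> set of followed user IDs
    feeds_by_owner: dict mapping owner ID -> list of Feed
    """
    collected = []
    for followee in follows.get(user_id, ()):
        collected.extend(feeds_by_owner.get(followee, []))
    # Sort by time, using feed_id as a tie-breaker so the order is stable.
    return sorted(collected, key=lambda f: (f.created_at, f.feed_id), reverse=True)
```

The cost grows with the number of followees and the number of feeds each one has, which is exactly what the follow-up questions dig into.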

The interview won’t stop here, as there are tons of details we haven’t covered yet. It’s totally up to the interviewer to decide what will be discussed next.

Detail questions

With that in mind, there can be endless extensions of the high-level idea, so I’ll only cover a few follow-up questions here.

1. When a user follows a lot of people, fetching and rendering all of their feeds can be costly. How can we improve this?
There are many approaches. Since Twitter has infinite scroll, especially on mobile, each time we only need to fetch the most recent N feeds instead of all of them. There are then many details to work out about how the pagination should be implemented.
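One common way to paginate an infinite scroll is a cursor: the client sends back the position of the last feed it saw and asks for the next N older ones. The sketch below assumes that approach; fetch_page and its parameters are hypothetical, and the list scan stands in for what would be an indexed query in a real database.

```python
def fetch_page(timeline, limit, before=None):
    """Return (page, next_cursor) for a timeline sorted newest first.

    timeline: list of (created_at, feed_id) tuples, newest first
    limit:    maximum number of feeds per page (must be >= 1)
    before:   cursor, i.e. the (created_at, feed_id) of the last feed
              on the previous page, or None for the first page
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    if before is None:
        start = 0
    else:
        # Find the first feed strictly older than the cursor.
        start = next(
            (i for i, key in enumerate(timeline) if key < before),
            len(timeline),
        )
    page = timeline[start:start + limit]
    # A cursor is only returned when more feeds remain after this page.
    next_cursor = page[-1] if start + limit < len(timeline) else None
    return page, next_cursor
```

Compared with offset-based paging, a cursor keeps working when new feeds arrive at the top of the timeline while the user is scrolling.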

You may also consider caching, which can help speed things up.
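As a rough illustration of the caching idea, here is a small read-through cache with a time-to-live. The TimelineCache class, the loader callback, and the 30-second default are assumptions for the sketch; a production system would more likely use a shared cache server than a per-process dictionary.

```python
import time


class TimelineCache:
    """Read-through cache: serve a stored timeline until it expires."""

    def __init__(self, loader, ttl_seconds=30.0, clock=time.monotonic):
        self._loader = loader      # function: user_id -> timeline (the slow path)
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so tests can control time
        self._entries = {}         # user_id -> (expires_at, timeline)

    def get(self, user_id):
        now = self._clock()
        entry = self._entries.get(user_id)
        if entry is not None and entry[0] > now:
            return entry[1]        # cache hit
        timeline = self._loader(user_id)
        self._entries[user_id] = (now + self._ttl, timeline)
        return timeline

    def invalidate(self, user_id):
        """Drop a cached timeline, e.g. after the user follows someone new."""
        self._entries.pop(user_id, None)
```

The trade-off is freshness: a longer TTL saves more work but shows new feeds later, unless the entry is invalidated explicitly.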

2. How to detect fake users?
This can be related to machine learning. One way to do it is to identify several related features like registration date, the number of followers, the number of feeds etc. and build a machine learning system to detect if a user is fake.
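To show what "identify features and build a model" might look like, here is a toy sketch that turns the three features mentioned above into a vector and scores it with a logistic function. The Account class, the log scaling, and the function names are my assumptions; the weights are meant to come from training on labeled accounts, which is not shown here.

```python
import math
from dataclasses import dataclass


@dataclass
class Account:
    days_since_registration: int
    follower_count: int
    feed_count: int


def features(account):
    """Map an account to a feature vector (log-scaled to tame large counts)."""
    return [
        math.log1p(account.days_since_registration),
        math.log1p(account.follower_count),
        math.log1p(account.feed_count),
    ]


def fake_probability(account, weights, bias):
    """Logistic-regression style score in (0, 1); higher means more suspicious.

    `weights` and `bias` are placeholders for values learned from labeled data.
    """
    z = bias + sum(w * x for w, x in zip(weights, features(account)))
    return 1.0 / (1.0 + math.exp(-z))
```

In an interview, the interesting part is usually where the labels come from and how to handle false positives, not the model itself.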

3. Can we order feeds by other algorithms?
There has been a lot of debate about this topic over the past few weeks. If we want to order feeds based on users’ interests, how do we design the algorithm?

I would say there are a few things we should clarify with the interviewer.

How to measure the algorithm? Maybe by the average time users spend on Twitter, or by user interactions like favorites/retweets.
What signals to use to evaluate how likely the user is to like the feed? The user’s relation with the author, the number of replies/retweets of the feed, the number of followers of the author etc. might be important.
If machine learning is used, how to design the whole system?

4. How to implement the @ feature and retweet feature?
For the @ feature, we can simply store a list of user IDs inside each feed. Then, when rendering your feeds, we should also include feeds that have your ID in their @ list. This adds a little bit of complexity to the rendering logic.

For the retweet feature, we could do a similar thing. Inside each feed, a feed ID (pointer) is stored, which indicates the original post if there is one.

But be careful: when a user retweets a tweet that is itself a retweet, you should be able to figure out the correct logic. Whether to keep many layers of retweets or only point to the original feed is a product decision.
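Here is one way the @ list and the retweet pointer could look in code. The sketch picks the "only keep the original feed" option, so a retweet of a retweet points straight at the original; that choice, along with the Feed fields and function names, is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Feed:
    feed_id: int
    owner_id: int
    body: str
    mentions: list = field(default_factory=list)  # user IDs in the @ list
    retweet_of: Optional[int] = None              # pointer to the retweeted feed


def original_feed_id(feed_id, feeds):
    """Follow retweet_of pointers until reaching a feed that is not a retweet."""
    seen = set()
    current = feeds[feed_id]
    while current.retweet_of is not None:
        if current.feed_id in seen:
            raise ValueError("retweet pointers form a cycle")
        seen.add(current.feed_id)
        current = feeds[current.retweet_of]
    return current.feed_id


def make_retweet(new_id, user_id, target_id, feeds):
    """Create a retweet that always points at the original feed (flattened)."""
    retweet = Feed(new_id, user_id, body="", retweet_of=original_feed_id(target_id, feeds))
    feeds[new_id] = retweet
    return retweet


def feeds_mentioning(user_id, feeds):
    """Return IDs of feeds whose @ list contains `user_id`."""
    return [f.feed_id for f in feeds.values() if user_id in f.mentions]
```

Keeping the full chain instead would mean storing target_id directly in make_retweet and resolving the original only at render time.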

Twitter is already a huge product with tons of features. In this post, we’re gonna talk about how to design specific features of Twitter from the system design interview perspective.

Who to follow

Twitter also shows you suggestions about who to follow. Actually, this is a core feature that plays an important role in user onboarding and engagement.

If you play around with the feature, you will notice that there are mainly two kinds of accounts that Twitter will show you – people you may know (friends) and famous accounts (celebrities/brands…).

It won’t be hard to get all these candidates, as you can just search through the user’s “following graph”, and people within 2 or 3 steps away are great candidates. Accounts with the most followers can also be included.
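Candidate generation from the following graph can be sketched as a two-step walk. The follow_candidates function and the dictionary representation of the graph are hypothetical, and the count of connecting accounts is just a simple stand-in signal, not a replacement for a learned ranking model.

```python
from collections import Counter


def follow_candidates(user, following, limit=3):
    """Suggest accounts two steps away in the following graph.

    following: dict mapping user -> set of accounts that user follows
    Candidates are ordered by how many of the user's followees follow them,
    with ties broken alphabetically so the output is deterministic.
    """
    already_followed = following.get(user, set())
    counts = Counter()
    for friend in already_followed:
        for candidate in following.get(friend, set()):
            if candidate != user and candidate not in already_followed:
                counts[candidate] += 1
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [candidate for candidate, _ in ranked[:limit]]
```

Going to 3 steps is the same idea with one more loop, but the candidate set grows quickly, which is where the scaling questions come from.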

The question would be how to rank them given that each time we can only show a few suggestions. I would lean toward using a machine learning system to do that.

There are tons of features we can use, e.g. whether the other person has followed this user, the number of common follows/followers, any overlap in basic information (like location), and so on.

This is a complicated problem and there are various follow-up questions:

How to scale the system when there are millions/billions of users?
How to evaluate the system?
How to design the same feature for Facebook (bi-directional relationship)?

Moments

Twitter shows you what’s trending now in hashtags. The Moments feature is more complicated than trending topics, so I think it’s necessary to explain it briefly here.

Basically, Moments will show you a list of interesting topics for different categories (news, sports, fun etc.). For each topic, you will also get several top tweets discussing it. So it’s a great way to explore what’s going on at the current moment.

I’m pretty sure that there are a lot of ways to design this system. One option is to get the hottest articles from news websites for the past 1-2 hours. For each article, find tweets related to it and figure out which category (news, sports, fun etc.) it belongs to. Then we can show this article as a trending topic in Moments.

Another similar approach is to get all the trending topics (same as the first section), figure out each topic’s category, and show them in Moments.

For both approaches, we would have the following three subproblems to solve:

A. Categorize each tweet/topic into a category (news, sports etc.)
B. Generate and rank trending topics at the current moment
C. Generate and rank tweets for each topic

For A, we can pre-define several categories and do supervised learning. Or we may also consider clustering. In fact, the text in tweets, users’ profiles, and followers’ comments contain a lot of information that can make the algorithm accurate.

For B and C, since it’s similar to the first section of this post, I won’t talk about it now.

Search

Twitter’s search feature is another popular function that people use every day. If you have no idea at all how a search engine works, you may take a look at this tutorial.

If we limit our discussion only to the general feed search function (excluding user search and advanced search), the high-level approach can be pretty similar to Google search except that you don’t need to crawl the web. Basically, you need to build indexing, ranking and retrieval.
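The indexing and retrieval parts can be illustrated with a tiny inverted index. This is a toy sketch: the tokenizer, the function names, and the choice that a query must match all of its terms are my assumptions, and ranking is left out on purpose.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into terms, keeping # and @ prefixes."""
    return re.findall(r"[a-z0-9#@_]+", text.lower())


def build_index(tweets):
    """Build an inverted index: term -> set of tweet IDs containing it.

    tweets: dict mapping tweet ID -> tweet text
    """
    index = defaultdict(set)
    for tweet_id, text in tweets.items():
        for term in tokenize(text):
            index[term].add(tweet_id)
    return index


def search(index, query):
    """Return IDs of tweets that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

For example, with tweets {1: "Spark on Hadoop", 2: "Hive on Hadoop"}, the query "spark hadoop" matches only tweet 1, while "hadoop" matches both.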

Things become quite interesting if you dig into how to design the ranking algorithm. Unlike Google, Twitter search may care more about freshness and social signals.

The most straightforward approach is to give each feature/signal a weight and then compute a ranking score for each tweet. Then we can just rank them by the score. Features can include reply/retweet/favorite counts, relevance, freshness, user popularity, etc.
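A weighted score like this could be sketched as follows. The weight values, the dictionary keys, and the way each signal is scaled (log for counts, a simple decay for freshness) are all made-up assumptions; in practice they would be tuned against the metrics discussed in this section.

```python
import math

# Hypothetical weights, chosen only to make the example run.
WEIGHTS = {
    "relevance": 3.0,
    "engagement": 1.0,
    "freshness": 2.0,
    "author_popularity": 0.5,
}


def ranking_score(tweet, now, weights=WEIGHTS):
    """Combine signals into one score as a weighted sum.

    tweet: dict with keys relevance, replies, retweets, favorites,
           author_followers, created_at (Unix timestamp)
    now:   current Unix timestamp
    """
    age_hours = max(0.0, (now - tweet["created_at"]) / 3600.0)
    signals = {
        "relevance": tweet["relevance"],
        "engagement": math.log1p(tweet["replies"] + tweet["retweets"] + tweet["favorites"]),
        "freshness": 1.0 / (1.0 + age_hours),  # 1.0 for a brand-new tweet, decays with age
        "author_popularity": math.log1p(tweet["author_followers"]),
    }
    return sum(weights[name] * value for name, value in signals.items())


def rank(tweets, now, weights=WEIGHTS):
    """Return tweets ordered from highest to lowest score."""
    return sorted(tweets, key=lambda t: ranking_score(t, now, weights), reverse=True)
```

Raising the freshness weight pushes recent tweets up, which is one way to express that Twitter search cares more about freshness than a general web search does.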

But how do we evaluate the ranking and search? I think it’s better to define a few core metrics, like the total number of searches per day, tweet click events that follow a search etc., and observe these metrics every day. They are also the stats we should watch whenever a change is made.

Monday, March 6, 2017
