Thursday, March 30, 2017

PySpark Basic Commands


rddRead.first() : Return the first element of the dataset.
rddRead.take(5) : Return the first n elements (here 5) of the dataset and display them on the console.
rddRead.count() : Return the number of elements in an RDD.
rdd.distinct().count() : Count distinct records.
table.columns : Display the list of columns in a table.
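A quick sketch tying these together (assumes an existing SparkContext sc and a made-up input file data.txt):

rddRead = sc.textFile("data.txt")   # hypothetical input file
print(rddRead.first())              # first element
print(rddRead.take(5))              # first 5 elements as a list
print(rddRead.count())              # total number of elements
print(rddRead.distinct().count())   # number of distinct records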




Wednesday, March 29, 2017

Pydev with Eclipse

Installing with the update site

Note: Instructions are targeted at Eclipse 4.6 onwards
To install PyDev and PyDev Extensions using the Eclipse Update Manager, you need to use the Help > Install New Software... menu.

In the next screen, add the update site(s) you want to work with from the list below:
Latest version:

Monday, March 27, 2017

Python Basic Commands

print("Hello World!")

O/P: Hello World!

# Single line Comment

''' 
Multi line comments
blah blah blah
'''

name = "Venkat"  # variable name with string value
print(name)

O/P: Venkat

# Data Types : Numbers, Strings, Lists, Tuples, and Dictionaries

name = 15 # variable name with number value

# Operators : + - * / %  ** //
# (Python 2 print syntax below; in Python 3, print needs parentheses and 10/5 returns 2.0)
print "10 + 5 =",10+5
print "10 - 5 =", 10-5
print "10 * 5 =", 10*5
print "10 / 5 =", 10/5
print "10 ** 5 =", 10**5
print "10 // 5 =", 10//5

O/P:
10 + 5 = 15
10 - 5 = 5
10 * 5 = 50
10 / 5 = 2
10 ** 5 = 100000
10 // 5 = 2

# Print 5 new lines
print "\n" * 5

# Lists
fruits_list = ["mango", "orange", "apple", "banana"]
# print first fruit
print fruits_list[0]
# print second and third fruits (slice [1:3])
print fruits_list[1:3]

other_events = ["pick up kids", "wash car", "check balance"]
# List of lists
to_do_list = [fruits_list, other_events]
print to_do_list
#print third item of second list
print to_do_list[1][2]

# Append item to a list
fruits_list.append("guava")
print fruits_list

# Insert item to list at specified index
fruits_list.insert(1, "avocado")
print fruits_list

# Remove item from list
fruits_list.remove("guava")

#  Sort list
fruits_list.sort()
print fruits_list

# Reverse list
fruits_list.reverse()
print fruits_list

# Delete item from list
del fruits_list[3]
print fruits_list

O/P:
mango
['orange', 'apple']
[['mango', 'orange', 'apple', 'banana'], ['pick up kids', 'wash car', 'check balance']]
check balance
['mango', 'orange', 'apple', 'banana', 'guava']
['mango', 'avocado', 'orange', 'apple', 'banana', 'guava']
['apple', 'avocado', 'banana', 'mango', 'orange']
['orange', 'mango', 'banana', 'avocado', 'apple']
['orange', 'mango', 'banana', 'apple']

# Tuples : Sequence of immutable Python objects.
'''
Tuples cannot be changed, unlike lists; tuples use parentheses (), whereas lists use square brackets [].
'''
new_tuple = (1,2,3,4,5)
# Convert tuple to list
new_list = list(new_tuple)
print new_list
# Convert list to tuple
new_tuple2 = tuple(new_list)
print new_tuple2

O/P:
[1, 2, 3, 4, 5]
(1, 2, 3, 4, 5)

# Dictionary : Key/value pairs represented as {key1: value1, key2: value2}
'''
Keys are unique within a dictionary while values may not be. The values of a dictionary can be of any type, but the keys must be of an immutable data type such as strings, numbers, or tuples.
'''
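A quick illustration (the names and numbers are made up):

phone_book = {"venkat": 1234, "ravi": 5678}  # string keys, number values
print phone_book["venkat"]    # look up a value by key
phone_book["ravi"] = 9999     # values can be reassigned
del phone_book["venkat"]      # remove a key/value pair

O/P: 1234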

String Format

'{2}, {1}, {0}'.format('a', 'b', 'c')
O/P:
'c, b, a'

'{0},{1},{0}'.format('abra', 'cad')   # arguments' indices can be repeated
O/P:
'abra,cad,abra'

'Coordinates: {latitude}, {longitude}'.format(latitude='37.24N', longitude='-115.81W')
O/P:
'Coordinates: 37.24N, -115.81W'

coord = {'latitude': '37.24N', 'longitude': '-115.81W'}
'Coordinates: {latitude}, {longitude}'.format(**coord)
O/P:
'Coordinates: 37.24N, -115.81W'

Strip()
str.strip([chars])
The strip() method returns a copy of the string with all of the given chars removed from the beginning and the end of the string (whitespace characters by default).
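For example (made-up string):

text = "   ***Hello***   "
print text.strip()              # whitespace stripped by default
print text.strip().strip("*")   # then strip the asterisks

O/P:
***Hello***
Hello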

AWS S3 Commands

Create buckets

$ aws s3 mb s3://bucket-name


Remove buckets which are empty
$ aws s3 rb s3://bucket-name

Remove buckets which are non-empty
$ aws s3 rb s3://bucket-name --force

List all buckets
$ aws s3 ls

List all objects and folders (prefixes) in a bucket
$ aws s3 ls s3://bucket-name

Lists the objects in bucket-name/path (objects in bucket-name filtered by the prefix path).
$ aws s3 ls s3://bucket-name/path

Copy an object into a bucket. It grants read permissions on the object to everyone and full permissions (read, readacl, and writeacl) to the account associated with user@example.com.

$ aws s3 cp file.txt s3://my-bucket/ --grants read=uri=http://acs.amazonaws.com/groups/global/AllUsers full=emailaddress=user@example.com

sync command 
$ aws s3 sync <source> <target> [--options]
$ aws s3 sync . s3://my-bucket/path
upload: MySubdirectory\MyFile3.txt to s3://my-bucket/path/MySubdirectory/MyFile3.txt
upload: MyFile2.txt to s3://my-bucket/path/MyFile2.txt
upload: MyFile1.txt to s3://my-bucket/path/MyFile1.txt

Normally, sync only copies missing or outdated files or objects between the source and target. However, you may supply the --delete option to remove files or objects from the target not present in the source.
// Sync with deletion - object is deleted from bucket
$ aws s3 sync . s3://my-bucket/path --delete
delete: s3://my-bucket/path/MyFile1.txt

The --exclude and --include options allow you to specify rules to filter the files or objects to be copied during the sync operation.

Local directory contains 3 files: MyFile1.txt, MyFile2.rtf, MyFile88.txt

$ aws s3 sync . s3://my-bucket/path --exclude '*.txt'
upload: MyFile2.rtf to s3://my-bucket/path/MyFile2.rtf

$ aws s3 sync . s3://my-bucket/path --exclude '*.txt' --include 'MyFile*.txt'
upload: MyFile1.txt to s3://my-bucket/path/MyFile1.txt
upload: MyFile88.txt to s3://my-bucket/path/MyFile88.txt
upload: MyFile2.rtf to s3://my-bucket/path/MyFile2.rtf

$ aws s3 sync . s3://my-bucket/path --exclude '*.txt' --include 'MyFile*.txt' --exclude 'MyFile?.txt'
upload: MyFile2.rtf to s3://my-bucket/path/MyFile2.rtf
upload: MyFile88.txt to s3://my-bucket/path/MyFile88.txt


The s3 command set includes cp, mv, ls, and rm, and they work in similar ways to their Unix counterparts. The following are some examples.

// Copy MyFile.txt in current directory to s3://my-bucket/path
$ aws s3 cp MyFile.txt s3://my-bucket/path/

// Move all .jpg files in s3://my-bucket/path to ./MyDirectory
$ aws s3 mv s3://my-bucket/path ./MyDirectory --exclude '*' --include '*.jpg' --recursive

// List the contents of my-bucket
$ aws s3 ls s3://my-bucket

// List the contents of path in my-bucket
$ aws s3 ls s3://my-bucket/path

// Delete s3://my-bucket/path/MyFile.txt
$ aws s3 rm s3://my-bucket/path/MyFile.txt

// Delete s3://my-bucket/path and all of its contents
$ aws s3 rm s3://my-bucket/path --recursive
When the --recursive option is used on a directory/folder with cp, mv, or rm, the command walks the directory tree, including all subdirectories.

// List files in human-readable form with sizes in KB/MB/GB
$ aws s3 ls s3://mybucket/path --recursive --human-readable --summarize

--human-readable displays file sizes in Bytes/MiB/KiB/GiB/TiB/PiB/EiB. --summarize displays the total number of objects and total size at the end of the result listing.

Wednesday, March 15, 2017

How to run Scala or Python scripts in Spark

How to run Python script in Spark?
$ spark-submit test.py

How to run a Scala script in Spark?
$ spark-shell -i test.scala

Friday, March 10, 2017

What is database shard?

A database shard is a horizontal partition of data in a database or search engine. Each individual partition is referred to as a shard or database shard. Each shard is held on a separate database server instance, to spread load.
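A toy sketch of hash-based shard routing in Python (server names and the key are made up):

shard_servers = ["db0.example.com", "db1.example.com", "db2.example.com"]

def shard_for(user_id):
    # The same key always maps to the same shard
    return shard_servers[hash(user_id) % len(shard_servers)]

print(shard_for(42))   # e.g. db0.example.com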




What is COGROUP in Pig?

The COGROUP operator works more or less in the same way as the GROUP operator. The only difference between the two operators is that the GROUP operator is normally used with one relation, while the COGROUP operator is used in statements involving two or more relations.

grunt> cogroup_data = COGROUP student_details by age, employee_details by age;

Tuesday, March 7, 2017

Basic Spark questions

What are ways to create an RDD in Spark?

Ans: There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
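For example, in PySpark (assuming an existing SparkContext sc; the file path is hypothetical):

data = [1, 2, 3, 4, 5]
rdd1 = sc.parallelize(data)                 # from an existing collection in the driver
rdd2 = sc.textFile("hdfs:///path/to/file")  # from external storage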

How many partitions are created in Spark?

By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
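For example (hypothetical path):

rdd = sc.textFile("hdfs:///path/to/file", minPartitions=10)  # ask for at least 10 partitions
print(rdd.getNumPartitions())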

What are RDD Operations?

RDDs support two types of operations: 

  • transformations, which create a new dataset from an existing one
  • actions, which return a value to the driver program after running a computation on the dataset.
For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program.

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
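A small PySpark sketch of the lazy map/reduce flow described above (assumes a SparkContext sc and a made-up file):

lines = sc.textFile("data.txt")             # nothing is read yet
lengths = lines.map(lambda s: len(s))       # transformation: still lazy
lengths.persist()                           # keep the mapped RDD in memory for reuse
total = lengths.reduce(lambda a, b: a + b)  # action: triggers the actual computation
print(total)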

What is Accumulator?

Accumulators in Spark are used specifically to provide a mechanism for safely updating a variable when execution is split up across worker nodes in a cluster.
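A minimal PySpark example:

acc = sc.accumulator(0)
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: acc.add(x))
print(acc.value)   # 10 -- only the driver can read the value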

How do you print few elements of RDD?

rdd.take(100).foreach(println)

Removing Data
Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.

What is difference between Coalesce and Repartition?

coalesce(numPartitions) : Decrease the number of partitions in the RDD to numPartitions.
Useful for running operations more efficiently after filtering down a large dataset.
You can try to increase the number of partitions with coalesce, but it won't work (coalesce can only reduce the partition count).
repartition(numPartitions) : Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
The repartition algorithm does a full shuffle of the data and creates roughly equal-sized partitions.
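A quick way to see the difference (assuming a SparkContext sc):

rdd = sc.parallelize(range(100), 8)
print(rdd.coalesce(2).getNumPartitions())      # 2 -- decreased without a full shuffle
print(rdd.coalesce(16).getNumPartitions())     # still 8 -- coalesce cannot increase
print(rdd.repartition(16).getNumPartitions())  # 16 -- full shuffle over the network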

Monday, March 6, 2017

How to get column names in an Oracle database?


select COLUMN_NAME from ALL_TAB_COLUMNS where TABLE_NAME='abc';

Program to print last 10 lines

Given some text lines in one string, where each line is separated by the '\n' character, print the last ten lines. If the number of lines is less than 10, then print all lines.

Sol 1:

Keep a queue of size 10 and insert each line into it.
If there are more than 10 lines, remove the oldest line from the queue before inserting the next one.
When the input ends, print the strings remaining in the queue.

// br is a BufferedReader over the input. Queue is an interface, so instantiate a LinkedList.
Queue<String> lines = new LinkedList<String>();
for (String tmp; (tmp = br.readLine()) != null;) {
    if (lines.size() >= 10)
        lines.remove();   // drop the oldest line
    lines.add(tmp);
}

Sol 2:

public String[] lastNLines(String fileName) throws IOException {
    BufferedReader bufferReader = null;
    String[] lines = new String[10];   // circular buffer holding the last 10 lines
    try {
        bufferReader = new BufferedReader(new FileReader(fileName));
        int count = 0;
        String line = null;
        while ((line = bufferReader.readLine()) != null) {
            lines[count % lines.length] = line;   // overwrite the oldest slot
            count++;
        }
        // When more than 10 lines were read, the oldest line is at index count % 10
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (bufferReader != null)
            bufferReader.close();
    }
    return lines;
}

Design Facebook Chat Function

How to design Facebook chat function?

First and foremost, as I mentioned in previous posts, system design interviews can be extremely diversified. It’s mostly up to the interviewer to decide which direction to discuss. As a result, different interviewers can have completely different discussions even with the same question and you should never expect this article to be something like a standard answer.

Basic infrastructure

It’s better to have a high-level solution and talk about the overall infrastructure. If you have no prior experience with messaging app, you might find it not easy to come up with a basic solution. But that’s totally fine. Let’s have a very naive solution and optimize it later.
Basically, one of the most common ways to build a messaging app is to have a chat server that acts as the core of the whole system. When a message comes, it won’t be sent to the receiver directly. Instead, it goes to the chat server and is stored there first. And then, based on the receiver’s status, the server may send the message immediately to him or send a push notification.
A more detailed flow works like this:
  • User A wants to send the message "Hello Gainlo" to user B. A first sends the message to the chat server.
  • The chat server receives the message and sends an acknowledgement back to A, meaning the message is received. Based on the product, the front end may display a single check mark in A’s UI.
  • Case 1: if B is online and connected to the chat server, that’s great. The chat server just sends the message to B.
  • Case 2: If B is not online, the chat server sends a push notification to B.
  • B receives the message and sends back an acknowledgement to the chat server.
  • The chat server notifies A that B received the message and updates with a double check mark in A’s UI.

Real-time

The whole system can be costly and inefficient once it's scaled to a certain level. So is there any way we can optimize the system in order to support a huge number of concurrent requests?
There are many approaches. One obvious cost here is that when delivering messages to the receiver, the chat server might need to spawn an OS process/thread, initialize an HTTP (or other protocol) request and close the connection at the end. In fact, this happens for every message. Even if we do it the other way around, with the receiver repeatedly polling the server to check if there's any new message, it's still costly.
One solution is to use HTTP persistent connections. In a nutshell, receivers can make an HTTP GET request over a persistent connection that doesn't return until the chat server has data to send back. The request is re-established whenever it times out or is interrupted. This approach provides a lot of advantages in terms of response time, throughput and cost.
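A rough sketch of the receiver side of long polling in Python (the endpoint and the handle function are hypothetical):

import requests

def poll_messages(user_id):
    while True:
        try:
            # The server holds this request open until it has data for us
            resp = requests.get("https://chat.example.com/poll",
                                params={"user": user_id}, timeout=60)
            if resp.status_code == 200:
                handle(resp.json())   # hypothetical handler for new messages
        except requests.exceptions.Timeout:
            pass                      # timed out with no data; just reconnect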

Online notification

Another cool feature of Facebook chat is showing online friends. Although the feature seems simple at first glance, it improves user experience tremendously and is definitely worth discussing. If you are asked to design this feature, how would you do it?
Obviously, the most straightforward approach is that once a user is online, he sends a notification to all his friends. But how would you evaluate the cost of this?
When it’s at the peak time, we roughly need O(average number of friends * peak users) of requests, which can be a lot when there are millions of users. And this cost can be even more than the message cost itself. One idea to improve this is to reduce unnecessary requests. For instance, we can issue notification only when this user reloads a page or sends a message. In other words, we can limit the scope to only “very active users”. Or we won’t send notification until a user has been online for 5min. This solves the cases where a user shows online and immediately goes offline.

How to Design Twitter

Let’s get started with the problem – how to design Twitter.
System design questions are usually very general, thus not well-defined. That’s why many people don’t know how to get started.
We will only design core features of Twitter instead of everything.
So the whole product should allow people to follow each other and view others' feeds. It's as simple as that. (If any other feature is needed, the interviewer should clarify.) Anything else, like registration, moments, security etc., is out of the scope of discussion.
High-level solution
As we said before, don’t jump into all the details immediately, which will confuse interviewers and yourself as well.
The common strategy I would use here is to divide the whole system into several core components. There are quite a lot of ways to divide it; for example, you can divide by frontend/backend, offline/online logic etc.
In this question, I would design solutions for the following two things: 1. Data modeling. 2. How to serve feeds.
Data modeling – If we want to use a relational database like MySQL, we can define a user object and a feed object. Two relations are also necessary. One is that users can follow each other; the other is that each feed has a user as its owner.
Serve feeds – The most straightforward way is to fetch feeds from all the people you follow and render them by time.
The interview won’t be stopped here as there are tons of details we haven’t covered yet. It’s totally up to the interviewer to decide what will be discussed next.

Detail questions

With that in mind, there can be infinite extensions from the high-level idea. So I’ll only cover a few follow up questions here.
1. When a user follows a lot of people, fetching and rendering all their feeds can be costly. How to improve this?
There are many approaches. Since Twitter has the infinite scroll feature, especially on mobile, each time we only need to fetch the most recent N feeds instead of all of them. Then there will be many details about how the pagination should be implemented.
You may also consider cache, which might also be helpful to speed things up.
2. How to detect fake users?
This can be related to machine learning. One way to do it is to identify several related features like registration date, the number of followers, the number of feeds etc. and build a machine learning system to detect if a user is fake.
3. Can we order feed by other algorithms?
There has been a lot of debate about this topic over the past few weeks. If we want to order based on users' interests, how do we design the algorithm?
I would say there are a few things we should clarify with the interviewer.
  • How to measure the algorithm? Maybe by the average time users spend on Twitter, or user interactions like favorite/retweet.
  • What signals to use to evaluate how likely the user is to like the feed? The user's relation with the author, the number of replies/retweets of this feed, the number of followers of the author etc. might be important.
  • If machine learning is used, how to design the whole system?
4. How to implement the @ feature and retweet feature?
For the @ feature, we can simply store a list of user IDs inside each feed. So when rendering your feeds, you should also include feeds that have your ID in their @ list. This adds a little bit of complexity to the rendering logic.
For the retweet feature, we can do a similar thing. Inside each feed, a feed ID (pointer) is stored, which points to the original post if there is one.
But be careful: when a user retweets a tweet that has already been retweeted, you should be able to figure out the correct logic. It is a product decision whether you want to allow many layers of retweets or only keep the original feed.
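A minimal sketch of such a feed object (the field names are made up):

class Feed(object):
    def __init__(self, feed_id, author_id, text,
                 mentioned_user_ids=None, original_feed_id=None):
        self.feed_id = feed_id
        self.author_id = author_id
        self.text = text
        self.mentioned_user_ids = mentioned_user_ids or []  # the @ list
        self.original_feed_id = original_feed_id  # points to the original post for retweets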
Twitter is already a huge product with tons of features. In this post, we’re gonna talk about how to design specific features of Twitter from the system design interview perspective.

Trending topics

Twitter shows trending topics at both the search page and your left column of the home page (maybe somewhere else as well). Clicking each topic will direct you to all related tweets.
The question would be how to design this feature.
If you remember what we have said in the previous post, it’s recommended to have high-level approach first. In a nutshell, I would divide the problem into two subproblems: 1. How to get trending topic candidates? 2. How to rank those candidates?
For topic candidates, there are various ideas. We can get the most frequent hashtags over the last N hours. We can also get the hottest search queries. Or we may even fetch the recent most popular feeds and extract some common words or phrases. But personally, I would go with the first two approaches.
Ranking can be interesting. The most straightforward way is to rank based on frequency. But we can further improve it. For instance, we can integrate signals like reply/retweet/favorite numbers, freshness. We may also add some personalized signals like whether there are many follows/followers talking about the topic.

Who to follow

Twitter also shows you suggestions about who to follow. Actually, this is a core feature that plays an important role in user onboarding and engagement.
If you play around with the feature, you will notice that there are mainly two kinds of people Twitter will show you – people you may know (friends) and famous accounts (celebrities/brands…).
It won't be hard to get all these candidates, as you can just search through the user's "following graph", and people within 2 or 3 steps away are great candidates. Also, accounts with the most followers can be included.
The question would be how to rank them given that each time we can only show a few suggestions. I would lean toward using a machine learning system to do that.
There are tons of features we can use, e.g. whether the other person has followed this user, the number of common follows/followers, any overlap in basic information (like location) and so on and so forth.
This is a complicated problem and there are various follow-up questions:
  • How to scale the system when there are millions/billions of users?
  • How to evaluate the system?
  • How to design the same feature for Facebook (a bi-directional relationship)?

Moments

Twitter shows you what’s trending now in hashtags. The feature is more complicated than trending topics and I think it’s necessary to briefly explain here.
Basically, Moments will show you a list of interesting topics for different categories (news, sports, fun etc.). For each topic, you will also get several top tweets discussing it. So it’s a great way to explore what’s going on at the current moment.
I’m pretty sure that there are a lot of ways to design this system. One option is to get hottest articles from news websites for the past 1-2 hours. For each article, find tweets related to it and figure out which category (news, sport, fun etc.) it belongs to. Then we can show this article as a trending topic in Moments.
Another similar approach is to get all the trending topics (same as the first section), figure out each topic's category, and show them in Moments.
For both approaches, we would have the following three subproblems to solve: A. Categorize each tweet/topic to a category (news, sports etc.) B. Generate and rank trending topics at current moment C. Generate and rank tweets for each topic.
For A, we can pre-define several topics and do supervised learning, or we may also consider clustering. In fact, the text of tweets, users' profiles and followers' comments contain a lot of information to make the algorithm accurate.
For B and C, since it’s similar to the first section of this post, I won’t talk about it now.

Search

Twitter’s search feature is another popular function that people use every day. If you totally have no idea about how search engine works, you may take a look at this tutorial.
If we limit our discussion only to the general feed search function (excluding users search and advanced search), the high-level approach can be pretty similar to Google search except that you don’t need to crawl the web. Basically, you need to build indexing, ranking and retrieval.
Things become quite interesting if you dig into how to design the ranking algorithm. Unlike Google, Twitter search may care more about freshness and social signals.
The most straightforward approach is to give each feature/signal a weight and then compute a ranking score for each tweet. Then we can just rank them by the score. Features can include reply/retweet/favorite numbers, relevance, freshness, users popularity etc..
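A naive version of such a scoring function (the weights and feature names are made up):

def ranking_score(tweet):
    weights = {"replies": 1.0, "retweets": 2.0, "favorites": 1.5,
               "relevance": 5.0, "freshness": 3.0, "author_popularity": 0.5}
    return sum(weights[name] * tweet[name] for name in weights)

# candidate_tweets is a hypothetical list of feature dicts
results = sorted(candidate_tweets, key=ranking_score, reverse=True)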
But how do we evaluate the ranking and search? I think it's better to define a few core metrics, like the total number of searches per day or tweet click events that follow a search, and observe these metrics every day. They are also stats we should watch whatever changes are made.