Sunday, October 11, 2015

Interview questions for Data Scientists

In preparation for a coding interview I collected numerous coding questions for Data Scientist. The question are not sorted and I have done little to clean them up. Still they might be useful for others:

What is: lift, KPI, robustness, model fitting, design of experiments, 80/20 rule?

What is: collaborative filtering, n-grams, map reduce, cosine distance?

How to optimize a web crawler to run much faster, extract better information, and better summarize data to produce cleaner databases?

How would you come up with a solution to identify plagiarism?

How to detect individual paid accounts shared by multiple users?

Should click data be handled in real time? Why? In which contexts?

What is better: good data or good models? And how do you define "good"? Is there a universal good model? Are there any models that are definitely not so good?

What is probabilistic merging (AKA fuzzy merging)? Is it easier to handle with SQL or other languages? Which languages would you choose for semi-structured text data reconciliation?

How do you handle missing data? What imputation techniques do you recommend?

Compare SAS, R, Python, Perl

What is the curse of big data?

You are about to send one million email (marketing campaign). How do you optimze delivery? How do you optimize response? Can you optimize both separately? (answer: not really)

How would you turn unstructured data into structured data? Is it really necessary? Is it OK to store data as flat text files rather than in an SQL-powered RDBMS?

What are hash table collisions? How is it avoided? How frequently does it happen?

How to make sure a mapreduce application has good load balance? What is load balance?

Examples where mapreduce does not work? Examples where it works very well?

Is it better to have 100 small hash tables or one big hash table, in memory, in terms of access speed (assuming both fit within RAM)? What do you think about in-database analytics?

Why is naive Bayes so bad? How would you improve a spam detection algorithm that uses naive Bayes?

What is star schema? Lookup tables?

Define: quality assurance, six sigma, design of experiments. Give examples of good and bad designs of experiments.

What are the drawbacks of general linear model? Are you familiar with alternatives (Lasso, ridge regression, boosted trees)?

Do you think 50 small decision trees are better than a large one? Why?

Give examples of data that does not have a Gaussian distribution, nor log-normal. Give examples of data that has a very chaotic distribution?

Why is mean square error a bad measure of model performance? What would you suggest instead?

How can you prove that one improvement you've brought to an algorithm is really an improvement over not doing anything? Are you familiar with A/B testing?

What is sensitivity analysis? Is it better to have low sensitivity (that is, great robustness) and low predictive power, or the other way around? How to perform good cross-validation? What do you think about the idea of injecting noise in your data set to test the sensitivity of your models?

Compare logistic regression w. decision trees, neural networks. How have these technologies been vastly improved over the last 15 years?

Do you know / used data reduction techniques other than PCA? What do you think of step-wise regression? What kind of step-wise techniques are you familiar with? When is full data better than reduced data or sample?

How would you build non parametric confidence intervals, e.g. for scores? (see the AnalyticBridge theorem)

Are you familiar either with extreme value theory, monte carlo simulations or mathematical statistics (or anything else) to correctly estimate the chance of a very rare event?

What is root cause analysis? How to identify a cause vs. a correlation? Give examples.

How would you define and measure the predictive power of a metric?

How to detect the best rule set for a fraud detection scoring technology? How do you deal with rule redundancy, rule discovery, and the combinatorial nature of the problem (for finding optimum rule set - the one with best predictive power)? Can an approximate solution to the rule set problem be OK? How would you find an OK approximate solution? How would you decide it is good enough and stop looking for a better one?

What is POC (proof of concept)?

Is it better to have too many false positives, or too many false negatives?

Are you familiar with pricing optimization, price elasticity, inventory management, competitive intelligence? Give examples.

How does Zillow's algorithm work? (to estimate the value of any home in US)

Have you used time series models? Cross-correlations with time lags? Correlograms? Spectral analysis? Signal processing and filtering techniques? In which context?

What is an efficiency curve? What are its drawbacks, and how can they be overcome?

What is a recommendation engine? How does it work?

What is an exact test? How and when can simulations help us when we do not use an exact test?

What is the computational complexity of a good, fast clustering algorithm? What is a good clustering algorithm? How do you determine the number of clusters? How would you perform clustering on one million unique keywords, assuming you have 10 million data points - each one consisting of two keywords, and a metric measuring how similar these two keywords are? How would you create this 10 million data points table in the first place?

Give a few examples of "best practices" in data science.

You design a robust non-parametric statistic (metric) to replace correlation or R square, that (1) is independent of sample size, (2) always between -1 and +1, and (3) based on rank statistics. How do you normalize for sample size? Write an algorithm that computes all permutations of n elements. How do you sample permutations (that is, generate tons of random permutations) when n is large, to estimate the asymptotic distribution for your newly created metric? You may use this asymptotic distribution for normalizing your metric. Do you think that an exact theoretical distribution might exist, and therefore, we should find it, and use it rather than wasting our time trying to estimate the asymptotic distribution using simulations?

More difficult, technical question related to previous one. There is an obvious one-to-one correspondence between permutations of n elements and integers between 1 and n! Design an algorithm that encodes an integer less than n! as a permutation of n elements. What would be the reverse algorithm, used to decode a permutation and transform it back into a number? Hint: An intermediate step is to use the factorial number system representation of an integer. Feel free to check this reference online to answer the question. Even better, feel free to browse the web to find the full answer to the question (this will test the candidate's ability to quickly search online and find a solution to a problem without spending hours reinventing the wheel).

You're about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it's raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that "Yes" it is raining. What is the probability that it's actually raining in Seattle?

Find the second largest element in a Binary Search Tree

Write a function that takes in two sorted lists and outputs a sorted list that is their union

Generating a sorted vector from two sorted vectors

How many people must be gathered together in a room, before you can be certain that there is a greater than 50/50 chance that at least two of them have the same birthday (

How do you take millions of users with 100's of transactions each, amongst 10k's of products and group the users together in a meaningful segments

How do you know if one algorithm is better than other?”

You are compiling a report for user content uploaded every month and notice a spike in uploads in October. In particular, a spike in picture uploads. What might you think is the cause of this, and how would you test it?

given n samples from a uniform distribution [0, d], how to estimate d?

The interview drew a hypothetical histogram of number of purchases per user.) There are a lot of users that make a small number of purchases, and few users that make a large number of purchases. So the histogram is peaked near zero and has a tail off to the left. Based on this, what do you expect the plot of average revenue per user to look like?

There are 25 horses. You can race any 5 of them at once, and all you get is the order they finished. How many races would you need to find the 3 fastest horses?

Dynamic programming/backward induction on a multi-stage decision making problem

Imagine you have N pieces of rope in a bucket. You reach in and grab one end-piece, then reach in and grab another end-piece, and tie those two together. What is the expected value of the number of loops in the bucket?

How do you test whether a new credit risk scoring model works? What data would you look at?

Given a set of specifications, make a program that would find if a given list of words was in a provided grid of letters

How would you sort a billion integers? Merge sort algorithm Palindrome test using recursion. Statistics: A/B testing process

Big data related question. Hadoop system. Top K record selection. External merge sort. Very basic JAVA questions: anonymous class. Implement some function

How would you classify a group of claim adjuster write-ups to detect fraud?

Find the median of a large dataset

Given a particular data set, suggest three algorithms, one regression, one clustering, one classification

If we were testing product X, what metrics would you look at to determine if it is a success

You're at a casino with two dice, if you roll a 5 you win, and get paid $10. What is your expected payout? If you play until you win (however long that takes) then stop, what is your expected payout?

In a social network, calculate the minimum number of steps between any two nodes

How would you test if survey responses were filled at random by certain individuals, as opposed to truthful selections

How would you build and test a metric to compare two user's ranked lists of movie/tv show preferences?

Write a SQL query to compute a frequency table of a certain attribute involving two joins. What if you want to GROUP or ORDER BY some attribute? What changes would you need to make? How would you account for NULLs?

Given two binary strings, write a function that adds them . You are not allowed to use any built in string to int conversions or parsing tools. E.g. Given "100" and "111" you should return "1011". What is the time and space complexity of your algorithm?

Lets say the population on Facebook clicks ads with a click-through-rate of P. We select a sample of size N and examine the sample's conversion rate, denoted by hat{P}, what is the minimum sample size N such that Probability( ABS(hat{P} - P) < DELTA ) = 95%. In other words (this is my translation), find the minimum sample size N such that our sample estimate hat{P} is within DELTA of the true click through rate P, with 95% confidence.

return the largest palindrome from a very long string

The graph 6-degree Bacon Number questions is their favorite.

How many birthday posts occur on Facebook on a given day?

In the midst of presenting work I had done on time-series data using Python, I was asked how to fill in missing values using PostgreSQL commands.

How to determine the size of a heel from pictures using computer analysis

SQL merges joins, how will you apply machine learning to a business case, explain the algorithm and why you chose it

What goes into the calculation of test duration

Walk us through a complex analysis or experiment that you conducted

Extract data from SQL database and attempt to find differences between fraudulent and non-fraudulent internet purchases (unsupervised)

How would you get all the unique values from an attribute in R?

What is the probability of getting a 5 on throwing dice 7 times?

Difference between box plot and histogram?

What model would you use to categorize new observations from a training set?

How to do treat colinearity? How do you inspect missing data and when are they important?

Explain what is an ROC curve

Did I use a Q-Q plot?

Given activity logs from the product, what features would you build to predict churn?

What is the GIL in Python?

Fit a model to data

Explain what is target leakage

Write down a function that received 1- s: string 2- k: int and returns the top k most frequent words in the string

There is a building with 100 floors. You are given 2 identical eggs. How do you use 2 eggs to find the threshold floor, where the egg will definitely break from any floor above floor N, including floor N itself

You randomly draw a coin from 100 coins - 1 unfair coin (head-head), 99 fair coins (head-tail) and roll it 10 times. If the result is 10 heads, whats the probability that the coin is unfair

How to build a data structure for arithmetic expressions

Given a list of integers (positive & negative), write an algorithm to find whether there's at least a pair of integers that sum up to zero. How do you improve your algorithm? (re: Big-O notation)

Derive logistic regression

Given a list of words, find all the anagrams

Given a list of integers, find all combinations that sum to a given integer

Detailed Machine learning algorithms and pattern recognition methods with their objective functions

What is your favorite model and why?

Given a huge collection of books, how would you tag each book based on genre?

How would you build profiles for user searches on google?

Write a sorting algorithm for a numerical dataset in Python?

Explain SIFT (scale invariant feature transform) workflow, Support vector Machines, Newton Optimization method.”

How do you count a table of likes with timestamps?

What was the angle between the clock hands at 3:15?

List 3 types of unit root tests in addition to the augmented Dickey-Fuller test

Probabilities when rolling two dice twice with the second roll depending on the outcome of the first, the German tank problem, breaking a 4-digit code with different prior information etc. In total: know your combinatorics and Bayes rule in-and-out!

You own a museum and consider putting up an exhibition by a new artist. How would you go about the decision? What kind of information / data would you look for, how would you use it? - You want to buy a TV on How do you make a decision? - You performed A/B testing on a change in a search engine algorithm. Although in B, the coders made a mistake which led to really poor search results, user clicked on more results as well as on more ads. How do you interpret the results, what do you do next?

How to compute an inverse matrix faster by playing around with some computational tricks

Hypothesis testing, p value, different classification algorithms, logistic regression vs neural network?

What metrics would you use to track whether Uber's strategy of using paid advertising to acquire customers works? How would you then figure out an acceptable cost of customer acquisition?

How would you go about assessing whether insider trading is occurring, using the data of your choice?

Removing the missing values. What if it causes bias? What will you do then?

Find prime numbers less than 1 billion using map reduce framework.

How do you draw a uniform random sample from a circle in polar coordinates?

If there are 30 groups of servers, each with an uptime of 0.99, and you select 1 at random, what is the probability that at least one server in the group chosen will be up?

Case study - what features would you use to determine credit risk given transaction history from the past two years. Explain a simple map reduce problem Read in a very large file of tab delimited numbers using python and count frequency of each number (don’t overthink this. Use python done in a few lines) Whats more likely: Getting at least one six in 6 rolls, at least two sixes in 12 rolls or at least 100 sixes in 600 rolls? Find all the combinations of change you can for a given amount

Newton's method to calculate the square root?

What do you know about A/B testing in the context of streaming?

1) Check if an integer is a palindrome (do not convert the integer to string) 2) Given 2 sorted arrays of integers, code to find a number from each array such that their sum are closest to some integer K? 3) What is the degree of freedom for lasso?

k-means clustering ,linear classifier,mySQL replication,stored procedure,R language

Given the set a a b b a a c c b b of unknown length, write an algorithm that figures out which occurs most frequently and with the most continuous repetition.

How do you model, a very low probability event?

Consider a game with 2 players, A and B. Player A has 8 stones, player B has 6. Game proceeds as follows. First, A rolls a fair 6-sided die, and the number on the die determines how many stones A takes over from B. Next, B rolls the same die, and the exact same thing happens in reverse. This concludes the round. Whoever has more stones at the end of the round wins and the game is over. If players end up with equal # of stones at the end of the round, it is a tie and another round ensues. What is the probability that B wins in 1, 2, ..., n rounds?

What is a statistical method to control the number of features for large sparse matrices?

How to put the same number of red and blue beans into two baskets, so if you randomly grab a basket you have a better chance to get red beans?

Write hadoop map/reduce to compute product of N matrices?

Given you have a regression with m variables and another n are available, would you want to include them? What would you do next?

How to tell the difference between classifiers. what is the difference between Lasso and Ridge regression?