Data Science Interview Questions

Want to get better at answering Data Science and Data Analysis interview questions?

Start by practicing the questions below. Whether a question is multiple choice or live coding, we give you hints as you go and tell you whether your answers are correct.

After that, take our timed public Data Science Interview Questions Test.


1. Pet Detection

Data Science, Confusion matrix, Machine learning

A classifier that predicts if an image contains only a cat, a dog, or a llama produced the following confusion matrix:

                       True values
Predicted values    Dog    Cat    Llama
Dog                  14      2        1
Cat                   2     12        3
Llama                 5      2       19

What is the accuracy of the model, as a percentage?

Easy 
5min
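
As a check on the arithmetic: accuracy is the number of correct predictions (the diagonal of the confusion matrix) divided by the total number of predictions. A minimal sketch in plain Python, using the values from the confusion matrix above; the names `confusion` and `accuracy` are illustrative and not part of the question:

```python
# Rows are predicted classes, columns are true classes (Dog, Cat, Llama).
confusion = [
    [14, 2, 1],
    [2, 12, 3],
    [5, 2, 19],
]


def accuracy(matrix):
    """Return the share of correct predictions in a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))  # diagonal entries
    total = sum(sum(row) for row in matrix)                  # all predictions
    return correct / total


print(f"{accuracy(confusion):.0%}")  # prints 75%
```

Here 45 of the 60 images sit on the diagonal, which gives 75%.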


2. Petri Dish

Data Science, Correlation

Two bacteria cultures, A and B, were set up in two different dishes, each covering 50% of its dish. Over 20 days, bacteria A's coverage increased to 70% and bacteria B's coverage decreased to 40%:

[Figure: Petri Dish]

Easy 
5min

Which of the two bacteria's growth correlates more linearly with the number of days passed?


Approximately, what is the Pearson correlation coefficient between bacteria B's coverage and the number of days passed?


If, after 20 days, bacteria A's coverage starts to correlate less with its linear trend line, what can we say about the value of its Pearson correlation coefficient?
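
For reference, the Pearson coefficient measures how closely two series follow a straight line: it is near +1 or -1 for a tight linear trend and moves toward 0 as the points scatter. The sketch below computes it in plain Python. The question's chart is not reproduced here, so the two coverage series are hypothetical values chosen only to illustrate the behaviour, and `pearson` is an illustrative helper name:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data, NOT the values from the question's chart.
days = [0, 5, 10, 15, 20]
steady_decline = [50.0, 47.5, 45.0, 42.5, 40.0]  # exactly linear, falling
noisy_growth = [50.0, 62.0, 55.0, 71.0, 70.0]    # rising, with scatter

print(round(pearson(days, steady_decline), 3))  # a perfect falling line gives -1
print(round(pearson(days, noisy_growth), 3))    # positive, but below 1
```

A series that drifts away from its trend line behaves like the second case: the coefficient's absolute value shrinks toward 0.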

   


3. AB Test

Data Science, Bayes' theorem, Probability

Your company is running a test that is designed to compare two different versions of the company’s website.

Version A of the website is shown to 60% of users, while version B of the website is shown to the remaining 40%. The test shows that 8% of users who are presented with version A sign up for the company’s services, as compared to 4% of users who are presented with version B.

If a user signs up for the company’s services, what is the probability that they were presented with version A of the website?

Easy 
7min
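
The reasoning follows Bayes' theorem: the probability of version A given a sign-up is the share of all sign-ups that came through version A. A minimal sketch, using the percentages stated in the question; the function name `posterior_a` and its parameter names are illustrative:

```python
def posterior_a(share_a, rate_a, share_b, rate_b):
    """P(version A | sign-up) from traffic shares and sign-up rates."""
    signup_via_a = share_a * rate_a  # P(A) * P(sign-up | A)
    signup_via_b = share_b * rate_b  # P(B) * P(sign-up | B)
    return signup_via_a / (signup_via_a + signup_via_b)


# 60% of users see A and 8% of them sign up; 40% see B and 4% sign up.
print(f"{posterior_a(0.60, 0.08, 0.40, 0.04):.0%}")  # prints 75%
```

In numbers: 0.048 / (0.048 + 0.016) = 0.75.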


4. Login Table

Data Science, Python data libraries

A company stores login data and password hashes in two different containers:

  • DataFrame with columns: Id, Login, Verified.
  • Two-dimensional NumPy array where each element is an array that contains: Id and Password.

Elements on the same row/index have the same Id.

Implement the function login_table that accepts these two containers and modifies the id_name_verified DataFrame in place, so that:

  • The Verified column is removed.
  • The password from the NumPy array is added to the DataFrame as the last column, named "Password".

For example, the following code snippet:

import numpy as np
import pandas as pd

id_name_verified = pd.DataFrame([[1, "JohnDoe", True], [2, "AnnFranklin", False]], columns=["Id", "Login", "Verified"])
id_password = np.array([[1, 987340123], [2, 187031122]], np.int32)
login_table(id_name_verified, id_password)
print(id_name_verified)

Should print:

   Id        Login   Password
0   1      JohnDoe  987340123
1   2  AnnFranklin  187031122
Easy 
15min
Python 3.7.4, Pandas 0.25.1, Numpy 1.16.5, Scipy 1.3.1, Scikit-learn 0.21.3  
 


  • Example case
  • Column Verified is removed
  • Column Password is appended
  • Various DataFrames
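
One possible approach, sketched with pandas and NumPy: drop the column with `inplace=True`, then assign the second column of the array as a new DataFrame column. It relies on the statement in the question that rows at the same index share the same Id, so no join on Id is attempted; this is one way to meet the description, not a reference solution:

```python
import numpy as np
import pandas as pd


def login_table(id_name_verified, id_password):
    """Modify id_name_verified in place: drop Verified, append Password."""
    # inplace=True mutates the caller's DataFrame instead of returning a copy.
    id_name_verified.drop(columns="Verified", inplace=True)
    # Rows are assumed to be aligned by position, as the question states,
    # so the second array column can be assigned directly as the last column.
    id_name_verified["Password"] = id_password[:, 1]
```

If the two containers were not guaranteed to be aligned, merging on Id would be the safer choice, but a merge returns a new DataFrame and would not satisfy the in-place requirement by itself.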


5. Iris Classifier

Data Science, Classification, Machine learning

As part of an application for iris enthusiasts, implement the train_and_predict function, which should classify three types of irises based on four features.

The train_and_predict function accepts three parameters:

  • train_input_features - a two-dimensional NumPy array where each element is an array that contains: sepal length, sepal width, petal length, and petal width.
  • train_outputs - a one-dimensional NumPy array where each element is a number representing the species of iris which is described in the same row of train_input_features. 0 represents Iris setosa, 1 represents Iris versicolor, and 2 represents Iris virginica.
  • prediction_features - a two-dimensional NumPy array where each element is an array that contains: sepal length, sepal width, petal length, and petal width.

The function should train a classifier using train_input_features as input data and train_outputs as the expected result. After that, the function should use the trained classifier to predict labels for prediction_features and return them as an iterable (like list or numpy.ndarray). The nth position in the result should be the classification of the nth row of the prediction_features parameter.

Easy 
20min
Python 3.7.4, Pandas 0.25.1, Numpy 1.16.5, Scipy 1.3.1, Scikit-learn 0.21.3  
 


  • Accuracy on the example case is higher than or equal to 80%
  • Accuracy is higher than or equal to 75% on data with noise
  • Accuracy is higher than or equal to 85% on data with noise
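
A minimal sketch of the train-then-predict flow using scikit-learn. The choice of a k-nearest-neighbours classifier with five neighbours is an assumption made for illustration; the question does not prescribe a model, and whether this choice meets the noisy-data thresholds listed above has not been verified:

```python
from sklearn.neighbors import KNeighborsClassifier


def train_and_predict(train_input_features, train_outputs, prediction_features):
    """Fit a classifier on the training data and label the prediction rows."""
    # Model choice is an assumption; any scikit-learn classifier with
    # fit/predict could be swapped in here.
    classifier = KNeighborsClassifier(n_neighbors=5)
    classifier.fit(train_input_features, train_outputs)
    # predict returns a NumPy array with one label per row, in row order.
    return classifier.predict(prediction_features)
```

The returned array keeps the row order of prediction_features, which matches the requirement that the nth result classify the nth row.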


6. Marketing Costs

Data Science, Linear regression, Machine learning, Python data libraries

Implement the desired_marketing_expenditure function, which returns the amount of money that needs to be invested in a new marketing campaign to sell the desired number of units.

Use the data from previous marketing campaigns to estimate how the number of units sold grows linearly with the amount of money invested.

For example, for a desired 60,000 units sold and the previous campaign data from the table below, the function should return the float 250000.0.

Previous campaigns

Campaign    Marketing expenditure    Units sold
#1                        300,000        60,000
#2                        200,000        50,000
#3                        400,000        90,000
#4                        300,000        80,000
#5                        100,000        30,000
Hard 
30min
Python 3.7.4, Pandas 0.25.1, Numpy 1.16.5, Scipy 1.3.1, Scikit-learn 0.21.3  
 


  • Example case
  • Linear dependency without error
  • Linear dependency with error
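
One way to read the task: fit a least-squares line units = slope × expenditure + intercept to the previous campaigns, then solve that line for the expenditure that yields the desired units. The sketch below does this in plain Python. The parameter names and their order are assumptions, since the question does not show the function signature:

```python
def desired_marketing_expenditure(marketing_expenditure, units_sold, desired_units_sold):
    """Invert an ordinary least-squares fit of units sold on expenditure."""
    n = len(marketing_expenditure)
    mean_x = sum(marketing_expenditure) / n
    mean_y = sum(units_sold) / n
    # Ordinary least-squares slope and intercept for units = slope * x + intercept.
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(marketing_expenditure, units_sold))
    sxx = sum((x - mean_x) ** 2 for x in marketing_expenditure)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Solve desired = slope * expenditure + intercept for expenditure.
    return (desired_units_sold - intercept) / slope


# Data from the previous campaigns table.
print(desired_marketing_expenditure(
    [300000, 200000, 400000, 300000, 100000],
    [60000, 50000, 90000, 80000, 30000],
    60000,
))  # approximately 250000.0
```

For the table data the fitted line has slope 0.2 and intercept 10,000, so 60,000 units correspond to (60,000 − 10,000) / 0.2 = 250,000. The same fit could be obtained with `numpy.polyfit` or scikit-learn's `LinearRegression`.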


7. Stock Prices

Data Science, Correlation, Data aggregation, Python data libraries

You are given a list of tickers and their daily closing prices for a given period.

Implement the most_corr function that, when given each ticker's daily closing prices, returns the pair of tickers that are the most highly (linearly) correlated by daily percentage change.

Hard  
30min
Python 3.7.4, Pandas 0.25.1, Numpy 1.16.5, Scipy 1.3.1, Scikit-learn 0.21.3  
 


  • Example case
  • Small data set
  • Large data set
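
A possible sketch with pandas: convert prices to daily percentage changes, compute the pairwise Pearson correlation matrix, and pick the pair of distinct tickers with the largest coefficient. The question does not specify the input format, so this assumes a DataFrame with one column per ticker and one row per day; it also assumes that "most highly correlated" means the largest positive coefficient rather than the largest absolute value:

```python
from itertools import combinations

import pandas as pd


def most_corr(prices):
    """Return the pair of tickers whose daily percentage changes correlate most."""
    # Assumed input: DataFrame of closing prices, one column per ticker.
    changes = prices.pct_change().dropna()  # the first row has no previous day
    corr = changes.corr()                   # pairwise Pearson coefficients
    # Compare every unordered pair of distinct tickers.
    pairs = combinations(corr.columns, 2)
    return max(pairs, key=lambda pair: corr.loc[pair[0], pair[1]])


# Hypothetical prices, for illustration only.
example = pd.DataFrame({
    "A": [100.0, 102.0, 101.0, 105.0, 107.0],
    "B": [50.0, 51.0, 50.5, 52.5, 53.5],  # always half of A
    "C": [30.0, 29.0, 31.0, 30.0, 28.0],
})
print(most_corr(example))  # A and B move together in this example
```

Iterating over pairs keeps the sketch easy to follow; for a large number of tickers, masking the upper triangle of the correlation matrix would avoid the Python-level loop.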


If you feel ready, take one of our timed public Data Science Interview Questions tests:
  • Data Science Test (Easy / Hard)
  • Data Science and SQL Online Test (Easy / Hard)