# Data Science Interview Questions

##### Practice for Data Science interviews by solving TestDome questions. Our interview questions are used by more than 5,000 companies and 450,000 individual test takers.


Try to solve the 7 Data Science and Data Analysis interview questions below. Hints can help you find answers to questions you are having trouble with.

## 1. Pet Detection

Confusion matrix, Machine learning

Easy

A classifier that predicts if an image contains only a cat, a dog, or a llama produced the following confusion matrix:

| Predicted \ True | Dog | Cat | Llama |
|---|---|---|---|
| Dog | 14 | 2 | 1 |
| Cat | 2 | 12 | 3 |
| Llama | 5 | 2 | 19 |

What is the accuracy of the model, as a percentage?
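One way to check an answer here, assuming accuracy means the share of all images whose predicted label matches the true label: sum the diagonal of the confusion matrix and divide by the total count. The sketch below hard-codes the matrix from the question; the variable names are illustrative.

```python
import numpy as np

# Rows are predicted labels, columns are true labels (Dog, Cat, Llama).
confusion = np.array([
    [14, 2, 1],
    [2, 12, 3],
    [5, 2, 19],
])

# Correct predictions sit on the diagonal.
accuracy = np.trace(confusion) / confusion.sum()
print(f"{accuracy:.0%}")  # 75%
```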


## 2. Petri Dish

Correlation

Easy

Two bacteria cultures, A and B, were set up in two different dishes, each covering 50% of its dish. Over 20 days, bacteria A's percentage of coverage increased to 70% and bacteria B's percentage of coverage reduced to 40%:

Which of the two bacteria's growth correlates more linearly with the number of days passed?

Approximately, what is the Pearson correlation coefficient between bacteria B's coverage and the number of days passed?

If, after 20 days, bacteria A's coverage starts to correlate less with its linear trend line, what can we say about the value of its Pearson correlation coefficient?
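The chart for this question is not reproduced here, so the sketch below uses made-up coverage series (not the question's data) purely to illustrate how the Pearson coefficient behaves. Under that assumption, a perfectly linear decline gives a coefficient of -1, while a curved increase gives a positive value below 1.

```python
import numpy as np

days = np.arange(0, 21)

# Hypothetical series, NOT the data from the question's chart.
linear_decline = 50 - 0.5 * days            # 50% down to 40% along a straight line
curved_growth = 50 + 20 * (days / 20) ** 2  # 50% up to 70% along a curve

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r.
r_linear = np.corrcoef(days, linear_decline)[0, 1]
r_curved = np.corrcoef(days, curved_growth)[0, 1]

print(round(r_linear, 3), round(r_curved, 3))
```

In general, the more the points scatter away from a straight line, the closer the magnitude of the coefficient moves toward 0.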

## 3. AB Test

Bayes' theorem, Probability

Easy

Your company is running a test that is designed to compare two different versions of the company’s website.

Version A of the website is shown to 60% of users, while version B of the website is shown to the remaining 40%. The test shows that 8% of users who are presented with version A sign up for the company’s services, as compared to 4% of users who are presented with version B.

If a user signs up for the company’s services, what is the probability that they were presented with version A of the website?
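A short sketch of how Bayes' theorem applies to the figures above. The variable names are illustrative; the only inputs are the traffic split and the two sign-up rates stated in the question.

```python
# Prior probabilities of seeing each version.
p_a, p_b = 0.6, 0.4
# Sign-up rates conditional on the version shown.
p_signup_given_a, p_signup_given_b = 0.08, 0.04

# Law of total probability: overall chance that a user signs up.
p_signup = p_a * p_signup_given_a + p_b * p_signup_given_b

# Bayes' theorem: P(A | signup) = P(signup | A) * P(A) / P(signup).
p_a_given_signup = p_a * p_signup_given_a / p_signup

print(round(p_a_given_signup * 100, 2))  # 75.0
```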


## 4. Login Table

Pandas

Easy

A company stores login data and password hashes in two different containers:

• DataFrame with columns: Id, Login, Verified.
• Two-dimensional NumPy array where each element is an array that contains: Id and Password.

Elements on the same row/index have the same Id.

Implement the function login_table, which accepts these two containers and modifies the id_name_verified DataFrame in place so that:

• The Verified column is removed.
• The passwords from the NumPy array are added to the DataFrame as the last column, named "Password".

For example, the following code snippet:

```python
import numpy as np
import pandas as pd
id_name_verified = pd.DataFrame([[1, "JohnDoe", True], [2, "AnnFranklin", False]], columns=["Id", "Login", "Verified"])
id_password = np.array([[1, 987340123], [2, 187031122]], np.int32)
login_table(id_name_verified, id_password)
print(id_name_verified)
```

Should print:

```
   Id        Login   Password
0   1      JohnDoe  987340123
1   2  AnnFranklin  187031122
```
Python 3.7.4, Pandas 0.25.1, Numpy 1.16.5, Scipy 1.3.1, Scikit-learn 0.21.3

Test case:

• The Verified column is removed.
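A minimal sketch of one possible solution, assuming the rows of the two containers are already aligned as the question states. The parameter names mirror the variables in the example snippet; the function mutates the DataFrame it receives and returns nothing.

```python
import numpy as np
import pandas as pd


def login_table(id_name_verified, id_password):
    # Drop the Verified column on the caller's DataFrame, not on a copy.
    id_name_verified.drop(columns="Verified", inplace=True)
    # Column 1 of the array holds the passwords; rows are assumed aligned by position.
    id_name_verified["Password"] = id_password[:, 1]


id_name_verified = pd.DataFrame(
    [[1, "JohnDoe", True], [2, "AnnFranklin", False]],
    columns=["Id", "Login", "Verified"],
)
id_password = np.array([[1, 987340123], [2, 187031122]], np.int32)

login_table(id_name_verified, id_password)
print(id_name_verified)
```

If the rows could not be trusted to line up, joining on Id would be the safer choice, but a join that builds a new DataFrame would not satisfy the in-place requirement without extra work.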

## 5. Iris Classifier

Classification, Machine learning, NumPy, Scikit-learn

Easy

As part of an application for iris enthusiasts, implement the train_and_predict function, which should classify three types of irises based on four features.

The train_and_predict function accepts three parameters:

• train_input_features - a two-dimensional NumPy array where each element is an array that contains: sepal length, sepal width, petal length, and petal width.
• train_outputs - a one-dimensional NumPy array where each element is a number representing the species of iris which is described in the same row of train_input_features. 0 represents Iris setosa, 1 represents Iris versicolor, and 2 represents Iris virginica.
• prediction_features - a two-dimensional NumPy array where each element is an array that contains: sepal length, sepal width, petal length, and petal width.

The function should train a classifier using train_input_features as input data and train_outputs as the expected result. After that, the function should use the trained classifier to predict labels for prediction_features and return them as an iterable (such as a list or numpy.ndarray). The nth position in the result should be the classification of the nth row of the prediction_features parameter.

Python 3.7.4, Pandas 0.25.1, Numpy 1.16.5, Scipy 1.3.1, Scikit-learn 0.21.3

Test cases:

• Accuracy on the example case is at least 80%.
• Accuracy on noisy data is at least 75%.
• Accuracy on noisy data is at least 85%.
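A sketch of one way to implement the function described above. The choice of a k-nearest-neighbors classifier is an assumption made for illustration; the question does not prescribe a model, and other Scikit-learn classifiers would fit the same interface. The evaluation on Scikit-learn's bundled iris dataset is likewise only a stand-in for the hidden test data.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def train_and_predict(train_input_features, train_outputs, prediction_features):
    # Assumed model choice: k-nearest neighbors with k=5.
    classifier = KNeighborsClassifier(n_neighbors=5)
    classifier.fit(train_input_features, train_outputs)
    # predict returns a numpy.ndarray with one label per input row.
    return classifier.predict(prediction_features)


# Stand-in evaluation on the bundled iris dataset.
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)
predicted = train_and_predict(X_train, y_train, X_test)
accuracy = accuracy_score(y_test, predicted)
print(accuracy)
```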

## 6. Marketing Costs

Linear regression, Machine learning, NumPy, Scikit-learn

Hard

Implement the desired_marketing_expenditure function, which returns the required amount of money that needs to be invested in a new marketing campaign to sell the desired number of units.

Use the data from previous marketing campaigns to evaluate how the number of units sold grows linearly as the amount of money invested increases.

For example, for a desired 60,000 units sold and the previous campaign data in the table below, the function should return the float 250000.0.

Previous campaigns

| Campaign | Marketing expenditure | Units sold |
|---|---|---|
| #1 | 300,000 | 60,000 |
| #2 | 200,000 | 50,000 |
| #3 | 400,000 | 90,000 |
| #4 | 300,000 | 80,000 |
| #5 | 100,000 | 30,000 |

Python 3.7.4, Pandas 0.25.1, Numpy 1.16.5, Scipy 1.3.1, Scikit-learn 0.21.3

Test cases:

• Linear dependency without error.
• Linear dependency with error.
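A sketch of one possible approach, under two assumptions: the parameter names below are hypothetical, since the question does not show the function's signature, and the relationship is modeled as units sold depending linearly on expenditure, as the question describes. The fitted line is then inverted to solve for the expenditure.

```python
import numpy as np


def desired_marketing_expenditure(marketing_expenditure, units_sold, desired_units_sold):
    # Least-squares fit of: units_sold = slope * expenditure + intercept.
    slope, intercept = np.polyfit(marketing_expenditure, units_sold, 1)
    # Invert the fitted line to get the expenditure for the desired units.
    return float((desired_units_sold - intercept) / slope)


result = desired_marketing_expenditure(
    [300000, 200000, 400000, 300000, 100000],
    [60000, 50000, 90000, 80000, 30000],
    60000,
)
print(round(result))  # 250000
```

On this data the fit is a slope of about 0.2 units per unit of money and an intercept of about 10,000 units, so 60,000 units corresponds to (60,000 - 10,000) / 0.2 = 250,000. Regressing expenditure on units instead would give a slightly different figure, which is why the direction of the fit matters here.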

## 7. Stock Prices

Correlation, NumPy, Pandas

Hard

You are given a list of tickers and their daily closing prices for a given period.

Implement the most_corr function that, when given each ticker's daily closing prices, returns the pair of tickers that are the most highly (linearly) correlated by daily percentage change.

Python 3.7.4, Pandas 0.25.1, Numpy 1.16.5, Scipy 1.3.1, Scikit-learn 0.21.3
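A sketch of one possible implementation. The input format is an assumption, since the question does not show it: prices is taken to be a DataFrame with one column per ticker and one row per day. "Most highly correlated" is read here as the largest Pearson coefficient; using absolute values would be a different, equally defensible reading. The sample prices are invented for the demonstration.

```python
import numpy as np
import pandas as pd


def most_corr(prices):
    # Daily percentage change; the first row has no previous day, so drop it.
    returns = prices.pct_change().dropna()
    corr = returns.corr()
    # Work on a copy so self-correlations (always 1) can be masked out.
    values = corr.to_numpy(copy=True)
    np.fill_diagonal(values, -np.inf)
    row, col = np.unravel_index(np.nanargmax(values), values.shape)
    return corr.index[row], corr.columns[col]


# Invented sample data: C is always twice A, so their daily changes match.
prices = pd.DataFrame({
    "A": [100.0, 102.0, 101.0, 105.0, 107.0],
    "B": [50.0, 49.0, 51.0, 50.0, 52.0],
    "C": [200.0, 204.0, 202.0, 210.0, 214.0],
})
pair = most_corr(prices)
print(pair)
```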