Need to practice your Data Science skills for an upcoming job interview? Try solving these Data Science interview questions that test knowledge of linear regression, machine learning, probability, and other skills. We’ll provide feedback on your answers, and you can use a hint if you get stuck.

These Data Science interview questions are examples of real tasks used by employers to screen job candidates such as data analysts, data scientists, statisticians, and others that need to analyze data, extract information, and suggest conclusions in order to support decision making.

**1. Pet Detection**

A classifier that predicts if an image contains only a cat, a dog, or a llama produced the following confusion matrix:

True values | ||||

Dog | Cat | Llama | ||

Predicted values | Dog | 14 | 2 | 1 |

Cat | 2 | 12 | 3 | |

Llama | 5 | 2 | 19 |

What is the accuracy of the model, in percentages?

**%**

**2. Petri Dish**

Two bacteria cultures, A and B, were set up in two different dishes, each covering 50% of its dish. Over 20 days, bacteria A's percentage of coverage increased to 70% and bacteria B's percentage of coverage reduced to 40%:

Which of the two bacterium's growth correlates more linearly with the number of days passed?

Approximately, what is the Pearson correlation coefficient of **bacteria B**'s coverage and the number of days passed?

If, after 20 days, **bacteria A**'s coverage starts to correlate less with its linear trend line, what can we say about the value of its Pearson correlation coefficient?

**3. Login Table**

A company stores login data and password hashes in two different containers:

*DataFrame*with columns:*Id*,*Login*,*Verified*.- Two-dimensional
*NumPy array*where each element is an array that contains:*Id*and*Password*.

Elements on the same row/index have the same *Id*.

Implement the function *login_table* that accepts these two containers and **modifies id_name_verified DataFrame in-place**, so that:

- The
*Verified*column should be removed. - The password from
*NumPy array*should be added as the last column with the name "*Password*" to*DataFrame*.

For example, the following code snippet:

```
id_name_verified = pd.DataFrame([[1, "JohnDoe", True], [2, "AnnFranklin", False]], columns=["Id", "Login", "Verified"])
id_password = np.array([[1, 987340123], [2, 187031122]], np.int32)
login_table(id_name_verified, id_password)
print(id_name_verified)
```

Should print:

Id Login Password 0 1 JohnDoe 987340123 1 2 AnnFranklin 187031122

- Example case: Wrong answer
- Column Verified is removed: Wrong answer
- Column Password is appended: Wrong answer
- Various DataFrames: Wrong answer

**4. Election Poll**

Each day during 2019 an agency asked a hundred randomly selected people which party they would vote for if elections were held that day. Results of the poll were recorded in the following file. The Workers' Party asked for the report which they plan to use to improve their strategy for upcoming elections.

Fill in the missing values in the report for 2019:

- The arithmetic mean of votes for the Workers' Party is:
*(rounded to one decimal place)* - The median of votes for the Workers' party is:
*(rounded to closest integer)* - The standard deviation of votes for the Workers' party is:
*(rounded to one decimal place)* - The difference between the largest and the smallest number of votes for the Workers' party for March is:
- The largest number of votes that any party received on any day is: votes.

That maximum was achieved on 2019-- by . - The party with the largest difference between the maximum and minimum number of votes is . That difference is votes.

**5. Iris Classifier**

As a part of an application for iris enthusiasts, implement the *train_and_predict *function which should be able to classify three types of irises based on four features.

The *train_and_predict *function accepts three parameters:

*train_input_features*- a two-dimensional NumPy array where each element is an array that contains: sepal length, sepal width, petal length, and petal width.*train_outputs*- a one-dimensional NumPy array where each element is a number representing the species of iris which is described in the same row of*train_input_features*. 0 represents Iris setosa, 1 represents Iris versicolor, and 2 represents Iris virginica.*prediction_features*- two-dimensional NumPy array where each element is an array that contains: sepal length, sepal width, petal length, and petal width.

The function should train a classifier using *train_input_features *as input data and *train_outputs* as the expected result. After that, the function should use the trained classifier to predict labels for *prediction_features *and return them as an iterable (like list or numpy.ndarray). The nth position in the result should be the classification of the nth row of the *prediction_features *parameter.

- Accuracy on the example case is higher or equal to 80%: Wrong answer
- Accuracy is higher or equal to 75% on data with noise: Wrong answer
- Accuracy is higher or equal to 85% on data with noise: Wrong answer

**6. AB Test**

Your company is running a test that is designed to compare two different versions of the company’s website.

Version A of the website is shown to 60% of users, while version B of the website is shown to the remaining 40%. The test shows that 8% of users who are presented with version A sign up for the company’s services, as compared to 4% of users who are presented with version B.

If a user signs up for the company’s services, what is the probability that she/he was presented with version A of the website?

**%**

**7. Dog Classification**

The following .csv file contains the data from a classifier model that predicts if an image contains a dog: predictions.csv

The first column contains information if the dog is in the image or not. The second column contains the classifier prediction, which is in the interval 0-100, with higher values meaning that the classifier is more confident that image contains a dog.

What is the value of the decision boundary that will maximize the accuracy of the model? Values greater than or equal to the decision boundary will be treated as positive.

**8. Marketing Costs**

Implement the *desired_marketing_expenditure* function, which returns the required amount of money that needs to be invested in a new marketing campaign to sell the desired number of units.

Use the data from previous marketing campaigns to evaluate how the number of units sold grows **linearly** as the amount of money invested increases.

For example, for the desired number of 60,000 units sold and previous campaign data from the table below, the function should return the float 250,000.

Previous campaigns

Campaign | Marketing expenditure | Units sold |
---|---|---|

#1 | 300,000 | 60,000 |

#2 | 200,000 | 50,000 |

#3 | 400,000 | 90,000 |

#4 | 300,000 | 80,000 |

#5 | 100,000 | 30,000 |

- Example case: Wrong answer
- Linear dependency without error: Wrong answer
- Linear dependency with error: Wrong answer

**9. Stock Prices**

You are given a list of tickers and their daily closing prices for a given period.

Implement the *most_corr* function that, when given each ticker's daily closing prices, returns the pair of tickers that are the most highly (linearly) correlated by **daily percentage change**.