Is Alpha
Word Count
Read First Line
Read Write Execute
Log Patch
Category Tree
Chain Link
Cheapest Product
Employee Manager
Merge Stock Index
Movies Live
Student Activities
Youngest Child
Log Parser
Language Teacher
Manager Sales
Average Salary
Movie Genres
Auto Show
Book Sale
Moving Total
Reward Points
Tuple Slice
Unique Product
Internal Nodes
Paper Strip
Kilometer Converter
Medical Record
Date Transform
Numbers to Text
Cargo Ship
Unique Numbers
Max Sum
Crop Ratio
Class Grades
Age and Earnings
Cubic Approximation
Credit Score
Patient Classification
Distribution Fitting
Wine Quality
Credit Wizard
Median Height
Clean CSV
Birthday Cards
Free Throws
Student Rankings
Welfare Organization
Bacterial Growth
Student Max Score
Python is a widely used, high-level, general-purpose, interpreted, dynamic programming language. Having a basic familiarity with the programming language used on the job is a prerequisite for quickly getting up to speed.
Everyone makes mistakes. A good programmer should be able to find and fix a bug in their own or someone else's code.
A programmer should use a language as a tool, always taking advantage of language-specific data types and built-in functions.
A list comprehension is a syntactic construct for creating a list based on existing lists. As this is a common task, every programmer should be familiar with it.
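A minimal sketch of the construct: building a new list from an existing sequence, with a filter, in one expression.

```python
# Squares of the even numbers from 0-9, built in a single expression.
squares = [n * n for n in range(10) if n % 2 == 0]
print(squares)  # → [0, 4, 16, 36, 64]
```

The equivalent loop would need three statements (create, filter, append); the comprehension says the same thing declaratively.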
The string data structure is used to represent text. It is one of the most commonly used data structures. Therefore, every programmer should be skilled at string manipulation.
Arithmetic is a fundamental branch of mathematics. An understanding of arithmetic concepts, and their application, is important for every candidate.
Exceptions exist in most modern programming languages, making it important for a programmer to understand them and know how to handle them.
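A small Python sketch of structured exception handling, using a hypothetical `safe_divide` helper:

```python
def safe_divide(a, b):
    """Return a / b, or None when b is zero."""
    try:
        return a / b
    except ZeroDivisionError:
        # The error is caught here instead of crashing the caller.
        return None

print(safe_divide(10, 4))  # → 2.5
print(safe_divide(1, 0))   # → None
```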
Monkey patching is a method of adding new functionality, or overriding existing functionality, without creating a new type. As such, it's an important tool for developers to be familiar with.
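A minimal illustration with a made-up `Greeter` class: the method is replaced on the class at runtime, with no subclass involved.

```python
class Greeter:
    def greet(self):
        return "Hello"

def shout(self):
    # Replacement behavior, patched in at runtime.
    return "HELLO!"

# The monkey patch: reassign the attribute on the existing class.
Greeter.greet = shout

print(Greeter().greet())  # → HELLO!
```

Every instance, including ones created before the patch, picks up the new behavior, which is both the power and the danger of the technique.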
SQL is the dominant technology for accessing application data, and database access often becomes the performance bottleneck as an application scales. Given its dominance, SQL is a crucial skill for all engineers.
Conditional statements are a feature of most programming and query languages. They allow the programmer to control what computations are carried out based on a Boolean condition.
The SELECT statement is used to select data from a database. It is the most used SQL command.
A dictionary (or associative array) is a data type composed of a collection of key-value pairs, where each possible key appears at most once in the collection. It is used when we need to access items by their keys.
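A short Python sketch of the key-value idea, counting words as an example:

```python
# Word -> count mapping; each key appears at most once.
counts = {}
for word in ["red", "blue", "red"]:
    counts[word] = counts.get(word, 0) + 1

print(counts["red"])   # lookup by key → 2
print(counts["blue"])  # → 1
```

Assigning to an existing key updates it rather than adding a duplicate, which is exactly the "each key appears at most once" property.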
A linked list is a linear collection of data elements where each element points to the next. It is a data structure consisting of a collection of nodes which together represent a sequence. It is usually used for advanced scenarios where we need fast access to the next element, or when we need to remove an element from anywhere in the collection.
An aggregate function is typically used in database queries to group together multiple rows to form a single value of meaningful data. A good programmer should be skilled at using data aggregation functions when interacting with databases.
Subqueries are commonly used in database interactions, making it important for a programmer to be skilled at writing them.
Knowing how to order data is a common task for every programmer.
LEFT JOIN is one of the ways to merge rows from two tables. We use it when we also want to show rows that exist in one table, but don't exist in the other table.
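A runnable sketch using Python's stdlib `sqlite3` and a hypothetical customers/orders schema; the customer with no orders still appears in the result, with NULL in the order columns.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders(id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 99.5);
""")

# LEFT JOIN keeps Grace even though she has no matching order row.
rows = con.execute("""
    SELECT c.name, o.total
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id
""").fetchall()

print(rows)  # → [('Ada', 99.5), ('Grace', None)]
```

With an INNER JOIN, the `('Grace', None)` row would be dropped.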
The UNION operator is used to combine the result-set of two or more SELECT statements. It is often used when a report needs to be made based on multiple tables.
The GROUP BY statement groups rows by some attribute into summary rows. It is a common command when making various reports.
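A minimal example via `sqlite3`, with a made-up sales table: one summary row per region.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales(region TEXT, amount REAL);
    INSERT INTO sales VALUES ('north', 10), ('north', 5), ('south', 7);
""")

# GROUP BY collapses the detail rows into one summary row per region.
rows = con.execute("""
    SELECT region, SUM(amount)
    FROM sales
    GROUP BY region
    ORDER BY region
""").fetchall()

print(rows)  # → [('north', 15.0), ('south', 7.0)]
```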
Even though most database insert queries are simple, a good programmer should know how to handle more complicated situations like batch inserts.
A normalized database is normally made up of multiple tables. Joins are, therefore, required to query across multiple tables.
Familiarity with data serialization to and from formats such as XML and JSON is important, as serialization is commonly used for interprocess communication.
Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. The design goals of XML emphasize simplicity, generality, and usability across the Internet. This is one of the most used formats for exchanging data over the web.
A regular expression (regex) is a special text string for describing a search pattern. It is a common way for extracting data from text.
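A quick sketch with Python's `re` module: a pattern with one capture group per field extracts an ISO date from free text.

```python
import re

text = "Order #1042 shipped on 2023-05-17"
# \d{4}-\d{2}-\d{2} matches an ISO date; the parentheses capture each field.
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
print(match.groups())  # → ('2023', '05', '17')
```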
JSON is an open-standard format that uses human-readable text to transmit data objects consisting of attribute-value pairs. It's the most common data format used for asynchronous browser/server communication.
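The round trip in Python's stdlib `json` module: serialize an object to text, then parse it back.

```python
import json

# dumps: Python object → JSON text; loads: JSON text → Python object.
payload = json.dumps({"id": 7, "tags": ["a", "b"]})
data = json.loads(payload)
print(data["tags"])  # → ['a', 'b']
```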
Every programmer should be familiar with data-sorting methods, as sorting is very common in data-analysis processes.
In object-oriented programming, inheritance is the mechanism of basing a class upon another class, retaining similar implementation. Inheritance allows programmers to reuse code and is a must know topic for every programmer who works with OOP languages.
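A minimal Python sketch with made-up classes: the subclass reuses the parent's constructor and overrides one method.

```python
class Animal:
    def __init__(self, name):
        self.name = name

    def speak(self):
        return f"{self.name} makes a sound"

class Dog(Animal):
    # Inherits __init__ from Animal; overrides speak.
    def speak(self):
        return f"{self.name} barks"

print(Dog("Rex").speak())  # → Rex barks
```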
Object-oriented programming is a paradigm based on encapsulating logic and data into objects, which may then contain fields and procedures. Many of the most widely used programming languages are based on OOP, making it a very important concept in modern programming.
A CTE (Common Table Expression) is a temporary result set that can be referenced within another SELECT, INSERT, UPDATE, or DELETE statement. Recursive CTEs can reference themselves, which enables developers to work with hierarchical data.
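A recursive-CTE sketch via `sqlite3`, walking a hypothetical employee hierarchy from the root down:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE employees(id INTEGER, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES (1, 'CEO', NULL), (2, 'VP', 1), (3, 'Dev', 2);
""")

# Anchor: the row with no manager; recursive step: everyone who
# reports to someone already in the chain.
rows = con.execute("""
    WITH RECURSIVE chain(id, name, depth) AS (
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.id, e.name, c.depth + 1
        FROM employees e JOIN chain c ON e.manager_id = c.id
    )
    SELECT name, depth FROM chain ORDER BY depth
""").fetchall()

print(rows)  # → [('CEO', 0), ('VP', 1), ('Dev', 2)]
```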
The CASE expression is SQL's conditional construct. It evaluates a list of conditions in order and returns the value associated with the first condition that is true.
When designing and/or analyzing an algorithm or data structure, it is important to consider the performance and structure of an implementation. Algorithmic thinking is one of the key traits of a good programmer, especially one working on complex or performance-critical code.
A set is a collection of distinct objects. It's one of the most used types of collection, alongside arrays, lists, and maps. There are many set implementations, each with its own optimizations and use cases. It is, therefore, one of the most important collections for a developer to be familiar with.
A tuple is an ordered, immutable collection. It is a common collection type in many programming languages.
Lists are collections that act as dynamic arrays. They combine the flexibility of dynamically sized storage with the simple indexed access of arrays, and they perform well in most everyday scenarios.
A tree is a hierarchical structure defined recursively starting with the root node, where each node is a data structure consisting of a value, together with a list of references to other nodes (the "children"). A lot of problems can be solved efficiently with trees, which makes them important for developers.
Method overriding, in object-oriented programming, is a language feature that allows a subclass to provide a specific implementation of a method that is already provided by one of its parent classes.
A stream is a sequence of data elements made available over time. It is particularly useful for tasks that may benefit from being asynchronous, including tasks such as I/O processing or reading from a file, and as such is important for developers to understand.
A queue is a collection of items that are maintained in a sequence and can be modified by the addition of entities at one end of the sequence and removal from the other end of the sequence. It is the collection of choice when first-in-first-out (FIFO) behavior is needed.
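A quick sketch using `collections.deque`, which gives constant-time appends and pops at both ends:

```python
from collections import deque

queue = deque()
queue.append("first")   # enqueue at the back
queue.append("second")
front = queue.popleft()  # dequeue from the front: FIFO order
print(front)  # → first
```

A plain list would also work, but `list.pop(0)` shifts every remaining element, so `deque` is the idiomatic choice for queues in Python.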
Named Tuple is a tuple where each value has a preassigned name. It allows accessing values not just by index, but also by name. Among other things, it can increase the readability and maintainability of the code.
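In Python this is `collections.namedtuple`; a made-up `Point` type shows access by name and by index:

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])
p = Point(3, 4)
print(p.x, p[1])  # by name or by index → 3 4
```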
Iteration is the act of repeating a process, or cycling through a collection. Iteration is one of the fundamental flow control tools available to developers.
Integer division is division in which the fractional part (remainder) is discarded. Knowing this is important for optimal implementation of some algorithms and for avoiding common bugs.
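The Python operators, with the negative-operand case that commonly causes bugs: `//` floors toward negative infinity rather than truncating toward zero.

```python
print(7 // 2)    # → 3
print(-7 // 2)   # → -4  (floored, not truncated to -3)
print(7 % 2)     # → 1   (remainder; sign follows the divisor)
```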
The Python programming language and its libraries contain a lot of functionality that's useful to data scientists. Powerful libraries like Numpy, Pandas, and Scipy are valuable tools for data scientists who use Python.
Grouping is the process of separating items into different groups. Developers and data scientists often need to group data so they can examine them separately.
NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. NumPy is an essential library for any data scientist who works with Python.
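A minimal taste of the array model, assuming NumPy is installed: operations are vectorized over whole arrays instead of written as Python loops.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
# Element-wise arithmetic and axis-wise reductions, no explicit loops.
print(a * 10)         # → [[10 20] [30 40]]
print(a.sum(axis=0))  # column sums → [4 6]
```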
Pandas is a library for the Python programming language that’s used for data manipulation and analysis. It is an essential library for any data scientist who works with Python.
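A small sketch of the core DataFrame workflow, assuming pandas is installed, with made-up temperature data:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Oslo", "Oslo", "Lima"],
                   "temp": [2, 4, 20]})

# Split rows by city, then average each group.
means = df.groupby("city")["temp"].mean()
print(means["Oslo"])  # → 3.0
```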
When you need to discover the information hidden in vast amounts of data, or make smarter decisions to deliver better products, data scientists hold the key to the answers you need.
Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring within a fixed interval of time and/or space, if these events occur with a known average rate and independently of the time since the last event. As one of the most widely used distributions, it is important for all Data Scientists to be familiar with it.
Probability theory is the foundation of most statistical and machine-learning algorithms.
Linear regression is one of the most frequently used methods for data analysis due to its simplicity and applicability to a wide variety of problems.
Machine learning is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. It’s important for all tasks where it’s infeasible to construct conventional algorithms, which is often the case in Data Science.
Nonlinear regression is a form of regression analysis in which observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables. Since many problems are not linear, nonlinear regression is important for machine learning practitioners.
Scikit-learn (or sklearn) is a machine learning library for the Python programming language. Every data scientist who works with Python and tasks such as classification, regression, and clustering algorithms should know how to use it.
Classification is the problem of identifying to which set of categories a new observation belongs, on the basis of a training set of data containing observations whose category membership is known. As one of the common tasks in machine learning, it’s important for all data scientists.
An important Data Science algorithm, the k-nearest neighbors algorithm is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression.
A receiver operating characteristic curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate against the false positive rate at various discrimination thresholds. It is useful for selecting possibly optimal models and discarding suboptimal ones prior to committing to a decision threshold.
In a binary classification problem, a decision boundary or decision surface is a hypersurface that partitions the underlying vector space into two sets, one for each class. The classifier will classify all the points on one side of the decision boundary as belonging to one class and all those on the other side as belonging to the other class.
Binomial distribution is the discrete probability distribution of the number of successes in a sequence of independent yes/no experiments, each of which yields success with a given probability.
An important concept, p-value is defined as the probability of obtaining a result equal to or "more extreme" than what was actually observed, when the null hypothesis is true.
The Cauchy distribution is the distribution of the ratio of two independent zero-mean Gaussian random variables. As one of the most widely used distributions, it is important for all Data Scientists to be familiar with it.
Exponential distribution is the probability distribution that describes the time between events in a process in which events occur continuously and independently at a constant average rate. As one of the most widely used distributions, it is important for all Data Scientists to be familiar with it.
Normal distribution is a very common continuous probability distribution. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. As one of the most widely used distributions, it is important for all Data Scientists to be familiar with it.
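The stdlib `statistics.NormalDist` class models this directly; a made-up height distribution illustrates the CDF at the mean:

```python
from statistics import NormalDist

heights = NormalDist(mu=170, sigma=10)
# The distribution is symmetric about the mean, so P(X <= mu) = 0.5.
print(heights.cdf(170))  # → 0.5
```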
SciPy is a Python library used for scientific and technical computing. Every data scientist who uses Python as a programming language should know how to use it for tasks such as optimization, linear algebra, integration, etc.
Correlation is any statistical relationship, whether causal or not, between two random variables or two sets of data. As one of the fundamentals of Data Science, correlation is an important concept for all Data Scientists to be familiar with.
Multicollinearity is a phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy. In this situation the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data. As such, it’s important for all data scientists to check for collinear variables when looking at individual predictor variables in multiple regression models.
A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences. It is usually a tool for displaying an algorithm that contains only conditional control statements and is a must-know for every data scientist.
Data cleaning or data cleansing is the process of detecting and correcting (or removing) corrupt or inaccurate records. Data scientists should be familiar with it to avoid incorrect records that can affect analysis.
A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. Processing CSV files is a common task when working with tabular data.
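A minimal parsing sketch with the stdlib `csv` module; `DictReader` maps the header row onto each record.

```python
import csv
import io

raw = "name,score\nana,90\nben,85\n"
# DictReader uses the first line as field names for every record.
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["name"], rows[1]["score"])  # → ana 85
```

Note that every field comes back as a string; numeric columns need explicit conversion.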
Data aggregation is the process of gathering and summarizing information in a specified form. It is a common component of most statistical analysis processes.
Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points. This is basic knowledge for every data scientist.
The performance of an application or system is important. The responsiveness and scalability of an application are all related to how performant an application is. Each algorithm and query can have a large positive or negative effect on the whole system.