Description of the trend/domain
Machine Learning has emerged as a discipline in its own right, aimed at extracting insights from data through two processing steps:
- In the first step, the computer learns from existing data, identifying patterns without being explicitly programmed (Arthur Samuel). It receives feedback in the form of a score and tries to improve that score as it gains more experience, much like a child being educated.
- Then, the acquired knowledge is applied to make predictions about unknown data; that is, new/unseen data are classified based on the patterns extracted in the first step.
Humans are good at recognizing patterns, but with the help of high-performance processing, computers can perform these repetitive tasks far faster.
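These two steps can be sketched with a toy supervised learner: a minimal nearest-neighbor classifier in plain Python. All function names and data points here are illustrative assumptions, not any particular library's API; the point is only that "learning" stores experience and "predicting" applies it to unseen input.

```python
# Step 1: "learn" from labeled examples (here, learning is just storing them).
# Step 2: predict the label of unseen data from the closest known pattern.

def fit(samples):
    """Keep the labeled training samples as the learned 'experience'."""
    return list(samples)

def predict(model, point):
    """Classify an unseen point by the label of its nearest known neighbor."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(model, key=lambda s: distance(s[0], point))
    return nearest[1]  # the stored label

# Labeled training data: (features, label)
training = [((1.0, 1.0), "small"), ((1.2, 0.9), "small"),
            ((8.0, 9.0), "large"), ((9.1, 8.5), "large")]

model = fit(training)
print(predict(model, (1.1, 1.0)))   # → small
print(predict(model, (8.5, 9.2)))   # → large
```

Real learners generalize far beyond stored samples, but the fit/predict cycle is the same shape.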
As a field of artificial intelligence (AI), machine learning is built on a set of algorithms that can be classified into the following types, depending on the learning strategy, i.e., on whether a label indicating the target feature to infer is available:
- supervised learning: given a set of samples with the desired output present, the goal is to map the internal behavior of the system so that, given a set of inputs, it responds with the desired output (we do not always have the luxury of having these outputs). The learned behavior is then used to categorize unknown data.
- unsupervised learning: the algorithm tries to extract the structure of the input without any information about desired output labels, grouping samples with similar attributes.
- semi-supervised learning: a combination of the previous two techniques, used when output labels are available for only part of the training data.
- reinforcement learning: there are no labels at all, but rewards that an AGENT can earn according to its behavior in an ENVIRONMENT. Consider a video game: the environment can change over time, and the agent can perform specific actions in it (e.g., move, jump). These actions sometimes result in a reward (an increased score), and they transform the environment into a new state where the agent can perform another action, and so on. The idea is to maximize the reward, trying to achieve a goal through experience in this dynamic environment.
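The agent/environment loop of the last type can be sketched with tabular Q-learning on a made-up toy environment: a five-cell corridor where the agent earns a reward only on reaching the rightmost cell. The environment, parameter values, and names below are illustrative assumptions, not any standard benchmark.

```python
import random

# A 5-cell corridor: the agent starts in cell 0, reward on reaching cell 4.
N_STATES = 5
ACTIONS = [+1, -1]            # actions: move right or move left
GOAL = N_STATES - 1

# Q-table: learned estimate of future reward for each (state, action) pair.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1   # learning rate, discount, exploration

random.seed(0)
for episode in range(200):
    state = 0
    while state != GOAL:
        # Explore occasionally; otherwise exploit the best action known so far.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt = min(max(state + action, 0), GOAL)   # environment transition
        reward = 1.0 if nxt == GOAL else 0.0      # reward only at the goal
        # Move the estimate toward the reward plus discounted future value.
        best_next = max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = nxt

# The greedy policy learned purely from experience: best action per state.
policy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)]
print(policy)   # after training, the agent should move right in every state
```

No state is ever labeled with the "correct" action; the preference for moving right emerges only from the delayed reward, which is exactly what distinguishes this type from supervised learning.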
Depending on the problem being modeled, we can draw another classification. The best-known techniques are listed below, as shown in the following figure, but it is important to capture the essence of the problem before deciding which one to apply:
- Linear regression: given a point cloud, it finds the best-fitting line, and this parametric description is then used to predict unknown values. The model need not be linear.
- Classification: given an object described by a feature vector, predict the class the object belongs to, using a model previously trained on labeled objects.
- Clustering: grouping similar objects together.
- Association rules: finding frequent itemsets; the market-basket problem is a particular instance of this kind of algorithm, where a subset of products can influence the purchase of other articles.
- NLP (Natural Language Processing) and text analytics: using preprocessing techniques to analyze unstructured text, e.g., word2vec/GloVe embeddings and sentiment analysis.
- Outlier detection: finding uncommon or rare samples within the dataset.
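As a minimal sketch of the first technique in the list, an ordinary least-squares line can be fitted to a small point cloud and then used to predict a value for an unseen input. The data points below are made up for illustration.

```python
# Fit y = a*x + b by ordinary least squares over a small point cloud,
# then use the fitted line to predict a value for an unseen x.

def fit_line(points):
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

points = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8), (4, 9.1)]
a, b = fit_line(points)
print(round(a, 2), round(b, 2))   # → 1.99 1.04
print(round(a * 5 + b, 2))        # predicted y for the unseen x = 5 → 10.99
```

The parametric description (a, b) is the "model" in the sense used above: a compact summary of the training data that generalizes to inputs never seen before.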
Relation with other areas or techniques
Techniques, tools and methods for this field are illustrated in the following figure; there are many of them (see the figure in context), and they relate to several well-known areas of knowledge:
- DATA-SCIENCE: the set of principles, processes and techniques that enable the extraction of knowledge from large amounts of data. Techniques and theories from statistics, machine learning, data mining and visualization are used within data science to uncover insights in the form of patterns and correlations. The resulting understanding is captured as a model, then deployed as a data product or as a visualization artifact.
- BIG-DATA: perceived as the accumulation of data from which to find insights, and the opportunity to exploit that data through computational resources and practical ML applications. The convergence of these aspects gives us a unique opportunity to exploit this knowledge, as never before, for the benefit of an organization.
- DATA-ANALYSIS: through a pipeline of steps, it guides the data scientist in applying a technique to a problem. Data acquisition, data wrangling and data exploration are carried out before establishing a model for the problem, a step where machine learning can be used. Whatever technique is selected, the generated model needs to be validated.
- DATA-MINING: digs through the dataset to find patterns and relations. These techniques can use unsupervised learning; data mining can be applied first, followed by a supervised machine-learning technique over the extracted patterns.
- STATISTICS: tries to build a model by assuming a probability distribution for the sample. Both statistics and machine learning try to model reality; however, statistical models are more concerned with understanding the data-generation process, whereas ML techniques are more concerned with producing correct outputs (meaning the internal model may not be fully comprehensible, depending on the technique used).
- DEEP LEARNING: builds large-scale structures on the foundation of neural networks, considering different variations of the traditional model (multiple layers, convolutions, recurrence).
- PARALLEL/DISTRIBUTED: some techniques and algorithms take advantage of specialized processing to distribute their work across multiple cores, multiple machines or GPUs, reducing the time spent processing the data. They are sometimes based on principles of functional programming and on structures that join commodity hardware so that many machines are seen as one unit of computing (a CLUSTER), sometimes hosted in the CLOUD. Together with the falling cost of storage, these converging factors give today's data scientist a previously unimaginable capacity to exploit.
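The split/process/combine idea behind these frameworks can be shown in miniature: partition the data into chunks, process each chunk independently (the part a cluster would run on separate machines), then combine the partial results. The sketch below uses thread workers from the Python standard library purely to stand in for the machines of a cluster; it illustrates the pattern, not any particular framework.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(data, n_chunks):
    """Split the data into roughly equal chunks, one per worker."""
    size = (len(data) + n_chunks - 1) // n_chunks
    return [data[i:i + size] for i in range(0, len(data), size)]

def partial_sum_of_squares(chunk):
    """The 'map' step: an independent computation on one chunk."""
    return sum(x * x for x in chunk)

data = list(range(1, 1001))
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum_of_squares, chunked(data, 4)))

# The 'reduce' step: combine the partial results into the final answer.
total = sum(partials)
print(total)   # → 333833500
```

Because each chunk is processed without touching the others, the same map step could run on four machines instead of four threads with no change to the logic, which is what makes this style of decomposition attractive for a cluster.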