Machine Learning 

Description

Description of the trend/domain

Machine Learning has emerged as a discipline in its own right, extracting insights from a set of data through two processing steps:

  1. First, the computer learns from existing data, identifying patterns without being explicitly programmed to do so (Arthur Samuel's definition). It receives feedback in the form of a score and tries to improve that score as it gains experience, much like the education of a child.
  2. Then, it applies the acquired knowledge to make predictions about unknown data; that is, it classifies new/unseen data based on the patterns extracted in the first step.
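
The two steps above can be sketched in pure Python with a deliberately tiny learner, a 1-nearest-neighbour rule (an illustrative choice, not something prescribed by the text): "training" simply memorises labelled samples, and "prediction" classifies an unseen point by the closest known example.

```python
# Step 1: learn from existing data (here, simply store the labelled samples).
def train(samples):
    return list(samples)

# Step 2: apply the acquired knowledge to classify unseen data.
def predict(model, point):
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    features, label = min(model, key=lambda s: distance(s[0], point))
    return label

# Known patterns: small points labelled "low", large ones "high".
model = train([((1, 1), "low"), ((1, 2), "low"),
               ((8, 9), "high"), ((9, 8), "high")])
print(predict(model, (2, 1)))   # -> low
print(predict(model, (9, 9)))   # -> high
```

Real systems replace the memorisation step with a model that generalises, but the fit-then-predict shape stays the same.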

Humans are good at recognizing patterns, but with high-performance processing, computers can perform these repetitive tasks far faster.

As a field of artificial intelligence (AI), machine learning is built from a set of algorithms that can be classified into the following types, depending on the learning strategy, which in turn depends on whether a label indicating the target feature to infer is available:

  • Supervised learning: the algorithm is shown a set of samples with the desired output present; the goal is to map the internal behavior of the system so that, given a set of inputs, it responds with the desired output (we do not always have the luxury of having these outputs). The learned behavior is then used to categorize unknown data.
  • Unsupervised learning: the algorithm tries to extract the structure of the input but has no information about the desired output labels; it then tries to group similar attributes together.
  • Semi-supervised learning: a combination of the previous two techniques, used to deal with the absence of output samples when training the model.
  • Reinforcement learning: there are no labels at all, only rewards that an AGENT can earn according to its behavior in an ENVIRONMENT. Consider a video game, which can vary over time: the AGENT can perform specific actions in this environment (e.g., move, jump). These actions sometimes result in a reward (an increased score), and they transform the environment into a new state where the agent can perform another action, and so on. The idea is to maximize the reward, trying to achieve a goal through experience inside this dynamic environment.
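
The agent/environment/reward loop can be made concrete with a toy sketch (tabular Q-learning, an illustrative algorithm not named above): a hypothetical AGENT walks a 5-cell corridor ENVIRONMENT, earns a reward only at the goal cell, and gradually learns which action maximizes the long-term reward from each state.

```python
import random

random.seed(0)
N_STATES, GOAL, ACTIONS = 5, 4, (-1, +1)        # actions: move left / move right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2           # learning rate, discount, exploration

for episode in range(200):
    state = 0
    while state != GOAL:
        # Explore sometimes; otherwise exploit the best known action.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt = min(max(state + action, 0), N_STATES - 1)
        reward = 1.0 if nxt == GOAL else 0.0
        # Update the estimate of long-term reward for (state, action).
        best_next = max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = nxt

# After learning, the greedy policy heads right toward the reward in every cell.
policy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)]
print(policy)   # -> [1, 1, 1, 1]
```

The essential point is that no labelled examples appear anywhere: the agent discovers the rewarding behavior purely through trial, feedback, and state transitions.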

Depending on the problem being modeled, we can draw a further classification. The best-known techniques are listed below, as seen in the following figure, but it is important to capture the essence of the problem before deciding which one to apply:

  • Linear regression: given a point cloud, it tries to find the best-fitting line; with this parametric description, it can predict unknown values. The model is not necessarily linear.
  • Classification: given an object with a feature vector, the model tries to predict the class to which the object belongs, having previously been trained on known, labeled objects.
  • Clustering: grouping similar objects together.
  • Decision trees
  • Association rules: finding frequent itemsets. The market-basket problem is a particular instance of this kind of algorithm, where a subset of products can influence the purchase of other articles.
  • NLP (Natural Language Processing) and text analytics: using preprocessing techniques, we can analyze unstructured text (e.g., word2vec/GloVe, sentiment analysis).
  • Outlier detection: finding uncommon or rare samples inside the dataset.
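
As a concrete instance of the first technique, a minimal ordinary-least-squares linear regression can be written without any library; the point cloud below is invented purely for illustration. It fits y = a*x + b and then predicts a value outside the observed range.

```python
# Fit y = a*x + b to a point cloud using the closed-form least-squares solution.
def fit_line(points):
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # slope
    b = (sy - a * sx) / n                            # intercept
    return a, b

points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]   # roughly y = 2x
a, b = fit_line(points)
print(round(a, 2), round(b, 2))
print(a * 5 + b)          # predict the unknown value at x = 5
```

In practice a library such as scikit-learn does this (and far more general models), but the "fit parameters, then predict" essence is exactly this.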

 

 

Relation with other areas or techniques

Techniques, tools, and methods for this field are illustrated in the following figure; there are many of them (see the figure in context), and they relate to several known areas of knowledge:

(figure: techniques, tools, and methods related to machine learning)

  • DATA-SCIENCE: the set of principles, processes, and techniques that enable the extraction of knowledge from large amounts of data. Techniques and theories from statistics, machine learning, data mining, and visualization are used within data science to surface insights in the form of patterns and correlations. The resulting understanding is captured as a model, then deployed as a data product or a visualization artifact.
  • BIG-DATA: perceived as the accumulation of data in which to find insights, and the opportunity to exploit that data through computational resources and practical ML applications. The convergence of these aspects gives us a unique opportunity to exploit this knowledge, for the benefit of a company, as never before.
  • DATA-ANALYSIS: through a pipeline of steps, it guides the data scientist in applying a technique to a problem. Data acquisition, data wrangling, and data exploration are performed before establishing a model for the problem, for which machine learning can be used. Whatever technique is selected, the resulting model needs to be validated.
  • DATA-MINING: digs through the dataset to find patterns and relations. These techniques can use unsupervised learning; we can apply data mining first and then use a supervised machine learning technique over the extracted patterns.
  • STATISTICS: tries to achieve a model by assuming a probability distribution for the sample. Both statistics and machine learning try to model reality; however, statistical models are more concerned with understanding the data-generation process, whereas ML techniques are more concerned with producing the correct outputs (meaning the internal model may not be fully comprehensible, depending on the technique used).
  • DEEP LEARNING: takes neural networks as its basis and builds large-scale structures from them, considering different variations on the traditional model (multiple layers, convolutions, recurrence).
  • PARALLEL/DISTRIBUTED: in some cases, techniques and algorithms take advantage of specialized processing to distribute their work across multiple cores, multiple machines, or GPUs, reducing the time spent processing the data. This is sometimes based on principles of functional programming, and on structures that join commodity hardware so that many machines appear as one unit of computing (a CLUSTER), sometimes hosted in the CLOUD. Together with the falling price of storage, these converging factors give the data scientist a previously unimaginable capacity to exploit.
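
The DATA-MINING bullet's idea of mining first and applying a supervised technique afterwards can be sketched in miniature: a toy 2-means clustering discovers two groups in unlabelled values (all invented for this example), and a nearest-centroid rule then classifies new data against the patterns that were extracted.

```python
values = [1.0, 1.2, 0.8, 8.0, 8.5, 7.9]    # unlabelled input data

# Unsupervised step: iterate 2-means until the centroids settle.
c1, c2 = min(values), max(values)
for _ in range(20):
    g1 = [v for v in values if abs(v - c1) <= abs(v - c2)]
    g2 = [v for v in values if abs(v - c1) > abs(v - c2)]
    c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)

# Supervised-style step: assign an unseen value to the nearest learned centroid.
def classify(v):
    return "cluster-1" if abs(v - c1) <= abs(v - c2) else "cluster-2"

print(round(c1, 2), round(c2, 2))   # the two discovered group centers
print(classify(1.1))                # -> cluster-1
print(classify(9.0))                # -> cluster-2
```

The mining phase produced structure (the centroids) that the second phase reuses, which is the hand-off the bullet describes.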

Overview

Overview of the trend 

 

(sketch of what processes are affected and what the future state might look like)

  1. WHO the trend affects and who benefits: 
    • The entire business is affected by the application of insights within the business, and by task automation where applicable.
    • End users enjoy a better experience through custom recommendations.
    • Demand for people who know data science, not limited to the mathematics behind it but extending to business knowledge (see figure on the right, original source), is a fact to be analyzed. This kind of work in machine learning within data science requires specialized knowledge across several areas:
      • statistics, probability and mathematics
      • software development and hardware architectures
      • business knowledge of the domain modeled
    • As many well-known success stories show, machine learning is affecting our daily lives. Some examples:
      • Netflix (recommendation of movies/series for its users)
      • Amazon (recommendation of products for its users)
      • LegalRobot (natural language processing over legal documents, supporting attorneys)
      • The Analytics Edge, an MIT course at edX, which shows other applications of machine learning algorithms in industry.
  2. WHERE the trend applies: what sort of SDLCs for DevOps this would be most helpful in, and what environments it could pertain to.
  3. WHY the trend is important (benefits): machine learning techniques are a core part of the new ways of doing analytics inside the enterprise. Descriptive and diagnostic techniques help us understand past events in the business, but predictive and prescriptive techniques help us stay one step ahead of what is moving our business now, even as the events occur. The traditional software development lifecycle normally brings operational applications to the business, and it can be related to the delivery of a scientific product to the enterprise. However, to conceive an analytics product, some additional aspects need to be considered:
    • Business case evaluation, data identification, acquisition, and filtering: starting to work with the data and defining the purpose of the investment.
    • Extraction, validation/cleansing, aggregation, and representation of the data needed for the analysis: pre-processing the sources of information and performing the necessary joins to prepare it for analysis.
    • Analysis, visualization, and finally utilization: applying different techniques and algorithms, including ML techniques, to conceive a model and build a product that benefits the business.
  4. Use cases
  5. Methods of implementation, or description of the new process state
  6. How does the architecture change? Or the organization?
  7. Systems required, views/viewpoints: technology, info/data architecture, business architecture as applicable
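
The acquisition-to-utilization steps listed above can be sketched end to end in a few lines; the sales records, field names, and cleansing rule here are entirely hypothetical, chosen only to show each stage doing its job.

```python
# Acquisition: raw records as they might arrive from a source system.
raw = [
    {"region": "north", "amount": "120"},
    {"region": "north", "amount": "n/a"},    # invalid value, dropped below
    {"region": "south", "amount": "80"},
    {"region": "south", "amount": "100"},
]

# Validation/cleansing: keep only rows whose amount parses as a number.
clean = [
    {"region": r["region"], "amount": float(r["amount"])}
    for r in raw if r["amount"].replace(".", "", 1).isdigit()
]

# Aggregation & representation: total amount per region.
totals = {}
for row in clean:
    totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]

# Analysis/utilization: e.g. rank regions for a report or as model input.
ranking = sorted(totals, key=totals.get, reverse=True)
print(totals)     # -> {'north': 120.0, 'south': 180.0}
print(ranking)    # -> ['south', 'north']
```

In a real pipeline each stage would be its own tool (ingestion jobs, wrangling scripts, a modeling framework), but the hand-offs between stages follow this same shape.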

Best Practices

Best Practices (or case studies)

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi pretium lacus vitae elit tincidunt venenatis. Vivamus eget nibh vel nunc lacinia efficitur. Donec ac leo id ligula ullamcorper molestie dignissim sed velit. Integer ullamcorper turpis nec pulvinar accumsan. Aliquam accumsan efficitur varius. Nunc fermentum semper sem, sit amet imperdiet mauris interdum a. Praesent tristique iaculis sem et malesuada. Maecenas vitae scelerisque sem, ut interdum justo. Duis semper dui non est rhoncus pulvinar.

Aenean et consequat urna, ac efficitur velit. Pellentesque fringilla facilisis libero ac congue. Nulla ut dui sed risus tincidunt tempus non eu metus. Sed et feugiat enim. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Proin ultricies leo et ex elementum, sed luctus erat venenatis. Mauris tincidunt at ipsum quis auctor.

Fusce non elit metus. Nullam condimentum, dolor nec semper fringilla, mauris dui tristique tellus, sed feugiat dolor enim non lectus. Donec fringilla vestibulum enim in bibendum. In molestie eros non hendrerit ultrices. Etiam sed mauris nec est congue congue. Fusce gravida viverra dictum. Nunc porttitor tristique dignissim. Donec egestas enim massa, at feugiat sapien sagittis at. Cras at posuere nibh, ut aliquam enim. Sed feugiat tellus libero, nec pellentesque quam pharetra vel. Morbi eget consequat neque, in posuere felis. Aliquam quis vulputate lacus. Vestibulum auctor, purus id pretium placerat, enim purus ullamcorper lectus, non elementum nibh tortor nec enim. Nam ut bibendum ex. Nam pellentesque molestie dolor vel porttitor. Vivamus consectetur maximus lorem sit amet efficitur.

Capabilities

Capabilities

How the trend is implemented as a set of either IT or business-supporting capabilities

How we measure CITA proficiencies here: what would it mean to be a CITA-S or CITA-P in this area (having done it, able to talk about it, able to frame an initiative to implement it, etc.)

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi pretium lacus vitae elit tincidunt venenatis. Vivamus eget nibh vel nunc lacinia efficitur. Donec ac leo id ligula ullamcorper molestie dignissim sed velit. Integer ullamcorper turpis nec pulvinar accumsan. Aliquam accumsan efficitur varius. Nunc fermentum semper sem, sit amet imperdiet mauris interdum a. Praesent tristique iaculis sem et malesuada. Maecenas vitae scelerisque sem, ut interdum justo. Duis semper dui non est rhoncus pulvinar.

Aenean et consequat urna, ac efficitur velit. Pellentesque fringilla facilisis libero ac congue. Nulla ut dui sed risus tincidunt tempus non eu metus. Sed et feugiat enim. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Proin ultricies leo et ex elementum, sed luctus erat venenatis. Mauris tincidunt at ipsum quis auctor.

Resources

For further review (Resources)

  1. Stanford's traditional machine learning course at Coursera
  2. Johns Hopkins University's Data Science Specialization at Coursera
  3. "Why is machine learning hard" by S. Zayd Enam
  4. Any specific technology leaders, white papers, slideshares (e.g., Chef/Puppet frameworks, ARM in Azure for DevOps)
  5. Framework for assessing an organization's readiness for this trend (like the IO model: basic, standardized, rationalized, dynamic?)
  6. A good article about reinforcement/deep learning
  7. Courses on the edX platform exploring techniques over the Apache Spark framework
  8. Kaggle: a well-known site that hosts data science competitions
  9. KDnuggets: interesting articles and news about data-science-related technologies
  10. Technologies
    1. Apache Spark
    2. Python/Anaconda with scikit-learn
    3. R/RStudio
    4. Apache Hadoop and well-known distributions (Cloudera, Hortonworks, IBM BigInsights)
    5. Google TensorFlow
    6. H2O.ai, complementing Apache Spark

Author

Andrés Hurtado (https://co.linkedin.com/in/andhdo)