Non-technical answers to FAQs on Big Data Analytics

FAQ

Data science solutions can seem overly technical and complicated at times. AlgoTactica strives to communicate these concepts in a clear and straightforward manner, so that clients feel comfortable and reassured about the solutions we provide. If the FAQs below do not address your query, please do not hesitate to contact us for a dialogue.

What is predictive analytics?

The field of predictive analytics uses mathematical and statistical methods to extract information from data, providing the knowledge to empower confident business decision-making. These methods can organize enormous volumes of raw data and distill them into concise information components that can then be used to minimize risk exposure and optimize business performance.

By analyzing the data, these methods can assign a predictive score for each business entity under study, covering outcomes such as customer purchase propensity, future sales volume, or credit default potential. Typically, these scores are assigned by a mathematical model that has been trained on data reflecting the previous business experience of the company. The scores are then used to guide business decisions.
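
For illustration only (the library, data, and field names below are hypothetical choices, not a prescription), the following minimal Python sketch shows how a model trained on past customer behaviour can assign a purchase-propensity score to new customers:

    # Minimal sketch: scoring purchase propensity with scikit-learn.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical history: [recent_visits, past_purchases] per customer,
    # and whether each customer subsequently made a purchase (1) or not (0).
    X_history = np.array([[2, 0], [8, 3], [1, 0], [9, 5], [4, 1], [7, 4]])
    y_history = np.array([0, 1, 0, 1, 0, 1])

    model = LogisticRegression().fit(X_history, y_history)

    # Assign a predictive score (purchase propensity) to new customers.
    X_new = np.array([[3, 1], [8, 2]])
    scores = model.predict_proba(X_new)[:, 1]   # probability of the "buy" class
    print(scores)                               # higher score = more likely to buy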

How can smaller businesses benefit from Big Data?

For smaller businesses, it is possible to design an open-source data analytics platform tailored to their specific needs, without incurring the substantial software licensing costs normally associated with proprietary platforms from commercial vendors. The only direct costs are the purchase of computer hardware and the contracting of a data science consultancy under a firm fixed-price quotation.

A low-cost data analytics system can be built around the following data management areas of practice and their associated free, open-source technologies (a brief example of the analytics layer follows the list):

  • Data Stream Capture: social media channels and other real-time data streams can be captured and managed using Apache Spark Streaming or Apache Storm.
  • Data Storage: storage of Big Data across multiple computer systems can be achieved with the open-source Hadoop Distributed File System (HDFS), as well as open-source distributed databases such as Apache Cassandra and Apache HBase, amongst others.
  • Data Analytics: analytical and predictive algorithms can be designed and trained using the Apache Spark machine learning library, the H2O machine learning library, Microsoft R Open, and Python.
  • Data Visualization and User Interaction: graphical user interfaces for data display and user input can be designed using several library packages in Microsoft R Open, as well as JavaFX. Moreover, displays written in Java/JavaFX can directly call Microsoft R Open and invoke statistical analysis procedures on demand via the rJava interface.
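
As a brief, hypothetical illustration of the Data Analytics layer (the file path and column names are assumptions), the following PySpark sketch summarizes a sales data set stored in HDFS; Spark distributes the work across whatever cluster nodes are available:

    # Minimal PySpark sketch: summarize distributed sales data.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("small-business-analytics").getOrCreate()

    # Load sales records from HDFS; Spark partitions the work automatically.
    sales = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)

    # Aggregate revenue per product category in parallel across the cluster.
    sales.groupBy("category").sum("revenue").show()

    spark.stop()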

Read more on this topic

What is predictive marketing?

Essentially, Predictive Marketing strives to anticipate each customer's likely next need or wish, based on an in-depth understanding of their relationship with your business. This is often described as having a 360-degree view of each customer.

The 360-degree view enables the business to segment and target customers by their engagement history rather than by their demographics. It is particularly useful for running micro-campaigns, which generally yield higher ROIs than conventional larger-scale campaigns.

Read more on this topic

What is the relationship between data science, big data, and business intelligence?

Data science implements advanced analytical techniques drawn from the fields of mathematics, statistics, operations research, information science, and computer science. The objective of the analysis is to discover data patterns that provide a better understanding of a business process, so that future actions can be planned and executed with greater accuracy. Generally, data science methodologies can be applied to data sets of any size, from very small to very large.

Big data embodies the software technologies used to enable data science analytics when the data sets are too large for a single compute platform. In this situation, the analytics software partitions the large data set into an ensemble of smaller parcels, with each parcel assigned to an individual computer within a network of computers. The parcels are then all processed at the same time, using a memory-efficient parallel compute strategy. Typically, the technology is also supplemented by specialized database systems designed to maintain the integrity of the distributed data set by coordinating storage and retrieval actions involving the parcels. These are often referred to as NoSQL database architectures, and they employ table structures that are more flexible than those found in traditional databases.
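
The partition-and-process idea can be sketched with nothing more than the Python standard library (the data and the four-way split are arbitrary illustrative choices): a large data set is divided into parcels, and each parcel is summarized by a separate worker process at the same time.

    # Illustrative sketch of parallel processing over data parcels.
    from multiprocessing import Pool

    def summarize(parcel):
        # Each worker computes the mean of its own parcel independently.
        return sum(parcel) / len(parcel)

    if __name__ == "__main__":
        data = list(range(1_000_000))                     # the "large" data set
        parcels = [data[i::4] for i in range(4)]          # split into 4 equal parcels
        with Pool(processes=4) as pool:
            partial_means = pool.map(summarize, parcels)  # processed concurrently
        # Because the parcels are equal in size, the mean of the partial means
        # equals the overall mean.
        print(sum(partial_means) / len(partial_means))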

Business intelligence is a much broader term, typically referring to the overall process by which data is gathered, arranged in databases, analytically transformed for knowledge discovery, formatted for presentation, and then acted upon by the organizational end user. Big data and modern data science technologies are part of this, but it can also include older technologies based on traditional data warehousing concepts. The approach can combine data sets from separate knowledge domains, both internal and external to the organization, and can utilize a broad mix of legacy and modern software technologies. These components are merged according to a knowledge management policy that defines how they must be organized to provide the required added value for optimized decision making.

What does a data science solutions provider do?

A data science solutions provider specializes in delivering highly effective business advice based on data-driven insights derived from predictive models and analytical software technologies. The provider employs specialized technical skills and domain knowledge to identify big data opportunities that are aligned with the business goals of the client. During this process, advanced data science techniques are used to discover patterns in marketplace dynamics and customer behavior that enable the client to anticipate future business opportunities. Using this prior knowledge, the client can then execute actions that will strategically position their business to exploit these opportunities for optimum profit and competitive advantage.

The expertise offered by a data science provider typically involves a combination of skill sets, including business and marketing science, software design, database development, statistical analysis, and machine learning. Working as a multidisciplinary team, the resident specialists typically engage in the following activities, all focused on delivering maximum value to clients:

  • Design and implement machine learning software algorithms that are programmed to mine the big data stores and produce a detailed analysis for vetting by business and marketing experts.
  • Apply marketing and business expertise to interpret discovered trends and develop a data-driven business strategy for clients, which will enable them to leverage maximum benefit from findings.
  • Implement data hygiene protocols by cleaning and validating the data to ensure that it is accurate, complete, and internally consistent with respect to original collection methodologies.
  • Engage in data exploration and interpretation to identify the analytics opportunities and insights that offer the greatest benefit for the client, and apply data-driven techniques in developing solutions.
  • When dealing with limited data, apply statistical methods to identify business questions that the available data can answer, thus maximizing the information potential for the data set.
  • Collect large sets of structured and unstructured data from various sources, identify and reformat those of best quality, and then determine the variables with the best predictive power.
  • Design Java software architectures for managing distributed computing and database storage applications, and for networked communications involving data feeds from real-time sources.
  • Maintain comprehensive knowledge of a broad range of statistical and machine learning software tools, so that blended solutions can be designed using the best advantages of each tool.
  • Develop data visualization software strategies that intuitively communicate important findings to stakeholders from various non-technical backgrounds.

What are the different types of data analysis?

There are six main types of analysis that can be employed to extract value-added information from data. The list below presents them in increasing order of complexity; this does not reflect the order of usage in any given study, and in fact most studies will employ only a few of the methods rather than all of them. The types are summarized as follows, with a brief illustration of the first type after the list:

  • Descriptive Analysis: In this basic step, statistical properties of the data are calculated and explored via visualization graphics; this could also include an examination of distributional properties such as extreme values.
  • Diagnostic Analysis: This method focuses on detecting previously unknown relationships via cross-correlation analysis and cluster models that reveal interactions between variables in the data set.
  • Inferential Analysis: Here, properties of a small data sample are extrapolated to inferences about its overall population; error margins are also computed for the population values being inferred.
  • Predictive Analysis: This involves designing models that learn patterns by analyzing existing data examples. These patterns are then used for more accurate predictions about future outcomes.
  • Causal Analysis: The focus is to determine how one variable can alter another based on patterns in the data set; that is, it measures how one variable might change when another varies in value.
  • Mechanistic Analysis: This is similar to Causal Analysis, except that exact relationships between variables are captured by an explicit mathematical equation, rather than a model trained on data.
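
As a brief illustration of the first type above (the order values are invented for the example), a descriptive analysis in pandas might look like this:

    # Tiny descriptive-analysis sketch: summary statistics and extreme values.
    import pandas as pd

    orders = pd.DataFrame({"order_value": [12.5, 19.9, 7.0, 240.0, 15.5, 18.2]})

    print(orders["order_value"].describe())    # count, mean, std, min, quartiles, max
    print(orders.nlargest(2, "order_value"))   # inspect potential extreme values
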
How do I gauge the quality of a dataset?

Big Data algorithms cannot be any more accurate than the data used to train them. If the data is sub-par in any way, then decisions that are made based on the analysis will be inherently flawed.

Assess the quality of your data against the following criteria; a small completeness check is sketched after the list:

  • Validity: make sure that the data set has all the relevant input variables required for the analytical model to produce best results.
  • Completeness: determine the extent to which there are missing and/or incorrect entries, and estimate the level of effort needed for corrections.
  • Consistency: consider that business rules might have changed during the collection period, thus rendering earlier data inconsistent with later data.
  • Accuracy: ensure that the data is from a sample that is large enough to realistically represent the subject being modeled by the analytics.
  • Timeliness: confirm that any data from the distant past is not too outdated to be relevant, if it is to be used to make predictions about the future.
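
As a small, hypothetical illustration of the completeness check (the column names are assumptions), missing entries can be counted per column with pandas:

    # Illustrative completeness check: count and rate of missing entries.
    import pandas as pd

    records = pd.DataFrame({
        "customer_id": [101, 102, 103, 104],
        "region":      ["East", None, "West", "East"],
        "last_order":  ["2024-01-10", "2024-02-02", None, None],
    })

    print(records.isna().sum())              # missing entries per column
    print(records.isna().mean().round(2))    # fraction of rows affected per column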

Read more on this topic

At what point do I engage external data science providers?

Organizations that have previously realized value from spreadsheets and other similar small-scale analytical methods will ultimately develop a need to accomplish even more with the data sources available to them. As they aim for higher accuracy, faster processing, and more specific insights, the following are indicators of the need to move towards structured big data methodologies:

  • It takes too long to generate the desired ad hoc reports because volume and velocity (speed of arrival) of data overwhelms the current processing system and analytical methods.
  • Standard spreadsheet applications can no longer address the business questions with sufficient granularity, specificity, or timeliness.
  • There is a need to combine data from separate silos of different formats, different granularity, and different sources, to form multi-structured data sets.
  • Unstructured data needs to be mined for sentiment content, such as text messages, comments from social media streams, and contact logs, notes, or recordings from the call centre.

Overall, you make the shift when there is much more data than your existing software tools can manage. From realizing this need to establishing an in-house team, there is a transition period during which it is best to seek external help from seasoned experts. Smaller businesses may find it more cost-efficient to continue using external providers.

What are Predictive Model Factories?

Businesses today are routinely acquiring vast amounts of data, often from a broad range of sources. This presents tremendous opportunities for gaining unprecedented insights into matters critical for growth and profitability. However, traditional analytics cannot always accommodate the extremely large scale and rapid prototyping needed to get the strongest competitive advantage in the shortest amount of time possible.

Predictive Model Factories upscale the capabilities of big data analytics by enabling an extremely large number of individual models to be simultaneously trained in an automatic manner, without operator intervention. This permits a much greater number of models to be produced without requiring extra resources, and offers several benefits:

  • Model retraining cycles can be iterated much more frequently, especially during overnight sessions when computer resources are otherwise lightly used.
  • A business can develop a separate model for each customer, to predict future buying preferences or timing.
  • In a majority of cases, models for newly acquired customers can be initiated and trained automatically, without any manual intervention whatsoever.
  • Just-in-time data discovery is ensured in very large and continually refreshed data sets, allowing the analytic models to learn, adapt, and maintain predictive accuracy as new data is added.

As an example, Cisco Systems has deployed propensity-to-buy (P2B) models on a modest compute cluster of 4 computers, with 24 cores overall and 128GB of combined memory. This basic arrangement can calibrate 60,000 P2B models in a matter of hours, representing an overall efficiency gain that is 15x faster than their traditional methods.
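
A bare-bones sketch of the factory idea (entirely synthetic data; in practice each customer's own purchase history would be loaded) is to fit one small propensity-to-buy model per customer in an automated loop:

    # Illustrative model-factory loop: one model per customer, no manual steps.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    customers = [f"cust_{i}" for i in range(1000)]   # stand-in customer IDs
    models = {}

    for customer in customers:
        # Synthetic stand-in for the customer's interaction history:
        # 50 past interactions, 3 features, and a bought/did-not-buy outcome.
        X = rng.normal(size=(50, 3))
        y = (X[:, 0] + rng.normal(scale=0.5, size=50) > 0).astype(int)
        models[customer] = LogisticRegression().fit(X, y)

    print(len(models), "models trained automatically")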

In business, relevance and agility to adapt are key success drivers. Parallel computing and on-demand processing power enable a business to easily integrate these insights into the business workflow, and thus compete successfully.

What is machine learning and how is it used?

Machine learning is a data science method that uses existing data sets to train software algorithms to make predictions about future outcomes. During training, the algorithm discovers previously unknown patterns in the data, which enable it to construct rules as to why each event in the data set occurred. Once these rules have been built, the algorithm can be used to predict similar future events when it processes other data sets drawn from the same business problem area. Machine learning procedures typically produce predictions that are of high value to business planning decisions in many areas of application, including the following (the clustering case is sketched briefly after the list):

  • Regression to predict a future numerical value based on values of present observations. For example, using time series analysis to predict next week’s sales based on those for this week.
  • Classification to assign items to a group in which members have similar attributes. For example, grouping customers by product preference to identify those likely to respond to email advertising.
  • Unsupervised clustering to discover natural groups that can emerge from the data. For example, using demographic and socioeconomic data to discover natural segmentations in customer data.
  • Pattern discovery that correlates the purchase of one product with another. This is used in cross-sell engines which offer a possible new purchase item to a customer, based on earlier purchase patterns.
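
As a brief sketch of the clustering case (the ages and spend figures are invented), k-means can discover natural customer segments without any labels:

    # Minimal unsupervised-clustering sketch with scikit-learn.
    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical customers: [age, annual_spend]
    customers = np.array([[22, 300], [25, 350], [41, 1200],
                          [45, 1100], [63, 500], [60, 450]])

    segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(customers)
    print(segments)   # e.g. [0 0 1 1 2 2] - three discovered customer segments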

For additional information on uses of machine learning, please see the FAQ answer for ‘What are some typical applications for predictive models?’

What is metadata and how is it related to big data?

Metadata refers to small granules of data used to capture basic knowledge that provides a descriptive overview of very large volumes of data. It is higher-level information used to organize, locate, and otherwise manipulate extensive data sets when it is not practical to work directly with the data itself. Aspects that metadata can summarize about a large data set include its structure, content, quality, context, ownership, origin, and condition, amongst others. Because the metadata is much smaller than the data it describes, it acts as a search index to facilitate quick identification and retrieval of archived data sets that are being sought for a given business objective.
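
An illustrative metadata record (the field names are assumptions, not a standard) shows how a few lines of description can stand in for a multi-terabyte data set when searching an archive:

    # Hypothetical metadata record describing a large archived data set.
    dataset_metadata = {
        "name": "web_clickstream_2024",
        "owner": "marketing-analytics",
        "origin": "web server logs",
        "format": "parquet",
        "location": "hdfs:///archive/clickstream/2024/",
        "size_tb": 4.2,
        "fields": ["timestamp", "session_id", "page", "referrer"],
        "last_updated": "2024-06-30",
    }

    # Search the metadata, rather than the data itself, to find candidate sets.
    if "session_id" in dataset_metadata["fields"]:
        print("Candidate data set:", dataset_metadata["name"])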

In the modern era of big data, the role of metadata has become much more critical than in the past. It is now very important for businesses to manage continuously growing volumes of structured and unstructured data so that they can efficiently leverage it to maintain competitive advantage. For instance, semi-structured and unstructured data is often spread across many different storage devices and locations, can be stored in a diversity of formats, and is difficult to organize overall. Consequently, cost-effective usage and management can only be achieved if there exists a metadata oversight program that minimizes the time and expenditure associated with discovering the relevant in-house sets of big data.

As data volume and diversity continue to grow, each new big data project that is launched will face an ever-increasing efficiency imperative: the relevant data sets must be identifiable by searching a descriptive and accurate metadata layer. In fact, at the final reporting phase, metadata summaries can provide an audit trail that authenticates the quality of the source data from which the analytic findings are drawn.

How does AlgoTactica differentiate itself from larger data science providers?

Many large data science providers are focused on offering prepackaged solutions which involve combining commercial-off-the-shelf (COTS) analytic software products offered by multiple third-party vendors. Depending on the stated needs of the client, this might involve delivery of an analytics platform based on well-known COTS software components, delivery of COTS distributed database systems to replace existing legacy relational databases, or some other similar combination. In all these cases, the larger provider is actually performing in the role of a systems integrator, as opposed to an OEM provider of custom-designed analytics solutions.

Although the large provider might well have data science knowledge in-house, when it comes to addressing a client's need, that knowledge is likely to become focused on identifying prepackaged software that will ultimately prove to be only an approximate solution, and not necessarily the best fit. Given the scarcity of professionals with advanced data science skill sets, large providers focused on maximizing sales volume cannot acquire a talent pool large enough to facilitate detailed investigation of each client's specific data science need. This ultimately leads to proposed solutions that are commoditized to appeal to a large range of clients, even though the solution might not conform precisely to the exact needs of any individual client.

At AlgoTactica, we focus on investigating each client's data problem in detail, and then proposing the approach that will yield the optimal solution. After an exploratory data analysis (EDA) stage, we engage in a scientific due diligence process during which a uniquely relevant set of algorithmic candidates is evaluated against the client's data to identify the best one. Once it is identified, we can quickly build a one-of-a-kind customized product by compiling software from our in-house mathematical algorithm libraries. The ultimate solution is therefore designed specifically for the needs of the individual client, and does not attempt to be an omnibus solution for a commoditized marketplace.

The principals at AlgoTactica have graduate degrees in specialized fields of marketing science and engineering mathematics.  Furthermore, we have decades of combined experience in market development, design of analytics software, and data science involving machine learning and statistical analysis.  We have built our professional careers by maintaining an awareness of the latest innovative advances in our field, and then leveraging those innovations for delivery of highly-customized solutions to well-known industrial brand names.

How is data science used across the business organization?

Each engagement event between a business and a customer yields information that can be organized in a database, and subsequently analyzed to build an informative data model.  A professionally designed data analytics strategy will model the events, analyze them to learn from past and present business patterns, and then use this prior knowledge to predict future trends. By using this information, it is then possible to design a business plan that anticipates evolving customer preferences and uniquely identifies emerging opportunities in a dynamic marketplace.

The initial stage of the data analytics process involves the development of descriptive models, which are used to discover previously unknown relationships hidden in the business data record. Based on the insights acquired, predictive models are then designed which utilize these newly-discovered historical patterns, to make forecasts that answer questions about future outcomes.

What are some typical applications for predictive models?

As awareness grows regarding the benefits of predictive analytics, these methods are being applied to an increasingly broad range of business-related problems. Development of big data software technologies, coupled with the availability of more economical data processing hardware, is driving the application of predictive modelling across many industries, including health care, insurance, manufacturing, retail, and numerous others. Here are several examples of how predictive analytics is typically being used in industry:

  1. Churn Prevention: predictive analytics can help prevent customer churn by providing early warning of customers and customer segments most at risk of taking their business elsewhere.
  2. Optimizing Business Processes: retailers can optimize inventory acquisition based on predictions generated via data from previous purchase patterns, social media trends, and weather forecasts.
  3. Customer Lifetime Value: by identifying which customers are likely to spend the most money over the longest time, retailers can focus their marketing efforts to ensure that these people stay loyal.
  4. Propensity to Engage: email marketing is better targeted via use of propensity-to-engage models that identify customers most likely to click on an email link offering advertising promotions.
  5. Sentiment Analysis: natural language processing models detect social media commentary having negative sentiment towards a brand, so that action can be taken to improve product reputation.
  6. Next Sell Recommendation: analyze customer buying patterns in order to recommend a next item that a customer could be enticed to buy as a follow-on from one of their recent purchases.
  7. Brand-Based Segmentation: using analytics to classify customers according to which brands they prefer enables marketing to pitch new product releases precisely to the customers most likely to buy.
  8. Quality Assurance: quality issues and trends are detected before they become critical deficiencies that can diminish customer satisfaction and lead to decreased revenues due to lost market share.
  9. Predictive Maintenance: by analyzing maintenance data for capital equipment, future timelines for upkeep actions can be planned, thus avoiding loss of business due to unanticipated downtime.
  10. Insurance Pricing: underwriters use analytics to accurately price insurance policies in proportion to risk exposure, thereby avoiding potential losses caused by underpricing relative to level of risk.

What defines a model in data science?

A model typically consists of a mathematical procedure designed to answer a specific business question. Its role is to operate on business data so that relationships within the data can be used to anticipate future outcomes. Data-driven business decisions can then be made by acting on the knowledge acquired from those predictions.

In general terms, a model will make use of input variables that have the power to anticipate the outcome of some other variable that is dependent on them. As part of its design stage, a model will be subjected to a training process involving a data set that is used to teach it about relationships that exist within the data. During that training, the model will build internal mathematical rules that specify how it should formulate predictions when given new input data on which it has not previously been trained.

Once its mathematical rules are established, the reliability of the model is formally vetted by examining performance error measures and by further testing against additional data that was not used in training.  This is done to ensure that the model can generalize its learned rules in a way that enables it to perform accurately when exposed to previously unseen data.  In some instances, a model will learn rules that are not sufficiently flexible for generalization, and will exhibit very accurate performance when given data that was used to train it, but will show very poor performance on any other data. In such a case, the model is described as having been over-fitted, and will need to be retrained in a more careful fashion.
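
A minimal sketch of this vetting step (synthetic data; scikit-learn is used purely for illustration) trains on one portion of the data and measures error on held-out data the model has never seen:

    # Train/test split to check that a model generalizes beyond its training data.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 2))                                     # input variables
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.3, size=200)   # dependent variable

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=1)

    model = LinearRegression().fit(X_train, y_train)

    print("training error:", mean_absolute_error(y_train, model.predict(X_train)))
    print("held-out error:", mean_absolute_error(y_test, model.predict(X_test)))
    # A large gap between the two errors is the classic sign of over-fitting.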

What are the main types of learning used to train predictive models?

There are many ways in which data science models can be defined and categorized, especially with respect to mathematical complexity, time required for training, and underlying theoretical assumptions. An often-used categorization is by the four main types of learning that can be used to train them:

  1. Supervised Learning: in this method, the model is trained to recognize which value the dependent output variable will take in response to each combination of values taken by the independent input variables. In a typical training data set, each data row will consist of a dependent output value and a list of its associated independent input values along that row. During training, the model examines each row and accumulates mathematical structure defining the relationship between the independent input and dependent output variables. Because the dependent variable has been provided, the training data is often referred to as a labelled set. The dependent labels can be numerical values when models are trained to make quantitative predictions, or class labels when the model is trained to classify input data regarding membership in a range of possible classes.
  2. Unsupervised Learning: with this approach, the model is provided a training set in which the rows have values for the input variables, but these are not accompanied by any dependent output variable value. For this type of model, rather than being prompted to learn relationships enforced by labels, there is an alternate objective to discover relationships that will emerge naturally from the data, as the model iterates over the training set. This method can be used to identify naturally-occurring segmentation classes for which a labelling strategy can then be devised; in other cases, it can be used to evolve structures which reduce multi-dimensional data to a two-dimensional space where it can be more easily visualized.
  3. Semi-Supervised Learning: when labelled data is available in only small amounts, it is sometimes possible to supplement the training process by also using unlabelled data. There are several methods for doing this, but the most basic involves initially training the model on the labelled data only. After this, the trained model is used to assign labels to the previously unlabelled examples, along with a confidence estimate for each labelling action. Cases for which the model provided a high degree of confidence can then be added to the original training set, together with the labels assigned by the model, to form a larger labelled training set for a second learning cycle (a brief sketch of this approach follows the list).
  4. Reinforcement Learning: this is typically an iterative discovery process whereby, at each iteration, the model tries a new combination of mathematical parameter values and then receives feedback as to whether these values led to an improvement. The model incrementally explores a parameter space that is typically multidimensional, and uses an accuracy metric to determine if it is following the right path to an optimal solution; if the accuracy decreases along this path, then the model will try an alternate path. Although this is often identified as a separate learning method, it might be argued that it has significant overlap with the unsupervised learning category described above.
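
A bare-bones sketch of the self-training approach described in item 3 (synthetic data; the 0.9 confidence threshold and the set sizes are arbitrary illustrative choices):

    # Self-training: label high-confidence unlabelled cases, then retrain.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(2)
    X_labelled = rng.normal(size=(20, 2))                 # small labelled set
    y_labelled = (X_labelled[:, 0] > 0).astype(int)
    X_unlabelled = rng.normal(size=(200, 2))              # larger unlabelled set

    # 1. Train on the labelled data only.
    model = LogisticRegression().fit(X_labelled, y_labelled)

    # 2. Label the unlabelled data and keep only the high-confidence cases.
    proba = model.predict_proba(X_unlabelled)
    confident = proba.max(axis=1) > 0.9
    X_extra, y_extra = X_unlabelled[confident], proba[confident].argmax(axis=1)

    # 3. Retrain on the enlarged labelled set.
    X_grown = np.vstack([X_labelled, X_extra])
    y_grown = np.concatenate([y_labelled, y_extra])
    model = LogisticRegression().fit(X_grown, y_grown)
    print(confident.sum(), "high-confidence examples added before retraining")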

Find out what your data could do for you. Contact us today for a free and informative consultation.
