
Friday, February 11, 2022

DATA SCIENCE TECHNOLOGY STACK

 

THREE MANAGEMENT LAYERS

 

3.0 OBJECTIVES

• The objective is to explain in detail the core operations of the Three Management Layers, i.e. the Operational Management Layer; the Audit, Balance, and Control Layer; and the Functional Layer.

 

3.1 INTRODUCTION

• The Three Management Layers are a very important part of the framework.

• They watch the overall operations in the data science ecosystem and make sure that things are happening as per plan.

• If things are not going as per plan, these layers have contingency actions in place for recovery or cleanup.

 

3.2 OPERATIONAL MANAGEMENT LAYER

• Operations management is one of the areas inside the ecosystem responsible for designing and controlling the process chains of a data science environment.

• This layer is the center for complete processing capability in the data science ecosystem.

• This layer stores what you want to process along with every processing schedule and workflow for the entire ecosystem.

This area enables us to see an integrated view of the entire ecosystem. It reports the status of each and every process in the ecosystem. This is where we plan our data science processing pipelines.

• We record the following in the operations management layer:

• Definition and Management of Data Processing Stream

• Ecosystem Parameters

• Overall Process Scheduling

• Overall Process Monitoring

• Overall Communication

• Overall Alerting

 

Definition and Management of Data Processing Stream:

• The processing-stream definitions are the building block of the data science environment.

• This section of the ecosystem stores all currently active processing scripts.

• Management here refers to definition management: it describes the workflow of the scripts throughout the ecosystem and manages the correct execution order according to the workflow designed by the data scientist.

 

Ecosystem Parameters:

• The processing parameters are stored in this section, which ensures that a single location is available for all the system parameters.

• In any production system, the parameters for every existing customer can be placed together in a single location, and calls are then made to this location every time the parameters are needed.

• Two ways to maintain a central location for all parameters are: 1. Having a text file which we can import into every processing script. 2. A standard parameter setup script that defines a parameter database which we can import into every processing script.

• Example: an ecosystem setup phase, where a parameter setup script is created once and then imported by every processing script (a minimal sketch follows).
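The sketch below illustrates such a parameter setup script, assuming a hypothetical module name (ecosystem_parameters.py) and placeholder parameter values; any processing script would import it instead of hard-coding its own settings.

```python
# ecosystem_parameters.py -- hypothetical central parameter setup script.
# Every processing script imports this module rather than hard-coding values,
# so the ecosystem has a single location for all system parameters.
import os

PARAMETERS = {
    "customer_name": "example_customer",            # placeholder values only
    "data_lake_path": os.path.join("/data", "lake"),
    "schedule_timezone": "UTC",
    "alert_email": "ops@example.com",
}

def get_parameter(name, default=None):
    """Return a single ecosystem parameter from the central store."""
    return PARAMETERS.get(name, default)
```

A processing script would then simply call get_parameter("data_lake_path") instead of defining the path itself.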

 

Overall Process Scheduling:

• The scheduling plan, along with other things, is stored in this section; it enables centralized control and visibility of the complete scheduling plan for the entire system.

• One such scheduling method is the Drum-Buffer-Rope method.

The Drum-buffer-rope Method:

• It is a standard practice to identify the slowest process among all.

• Once identified, it is then used to control the speed of the complete pipeline.

• This is done by tying or binding the remaining processes of the pipeline to this process.

• The method implies that

• the “drum” is placed at the slow part of the pipeline, to give the processing pace,  

• the “rope” is attached to all the processes from beginning to end of the pipeline; this makes sure that no processing is done that is not attached to the drum.

• This approach ensures that all the processes in the pipeline complete more efficiently, as no process enters or leaves the pipeline without being recorded by the drum’s beat.
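The sketch below illustrates the pacing idea in plain Python (an illustration only, not a prescribed implementation): a bounded queue acts as the buffer, the slowest stage is the drum, and the blocking put() plays the role of the rope that stops upstream work from running ahead.

```python
import queue
import threading
import time

buffer = queue.Queue(maxsize=5)       # the "buffer" in front of the drum

def upstream_stage():
    for item in range(20):
        buffer.put(item)              # blocks when the buffer is full: the "rope"
        print(f"upstream produced {item}")

def drum_stage():
    for _ in range(20):
        item = buffer.get()
        time.sleep(0.5)               # the slowest process sets the pace: the "drum"
        print(f"drum processed {item}")
        buffer.task_done()

threading.Thread(target=upstream_stage).start()
threading.Thread(target=drum_stage).start()
```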

 

Overall Process Monitoring:

• The central monitoring process makes sure that there is a single unified view of the complete system.

• We should always ensure that the monitoring of our data science processes is done from a single point.

• Without central monitoring, running different data science processes on the same ecosystem becomes difficult to manage.

 

Overall Communication:

• The operations management layer handles all communication from the system; it makes sure that any activities that are happening are communicated to the system.

• To make sure that all our data science processes are tracked, we may use a complex communication process.

 

Overall Alerting:

• The alerting section of the operations management layer uses communications to report the correct status of the complete system to the correct person, at the correct time.

3.3 AUDIT, BALANCE, AND CONTROL LAYER

• Any process currently under execution is controlled by the audit, balance, and control layer.

• It is this layer that has the engine that makes sure that every processing request is completed by the ecosystem according to the plan.

• This is the only area where you can observe which processes are currently running within your data science environment.

• It records the following information:

• Process-execution statistics

• Balancing and controls

• Rejects- and error-handling

• Fault codes management

 

3.3.1 Audit:

• An audit refers to an examination of the ecosystem that is systematic and independent

• This sublayer records which processes are running at any given specific point within the ecosystem.

• Data scientists and engineers use the information collected to better understand and plan future improvements to the processing to be done.

• The audit in the data science ecosystem consists of a series of observers, which record prespecified processing indicators related to the ecosystem.

The following are good indicators for audit purposes:

• Built-in Logging

• Debug Watcher

• Information Watcher

• Warning Watcher

• Error Watcher

• Fatal Watcher

• Basic Logging

• Process Tracking

• Data Provenance

• Data Lineage

 

Built-in Logging:

• It is always a good thing to design our logging around an organized, prespecified location; this ensures that we capture every relevant log entry in one place.

• Changing the internal or built-in logging process of the data science tools should be avoided, as this complicates any future upgrades and will prove very costly to correct.

• A built-in logging mechanism, along with a cause-and-effect analysis system, allows you to handle more than 95% of all issues that can arise in the ecosystem.

• Since there are five logging levels, it is good practice to have five watchers, each with its own logging location independent of the others, as described below:

 

Debug Watcher:

• This is the most verbose logging level.

• If any debug logs are discovered in the ecosystem, an alarm should be raised, indicating that the tool is spending processing cycles performing low-level debugging.

 

Information Watcher:

• The information watcher logs information that is beneficial to the running and management of a system.

• It is advised that these logs be piped to the central Audit, Balance, and Control data store of the ecosystem.

Warning Watcher:

• Warning is usually used for exceptions that are handled or other important log events.

• Usually this means that the issue was handled by the tool, which also took corrective action for recovery.

• It is advised that these logs be piped to the central Audit, Balance, and Control data store of the ecosystem.

 

Error Watcher:

• The error watcher logs all unhandled exceptions in the data science tool.

• An Error is a state of the system. This state is not good for the overall processing, since it normally means that a specific step did not complete as expected.

• In case of an error the ecosystem should handle the issue and take the necessary corrective action for recovery.

• It is advised that these logs be piped to the central Audit, Balance, and Control data store of the ecosystem.

 

Fatal Watcher:

• Fatal is a state reserved for special exceptions or conditions for which it is mandatory that the event causing this state be identified immediately.

• This state is not good for the overall processing, since it normally means that a specific step did not complete as expected.

• In case of a fatal error the ecosystem should handle the issue and take the necessary corrective action for recovery.

• It is advised that these logs be piped to the central Audit, Balance, and Control data store of the ecosystem.
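A minimal sketch of the five watchers using Python's standard logging module is shown below; the single log file is an assumption that stands in for the central Audit, Balance, and Control data store, and Python's CRITICAL level plays the role of the fatal watcher.

```python
import logging

logging.basicConfig(
    filename="central_audit_store.log",   # assumed central logging location
    level=logging.DEBUG,                  # capture all five watcher levels
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)

log = logging.getLogger("ecosystem.retrieve_step")

log.debug("Debug watcher: low-level diagnostic detail")
log.info("Information watcher: step started")
log.warning("Warning watcher: handled exception, corrective action taken")
log.error("Error watcher: unhandled exception, step did not complete")
log.critical("Fatal watcher: condition requiring immediate investigation")
```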

Basic Logging: Every time a process is executed, this logging allows you to log everything that occurs to a central file.

 

Process Tracking:

• For process tracking, it is advised to create a tool that will perform a controlled, systematic, and independent examination of the hardware-logging process.

• There is numerous server-based software that monitors hardware-related parameters such as voltage, fan speeds, temperatures, and clock speeds of a computer system.

• It is advised to use the tool that both you and your customer are most comfortable working with.

• It is also advised that the logs generated be fed into the cause-and-effect analysis system.
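As an illustration, the sketch below uses the psutil library (an assumption; any hardware-monitoring tool you and the customer are comfortable with will do) to record a few hardware indicators into the same central log so they can feed the cause-and-effect analysis system.

```python
import logging
import psutil

logging.basicConfig(filename="central_audit_store.log", level=logging.INFO)
log = logging.getLogger("ecosystem.process_tracking")

log.info("cpu_percent=%s", psutil.cpu_percent(interval=1))
log.info("memory_percent=%s", psutil.virtual_memory().percent)

# Temperature and fan sensors are only exposed on some platforms.
if hasattr(psutil, "sensors_temperatures"):
    log.info("temperatures=%s", psutil.sensors_temperatures())
if hasattr(psutil, "sensors_fans"):
    log.info("fan_speeds=%s", psutil.sensors_fans())
```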

Data Provenance:

• For every data entity, all the transformations in the system should be tracked so that a record of activity can be generated.

• This ensures two things: 1. that we can reproduce the data, if required, in the future, and 2. that we can supply a detailed history of the data’s source in the system throughout its transformations.

 

Data Lineage:

• This involves keeping records of every change whenever it happens to every individual data value in the data lake.

• This helps us to figure out the exact value of any data item in the past.

• This is normally accomplished by enforcing a valid-from and valid-to audit entry for every data item in the data lake.
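A minimal in-memory sketch of the valid-from/valid-to idea follows (the item names and store are hypothetical): each change closes the currently valid audit entry and opens a new one, so the value of any data item at a past point in time can be reconstructed.

```python
from datetime import datetime, timezone

lineage = []   # stands in for the lineage store in the data lake

def record_change(item_id, new_value):
    """Close the open audit entry for the item and open a new one."""
    now = datetime.now(timezone.utc)
    for entry in lineage:
        if entry["item_id"] == item_id and entry["valid_to"] is None:
            entry["valid_to"] = now
    lineage.append({"item_id": item_id, "value": new_value,
                    "valid_from": now, "valid_to": None})

def value_at(item_id, when):
    """Return the value the item had at a given point in time."""
    for entry in lineage:
        if (entry["item_id"] == item_id and entry["valid_from"] <= when
                and (entry["valid_to"] is None or when < entry["valid_to"])):
            return entry["value"]

record_change("customer_42.address", "1 Old Street")
record_change("customer_42.address", "9 New Avenue")
```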

 

3.3.2 Balance:

• The balance sublayer is responsible for making sure that the data science environment balances the available processing capability against the required processing capability, or has the ability to upgrade processing capability during periods of extreme processing.

• In such cases the on-demand processing capability of a cloud environment becomes highly desirable.

 

3.3.3 Control:

• The execution of the current active data science processes is controlled by the control sublayer.

• The control elements of the control sublayer are a combination of:

• the control element available in the Data Science Technology Stack’s tools and

• a custom interface to control the overarching work.

• When a processing pipeline encounters an error, the control sublayer attempts a recovery as per our prespecified requirements; if recovery does not work, it schedules a cleanup utility to undo the error.

• The cause-and-effect analysis system is the core data source for the distributed control system in the ecosystem.

 

3.4 YOKE SOLUTION

• The yoke solution is a custom design built around Apache Kafka to coordinate the work between the different parts of the ecosystem.

• Apache Kafka is developed as an open-source stream-processing platform. Its function is to deliver a unified, high-throughput, low-latency platform for handling real-time data feeds.

• Kafka provides a publish-subscribe solution that can handle all activity-stream data and processing.

• The Kafka environment enables you to send messages between producers and consumers, which lets you transfer control between different parts of your ecosystem while ensuring a stable process.

 

3.4.1 Producer:

• The producer is the part of the system that generates the requests for data science processing, by creating structured messages for each type of data science process it requires.

• The producer is the end point of the pipeline that loads messages into Kafka.

 

3.4.2 Consumer:

• The consumer is the part of the process that takes in messages and organizes them for processing by the data science tools.

• The consumer is the end point of the pipeline that offloads the messages from Kafka.
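A minimal producer/consumer sketch using the kafka-python client is shown below (an assumption; any Kafka client works). It assumes a broker on localhost:9092 and a hypothetical topic named data-science-requests.

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: the end point that loads processing-request messages into Kafka.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("data-science-requests",
              b'{"process": "retrieve", "source": "raw_zone"}')
producer.flush()

# Consumer: the end point that offloads messages for the data science tools.
consumer = KafkaConsumer(
    "data-science-requests",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)   # hand the request to the relevant processing tool
    break
```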

 

3.4.3 Directed Acyclic Graph Scheduling:

• This solution uses a combination of graph theory and publish-subscribe stream data processing to enable scheduling.

• You can use the Python NetworkX library to resolve any conflicts, by simply formulating the graph to a specific point before or after you send or receive messages via Kafka.

• That way, you ensure an effective and efficient processing pipeline.
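The sketch below shows the scheduling idea with the Python NetworkX library, using the six supersteps described later in this chapter as example nodes: the steps form a directed acyclic graph, and a topological sort yields a conflict-free execution order before any messages are sent via Kafka.

```python
import networkx as nx

dag = nx.DiGraph()
dag.add_edges_from([
    ("retrieve", "assess"),
    ("assess", "process"),
    ("process", "transform"),
    ("transform", "organize"),
    ("organize", "report"),
])

assert nx.is_directed_acyclic_graph(dag)   # no circular dependencies
print(list(nx.topological_sort(dag)))      # a valid processing order
```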

 

3.4.4 Cause-and-Effect Analysis System

• The cause-and-effect analysis system is the part of the ecosystem that collects all the logs, schedules, and other ecosystem-related information, and enables data scientists to evaluate the quality of their system.

 

3.5 FUNCTIONAL LAYER

• The functional layer of the data science ecosystem is the largest and most essential layer for programming and modeling. Any data science project must have processing elements in this layer.

 

3.6 DATA SCIENCE PROCESS

Following are the five fundamental data science process steps:

• Begin process by asking a What if question

• Attempt to guess at a potential pattern

• Create a hypothesis by putting together observations

• Verify the hypothesis using real-world evidence

• Promptly and regularly collaborate with subject matter experts and customers as and when you gain insights

Begin process by asking a What if question–Decide what you want to know, even if it is only the subset of the data lake you want to use for your data science, which is a good start.

Attempt to guess at a potential pattern–Use your experience or insights to guess a pattern you want to discover, to uncover additional insights from the data you already have.

Create a hypothesis by putting together observations–A hypothesis is a proposed explanation, prepared on the basis of limited evidence, as a starting point for further investigation.

Verify the hypothesis using real-world evidence–Now we verify our hypothesis by comparing it with real-world evidence.

Promptly and regularly collaborate with subject matter experts and customers as and when you gain insights–Things communicated with experts may include technical aspects like workflows or, more specifically, data formats and data schemas.

Data structures in the functional layer of the ecosystem are:

Data schemas and data formats: Functional data schemas and data formats deploy onto the data lake’s raw data, to perform the required schema-on-query via the functional layer (a minimal sketch follows this list).

Data models: These form the basis for future processing to enhance the processing capabilities of the data lake, by storing already processed data sources for future use by other processes against the data lake.

Processing algorithms: The functional processing is performed via a series of well-designed algorithms across the processing chain.

Provisioning of infrastructure: The functional infrastructure provision enables the framework to add processing capability to the ecosystem, using technology such as Apache Mesos, which enables the dynamic provisioning of processing work cells.
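As a small illustration of schema-on-query (an assumption using pandas; the file content and column names are made up), the raw data in the lake stays untouched and a functional schema is applied only when the data is queried:

```python
import io
import pandas as pd

# Stand-in for a raw file in the data lake; the raw data is never modified.
raw_csv = io.StringIO("id,amount,event_date\n1,10.5,2022-02-11\n2,7.25,2022-02-12\n")

# Functional data schema applied only at query time.
functional_schema = {"id": "int64", "amount": "float64"}

df = pd.read_csv(raw_csv, dtype=functional_schema, parse_dates=["event_date"])
print(df.dtypes)
```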

• The processing algorithms and data models are spread across six supersteps for processing the data lake.

 

1. Retrieve: This super step contains all the processing chains for retrieving data from the raw data lake into a more structured format.

2. Assess: This super step contains all the processing chains for quality assurance and additional data enhancements.

3. Process: This super step contains all the processing chains for building the data vault.

4. Transform: This super step contains all the processing chains for building the data warehouse from the core data vault.

5. Organize: This super step contains all the processing chains for building the data marts from the core data warehouse.

6. Report: This super step contains all the processing chains for building virtualization and reporting of the actionable knowledge.

 

3.7 REVIEW QUESTIONS

1. Explain in detail the function of Operational Management Layer

2. Give an overview of the Drum-buffer-rope Method

3. Give an overview of the functions of Audit, Balance, and Control Layer

4. Explain the different ways of implementing the Built-in Logging in the Audit phase.

 5. Explain the different ways of implementing the Basic Logging in the Audit phase.

6. Explain Directed Acyclic Graph Scheduling

7. List & Explain the data structures in the functional layer of the ecosystem

8. Explain the fundamental data science process steps

9. List the super steps for processing the data lake.

 

3.8 REFERENCES

Andreas François Vermeulen, "Practical Data Science: A Guide to Building the Technology Stack for Turning Data Lakes into Business Assets", Apress, 2018.

