RAPID INFORMATION FACTORY (RIF) ECOSYSTEM
The Rapid Information Factory (RIF) system is a technique and tool set used for processing data during development. The Rapid Information Factory is a massively parallel data processing platform capable of processing data sets of theoretically unlimited size.
The Rapid Information Factory (RIF) platform supports five high-level layers:
• Functional Layer:
The functional layer is the core data processing capability of the factory. Its core data processing methodology is the R-A-P-T-O-R framework.
• Retrieve Super Step
The retrieve super step supports the interaction between external data sources and the factory.
• Assess Super Step.
The assess super step supports the data quality clean-up in the factory.
• Process Super Step.
The process super step converts the data into a data vault.
• Transform Super Step.
The transform super step converts the data vault, via sun modeling, into dimensional models that form a data warehouse.
• Organize Super Step.
The organize super step sub-divides the data warehouse into data marts.
• Report Super Step.
The report super step is the virtualization and reporting capability of the factory.
• Operational Management Layer.
• Audit, Balance and Control Layer.
• Utility Layer.
Common components supporting other layers.
• Maintenance Utilities.
• Data Utilities.
• Processing Utilities.
• Business Layer:
Contains the business requirements (functional and non-functional).
Data Science Storage Tools:
• The data science ecosystem includes a series of tools that are used to build your solution. By using these tools and techniques you can obtain information rapidly, and new developments in this tooling appear every day.
• There are two basic data processing ecosystems used in practical data science, as described below:
Schema on write ecosystem:
• A traditional relational database management system requires a schema before the data is loaded. The schema is essentially a blueprint of the data organization, describing how the database is to be constructed.
• A schema is a single structure that represents the logical view of the entire database: how the data is organized and how the pieces relate to each other.
• It is the responsibility of the database designer, working with the programmer, to design the database so that its logic and structure are clearly understood.
• A relational database management system is designed and used to store this data.
• To retrieve data from a relational database system, you run specific Structured Query Language (SQL) statements to perform these tasks.
• A traditional database management system only works once the schema has been described, and that schema provides the single point of view for describing and viewing the data in the database.
• All the data is stored densely in the datastore; schema on write is the widely used methodology for storing such dense data.
• Schema-on-write schemas are built for a specific purpose, which makes them difficult to change while data is being maintained in the database.
• When a lot of raw data is available for processing, some of it is lost during loading into the fixed schema, which weakens future analysis.
• If important data is not stored in the database, you cannot process that data in any later analysis. A small sketch of this ecosystem follows.
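A minimal sketch of the schema-on-write idea, using Python's built-in sqlite3 module (the table and column names are illustrative): the schema must be declared before any row can be loaded, and anything the schema does not describe cannot be stored.

```python
import sqlite3

# Schema on write: the structure must be declared before any data is loaded.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        id        INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        birthdate TEXT
    )
""")

# Rows are validated against the declared schema as they are written.
conn.execute("INSERT INTO customer (id, name, birthdate) VALUES (?, ?, ?)",
             (1, "Alice", "1990-05-01"))

# A record with an undeclared column (e.g. loyalty_tier) cannot be stored as-is;
# whatever the schema does not describe is lost before any future analysis.
for row in conn.execute("SELECT id, name, birthdate FROM customer"):
    print(row)
```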
Schema on read ecosystem:
• A schema-on-read ecosystem does not need a schema up front; you can load the data into the store without one.
• This type of ecosystem stores the data with minimal structure, and the important parts of the schema are applied only during the query phase.
• It can store structured, semi-structured, and unstructured data, and it offers great flexibility in how the data is interpreted when a query is executed.
• These ecosystems suit both experimentation and exploration, where the structure used to retrieve the data can change from query to query.
• Schema on read speeds up the generation of new data and reduces the cycle time from data availability to actionable information.
• Both ecosystems, schema on read and schema on write, are essential for data scientists and data engineers to understand for data preparation, modeling, development, and deployment into production.
• Applying schema on read to structured, unstructured, and semi-structured data can produce slow results, because there is no pre-built schema to support fast retrieval of the data into the data warehouse.
• Schema on read follows an agile way of working and behaves much like a NoSQL database in this environment.
• Schema on read can sometimes throw errors at query time, because structured, unstructured, and semi-structured data are all stored together and there are no fixed rules guaranteeing fast, reliable retrieval as there are in a structured database. A small sketch of this ecosystem follows.
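A minimal schema-on-read sketch in plain Python (the records and field names are illustrative): raw JSON records are kept exactly as they arrive, and a structure is imposed only when a query reads them.

```python
import json

# Raw records are kept exactly as they arrive; no schema is enforced on write.
raw_lines = [
    '{"name": "Alice", "birthdate": "1990-05-01"}',
    '{"name": "Bob", "loyalty_tier": "gold"}',   # an extra field is simply kept
    '{"name": "Carol"}',                         # a missing field is also fine
]

# Schema on read: the fields we care about are chosen only at query time.
def query(lines, fields):
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}

for row in query(raw_lines, ["name", "loyalty_tier"]):
    print(row)
```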
Data Lake:
• A data lake is a storage repository for large amounts of raw data: structured, semi-structured, and unstructured.
• It is the place where all three types of data can be stored, with no fixed limit on volume or structure.
• Comparing the two approaches, a schema-on-write store keeps data in a data warehouse with a predefined structure, whereas a data lake imposes far less structure when storing the data.
• A data lake stores data with little structure because it follows the schema-on-read architecture.
• A data lake allows us to transform raw data, whether structured, semi-structured, or unstructured, into a structured format so that SQL queries can be performed for analysis.
• Most of the time a data lake is deployed on a distributed object storage database, which enables schema on read so that business analytics and data mining tools and algorithms can be applied to the data.
• Storing and accessing the data is fast and simple because no schema is applied on ingest; the data must remain accessible without failures or complex procedures.
• A data lake is similar to a real lake: water flows in from many different places, the small rivers and streams eventually merge into one large body where a great amount of water is stored, and whenever water is needed anyone can draw from it.
• It is a low-cost and effective way to store large amounts of data in a centralized repository for further organizational analysis and deployment, as in the sketch below.
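A minimal sketch of the data-lake idea in Python (the directory layout and file names are illustrative): raw records from different sources are landed as-is in a partitioned folder structure, ready to be read later with schema on read.

```python
import json
import pathlib
from datetime import date

LAKE_ROOT = pathlib.Path("data_lake/raw")   # illustrative root of the lake

def land(source, records):
    """Write raw records exactly as received, partitioned by source and load date."""
    target_dir = LAKE_ROOT / f"source={source}" / f"load_date={date.today():%Y-%m-%d}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "part-0001.json"
    with target.open("w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return target

# Different sources with different shapes -- the lake accepts them all unchanged.
land("web", [{"user": "alice", "page": "/home"}])
land("sensors", [{"device": 7, "temp_c": 21.4}])
```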
Data Vault:
• A data vault is a database modeling method designed for long-term historical storage of large amounts of data, which can then be controlled and managed through the vault.
• In a data vault, data comes from many different sources, and the model is designed so that the data can be loaded in parallel, allowing very large volumes to be brought in without failures or major redesign.
• Building the data vault is the process of transforming the schema-on-read data lake into a schema-on-write structure.
• The data vault is designed around schema-on-read query requests against the data lake, because schema on read increases the speed of generating new data for better analysis and implementation.
• A data vault stores a single version of the data and does not distinguish between good data and bad data.
• The data vault is built from three main components or structures: hubs, links, and satellites.
Hub:
• A hub holds a unique business key with a low rate of change, together with metadata describing the source from which the business key originated.
• A hub contains a surrogate key for each hub item and metadata recording the origin of the business key.
• A hub contains a set of unique business keys that will never change over time.
• There are different types of hubs, such as the person hub, time hub, object hub, event hub, and location hub. The time hub contains IDNumber, IDTimeNumber, ZoneBaseKey, DateTimeKey, and DateTimeValue, and it is interconnected through links such as Time-Person, Time-Object, Time-Event, and Time-Location.
• The person hub contains IDPersonNumber, FirstName, SecondName, LastName, Gender, TimeZone, BirthDateKey, and BirthDate, and it is interconnected through links such as Person-Time, Person-Object, Person-Location, and Person-Event.
• The object hub contains IDObjectNumber, ObjectBaseKey, ObjectNumber, and ObjectValue, and it is interconnected through links such as Object-Time, Object-Event, Object-Location, and Object-Person.
• The event hub contains IDEventNumber, EventType, and EventDescription, and it is interconnected through links such as Event-Person, Event-Location, Event-Object, and Event-Time.
• The location hub contains IDLocationNumber, ObjectBaseKey, LocationNumber, LocationName, Longitude, and Latitude, and it is interconnected through links such as Location-Person, Location-Time, Location-Object, and Location-Event.
Link:
• Links play a very important role in recording transactions and associations between business keys. Tables relate to each other through links in one-to-one, one-to-many, many-to-one, or many-to-many relationships.
• A link represents and connects only the elements of a business relationship, so that when one hub relates to another the data transfers smoothly between them.
Satellites:
• Hubs and links form the structure of the model but hold no chronological or descriptive attributes themselves, so on their own they cannot provide information such as the mean, median, mode, maximum, minimum, or sum of the data.
• Satellites are the structures that store detailed information about the related business keys and their characteristics, and they hold the bulk of the data vault's volume.
• The combination of these three structures, hubs, links, and satellites, helps data analysts, data scientists, and data engineers store the business structure and its types of information in the vault, as sketched below.
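A minimal sketch of the hub, link, and satellite pattern using Python's built-in sqlite3 module (the table and column names are illustrative, loosely following the person and time hubs above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hub: unique business keys only, plus load metadata.
conn.execute("""CREATE TABLE hub_person (
    person_key    INTEGER PRIMARY KEY,      -- surrogate key
    id_person     TEXT UNIQUE NOT NULL,     -- business key
    load_date     TEXT,
    record_source TEXT)""")

conn.execute("""CREATE TABLE hub_time (
    time_key        INTEGER PRIMARY KEY,
    date_time_value TEXT UNIQUE NOT NULL,
    load_date       TEXT,
    record_source   TEXT)""")

# Link: records the relationship between two hubs (for example Person-Time).
conn.execute("""CREATE TABLE link_person_time (
    link_key      INTEGER PRIMARY KEY,
    person_key    INTEGER REFERENCES hub_person(person_key),
    time_key      INTEGER REFERENCES hub_time(time_key),
    load_date     TEXT,
    record_source TEXT)""")

# Satellite: descriptive, historical attributes hanging off the person hub.
conn.execute("""CREATE TABLE sat_person_details (
    person_key INTEGER REFERENCES hub_person(person_key),
    load_date  TEXT,
    first_name TEXT,
    last_name  TEXT,
    gender     TEXT,
    PRIMARY KEY (person_key, load_date))""")
```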
Data Science Processing Tools:
• These tools support the process of transforming data from the data lake into the data vault, and then from the data vault into the data warehouse.
• Most data scientists, data analysts, and data engineers use these processing tools to process the data and move it from the data vault into the data warehouse.
1. Spark:
• Apache Spark is an open-source cluster computing framework. Open source means it is freely available on the internet: you can search for Apache Spark, download the source code, and use it as you wish.
• Apache Spark was developed at the AMPLab of the University of California, Berkeley, and the codebase was later donated to the Apache Software Foundation, which keeps evolving it to make it more effective, reliable, and portable across platforms.
• Apache Spark provides an interface through which programmers and developers can interact directly with the cluster, with built-in data parallelism, making it well suited to data scientists and data engineers.
• Apache Spark can process all types and varieties of data against repositories including the Hadoop Distributed File System and NoSQL databases.
• Companies such as IBM hire many data scientists and data engineers with deep knowledge of the Apache Spark project so that innovation comes easily and new features keep arriving.
• Apache Spark can process data very fast because it holds the data in memory, acting as an in-memory data processing engine.
• It can be built on top of the Hadoop Distributed File System, which makes data processing more efficient and reliable and extends what Hadoop MapReduce can do.
2. Spark Core:
• Spark Core is the base and foundation of the overall project, providing distributed task dispatching, scheduling, and basic input and output functionality.
• Using Spark Core, you can express more complex queries that help when working with complex environments.
• The distributed nature of the Spark ecosystem lets the same processing code run on a small cluster or on hundreds or thousands of nodes without any changes.
• Apache Spark can use Hadoop in two ways: one for storage and the other for processing.
• Spark is not a modified version of the Hadoop Distributed File System and does not strictly depend on Hadoop, which has its own tools for data storage and data processing.
• Apache Spark has many features that make it compatible and reliable. Speed is one of the most important: with Spark, applications are able to run directly on Hadoop and be up to 100 times faster when the data is held in memory.
• Apache Spark Core supports several languages, with built-in functions and APIs for Java, Scala, and Python, so you can write applications in any of these languages.
• Apache Spark Core also offers advanced analytics: it is not limited to map and reduce, but also supports SQL queries, machine learning, and graph algorithms. A small sketch follows.
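A minimal PySpark sketch of Spark Core's map/reduce style, assuming a local Spark installation (the input lines are illustrative):

```python
from pyspark.sql import SparkSession

# Start a local Spark session; spark.sparkContext exposes the Spark Core RDD API.
spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["retrieve assess process", "transform organize report", "process transform"])

# Classic map/reduce over a distributed collection (an RDD).
counts = (lines.flatMap(lambda line: line.split())   # map each line to words
               .map(lambda word: (word, 1))          # map each word to a pair
               .reduceByKey(lambda a, b: a + b))     # reduce the counts per word

print(counts.collect())
spark.stop()
```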
3. Spark SQL:
• Spark SQL is a component on top of Spark Core that presents a data abstraction called DataFrames.
• Spark SQL is a fast, clustered data abstraction, so data manipulation can be done with fast computations.
• It enables the user to run SQL/HQL on top of Spark, and by using this we can process structured, semi-structured, and unstructured data.
• Apache Spark SQL bridges relational processing and procedural processing. This matters when we want to load data from a traditional relational store into the data lake ecosystem.
• Spark SQL is Apache Spark's module for working with structured and semi-structured data, and it originated to overcome the limitations of Apache Hive.
• Hive depends on Hadoop's MapReduce engine for executing and processing data and only allows batch-oriented operation.
• Hive lags in performance because it uses MapReduce jobs to execute ad-hoc queries, and Hive does not let you resume a job if it fails in the middle.
• Spark performs better than Hive in many situations, particularly in query latency and CPU reservation time.
• You can use Spark SQL to query structured and semi-structured data from inside Apache Spark.
• Spark SQL follows the RDD model, so it supports large jobs and mid-query fault tolerance.
• You can easily connect Spark SQL to business tools through its JDBC and ODBC interfaces. A small sketch follows.
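A minimal Spark SQL sketch, assuming the same kind of local session as above (the table data is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sparksql").getOrCreate()

# A DataFrame is the structured data abstraction that Spark SQL works with.
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"],
)

# Register the DataFrame as a temporary view and query it with plain SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```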
4. Spark Streaming:
• Apache Spark Streaming enables powerful interactive and analytical applications over live streaming data. In streaming, the data is not fixed; it arrives continuously from different sources.
• The stream divides the incoming input data into small units for further analytics and processing at the next level.
• Multiple levels of processing are involved: live streaming data is received and divided into small parts or batches, and these batches are then processed by the Spark engine to produce the final stream of results.
• Processing data in Hadoop has very high latency, meaning results are not produced in a timely manner, so it is not suitable for real-time processing requirements.
• Storm can process data in real time, but when failures occur records may be lost or processed more than once, and this kind of mistake causes data loss and repeated processing of records.
• In most scenarios, Hadoop is used for batch processing while Apache Spark is used for live streaming of data.
• Apache Spark Streaming helps fix these issues and provides a reliable, portable, scalable, and efficient system that integrates well, as in the sketch below.
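A minimal sketch using the classic PySpark DStream API, assuming a text source on localhost port 9999 (the host, port, and batch interval are illustrative):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-wordcount")
ssc = StreamingContext(sc, batchDuration=5)   # cut the stream into 5-second batches

# Each batch of lines from the socket becomes a small RDD that Spark processes.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()          # print the word counts of each micro-batch

ssc.start()              # start receiving and processing the live stream
ssc.awaitTermination()   # run until the job is stopped
```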
5. GraphX:
• GraphX is a very powerful graph processing tool and application programming interface for the Apache Spark analytics engine.
• GraphX is the component in Spark for graphs and graph computation.
• GraphX unifies the ETL process (Extract, Transform, and Load), exploratory analysis, and iterative graph computation within a single system.
• This kind of graph analysis can be seen in Facebook friendships, LinkedIn connections, Google Maps, and internet routing.
• A graph is an abstract data type used to implement the directed and undirected graph concepts from mathematical graph theory.
• In graph theory, every edge and every node (vertex) can have data associated with it, such as a numeric value.
• Speed is one of GraphX's most important points: it is comparable with the fastest specialized graph systems while still providing fault tolerance and ease of use.
• It offers many more features with greater flexibility and reliability, and it provides a library of graph algorithms. The sketch below shows the underlying property-graph idea.
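GraphX itself exposes a Scala API, so as a hedged illustration of the property-graph idea it works with (vertices, edges, and the values attached to them), here is a tiny plain-Python sketch (the names and edge attributes are illustrative):

```python
# A property graph: every vertex and every edge carries its own data.
vertices = {
    1: {"name": "Alice"},
    2: {"name": "Bob"},
    3: {"name": "Carol"},
}
edges = [
    (1, 2, {"relation": "follows", "weight": 3}),
    (2, 3, {"relation": "follows", "weight": 1}),
    (3, 1, {"relation": "follows", "weight": 5}),
]

# A simple iterative graph computation: count incoming edges (in-degree) per vertex.
in_degree = {vertex: 0 for vertex in vertices}
for src, dst, attrs in edges:
    in_degree[dst] += 1

for vertex, degree in in_degree.items():
    print(vertices[vertex]["name"], "has", degree, "follower(s)")
```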
6. Mesos:
• Apache Mesos is an open-source cluster manager that was developed at the University of California, Berkeley.
• It provides resource isolation and sharing across distributed applications.
• Mesos shares resources among applications in a fine-grained manner, so that overall cluster utilization can be improved.
• Mesosphere's enterprise product DC/OS runs on top of Apache Mesos.
• It can handle workloads with dynamic resource sharing and isolation.
• Apache Mesos can be deployed to manage large-scale, distributed cluster environments.
• The machines or nodes available in the existing systems are grouped together into a single cluster so that the load across them can be optimized.
• Apache Mesos is the opposite of virtualization: in virtualization one physical resource is split into multiple virtual resources, while in Mesos multiple physical resources are grouped together to form a single virtual resource.
7. Akka:
• Akka is an actor-based toolkit for building concurrent, elastic, and resilient applications.
• An actor can be controlled and limited to perform only its intended task. Akka is an open-source library or toolkit.
• Akka is used to create distributed, fault-tolerant applications, and the library can be integrated into the Java Virtual Machine (JVM) to support its languages.
• Akka integrates with the Scala programming language, in which it is written, and it frees developers from dealing with explicit locking and thread management.
• An actor is an entity that communicates with other actors by passing messages, and it has its own state and behavior.
• Just as in object-oriented programming everything is an object, in Akka everything is an actor: it is an actor-based, message-driven system.
• In other words, an actor is an object that includes and encapsulates its state and behavior, as the sketch below illustrates.
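Akka itself is a Scala/Java toolkit, so as a hedged illustration of the actor model it implements, here is a tiny Python sketch of an actor that owns its state and reacts only to messages from its inbox (the class and message names are illustrative):

```python
import queue
import threading

class CounterActor:
    """A minimal actor: private state, an inbox, and message-driven behavior."""

    def __init__(self):
        self._count = 0                      # state is owned by the actor alone
        self._inbox = queue.Queue()          # all communication is via messages
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def tell(self, message):
        self._inbox.put(message)             # fire-and-forget message send

    def join(self):
        self._thread.join()

    def _run(self):
        while True:
            message = self._inbox.get()
            if message == "increment":
                self._count += 1
            elif message == "report":
                print("count =", self._count)
            elif message == "stop":
                break

actor = CounterActor()
actor.tell("increment")
actor.tell("increment")
actor.tell("report")
actor.tell("stop")
actor.join()
```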
8. Cassandra:
• Apache Cassandra is an open-source distributed database system designed to store and manage large amounts of data across commodity servers.
• Cassandra can be used both as a real-time operational data store for online transactional applications and as a read-intensive store for large-scale analytics.
• Cassandra is designed as a peer-to-peer system of identical nodes, instead of master or named nodes, to ensure there is no single point of failure.
• Apache Cassandra is a highly scalable, high-performance distributed database designed to handle large amounts of data; it is a type of NoSQL database.
• A NoSQL database provides a mechanism to store and retrieve data that is modeled differently from a relational database.
• A NoSQL database uses different data structures compared to a relational database and supports a very simple query language.
• A NoSQL database typically has no fixed schema and does not provide full transactions.
• Cassandra is used by some of the best-known companies, such as Facebook, Twitter, Cisco, and Netflix.
• The components of Cassandra are the node, data center, cluster, commit log, mem-table, SSTable, and Bloom filter. A small connection sketch follows.
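A minimal sketch using the DataStax Python driver (cassandra-driver), assuming a Cassandra node reachable on localhost and an illustrative keyspace and table:

```python
from cassandra.cluster import Cluster

# Connect to a single local node; in production you would list several contact points.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS rif
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("rif")
session.execute("""
    CREATE TABLE IF NOT EXISTS person_hub (
        id_person text PRIMARY KEY,
        first_name text,
        last_name text
    )
""")

session.execute(
    "INSERT INTO person_hub (id_person, first_name, last_name) VALUES (%s, %s, %s)",
    ("P001", "Alice", "Smith"),
)
for row in session.execute("SELECT id_person, first_name FROM person_hub"):
    print(row.id_person, row.first_name)

cluster.shutdown()
```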
9. Kafka:
• Kafka is a high-throughput messaging backbone that enables communication between data processing entities; it is written in Java and Scala.
• Apache Kafka is a highly scalable, reliable, fast, distributed system, suitable for both offline and online message consumption.
• Kafka messages are stored on disk and replicated within the cluster to prevent data loss.
• Kafka is distributed, partitioned, replicated, and fault tolerant, which makes it very reliable.
• The Kafka messaging system scales easily without downtime. It has high throughput for both publishing and subscribing to messages, and it can store terabytes of data.
• Kafka provides a unified platform for handling real-time data feeds and can deliver large amounts of data to diverse consumers.
• Kafka persists all data to disk, which essentially means that all writes go first to the page cache of the OS (RAM); this makes it very efficient to transfer data from the page cache to a network socket. A small producer/consumer sketch follows.
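A minimal sketch using the kafka-python client, assuming a broker on localhost:9092 and an illustrative topic name:

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish a few messages to a topic; Kafka persists and replicates them.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for text in (b"retrieve", b"assess", b"process"):
    producer.send("rif-events", value=text)
producer.flush()

# Subscribe to the same topic and read the messages back.
consumer = KafkaConsumer(
    "rif-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,   # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.topic, message.offset, message.value)
```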
Different Programming Languages in Data Science Processing:
1. Elastic Search:
• Elastic Search is an open-source, distributed search and analytics engine designed for scalability, reliability, and easy management.
• Scalability means it can scale out from any starting point, reliability means it should be trustworthy, and its management should be stress free.
• It combines the power of search with the power of analytics so that developers, programmers, data engineers, and data scientists can work smoothly with structured, unstructured, and time-series data.
• Elastic Search is open source, so anyone can download and work with it; it is developed in Java, and many large organizations use this search engine for their needs.
• It enables the user to explore very large amounts of data at very high speed.
• It is used as a replacement for document and data stores such as MongoDB.
• Elastic Search is one of the most popular search engines, used by organizations such as Google, Stack Overflow, GitHub, and many more.
• Elastic Search is an open-source search engine and has been available under the Apache License, version 2.0. A small indexing and search sketch follows.
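A minimal sketch using the official elasticsearch Python client (recent 8.x-style keyword arguments), assuming a node on localhost:9200 and an illustrative index name:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a couple of documents; no schema has to be declared up front.
es.index(index="blog-posts", id=1, document={"title": "Schema on read", "views": 120})
es.index(index="blog-posts", id=2, document={"title": "Data vault hubs", "views": 80})
es.indices.refresh(index="blog-posts")   # make the documents searchable immediately

# Full-text search over the indexed documents.
result = es.search(index="blog-posts", query={"match": {"title": "schema"}})
for hit in result["hits"]["hits"]:
    print(hit["_source"]["title"], hit["_score"])
```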
2. R Language:
• R is a programming language used for statistical computing and graphics.
• The R language is used by data engineers, data scientists, statisticians, and data miners for developing software and performing data analytics.
• A core requirement before learning R is to understand its library and package concept and how to work with packages easily.
• Commonly used R packages include sqldf, forecast, dplyr, stringr, lubridate, ggplot2, reshape, and so on.
• R is freely available under the GNU General Public License, and it supports many platforms such as Windows, Linux/Unix, and macOS.
• R has built-in support for integration with procedural code written in C, C++, Java, .NET, and Python.
• R has strong capabilities for data handling and data storage.
3. Scala:
• Scala is a general-purpose programming language that supports functional programming and a strong static type system.
• Many data science projects and frameworks are built using the Scala programming language because of its capabilities.
• Scala integrates the features of object-oriented and functional programming in one language, and it interoperates closely with Java.
• The types and behavior of objects are described by classes, and a class can be extended by another class that inherits its properties.
• Scala supports higher-order functions: a function can be passed to, and called by, another function in code.
• When a Scala program is compiled and executed, it is converted into bytecode (a machine-understandable form) that runs on the Java Virtual Machine.
• This means that Scala and Java programs can be compiled and executed on the same JVM, so you can move easily from Java to Scala and vice versa.
• Scala lets you import and use existing Java classes and objects, with their behavior and functions, because Scala and Java both run on the Java Virtual Machine, and you can also create your own classes and objects.
• Instead of requiring thousands of lines of code, Scala reduces code in a way that stays readable, reliable, and portable, helping developers and programmers write code in an easier way.
4. Python:
• Python is a programming language that can be used on a server to create web applications.
• Python can be used for web development, mathematics, and software development, and it can connect to databases to create and modify data.
• Python can handle large amounts of data and is capable of performing complex tasks on that data.
• Python is reliable, portable, and flexible enough to work on different platforms such as Windows, macOS, and Linux.
• Compared with other programming languages, Python is easy to learn, can perform simple as well as complex tasks, reduces the number of lines of code, and helps programmers and developers work in an easy, friendly manner.
• Python supports object-oriented and functional programming and works well with structured data.
• Python supports dynamic data types with dynamic type checking.
• Python is an interpreted language, and its philosophy emphasizes reducing the number of lines of code, as the short example below shows.
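A short, hedged example of the kind of concise data handling described above (the records and field names are illustrative):

```python
from statistics import mean

# A few raw records, as they might arrive from a data lake extract.
people = [
    {"name": "Alice", "age": 34, "city": "Cape Town"},
    {"name": "Bob", "age": 45, "city": "London"},
    {"name": "Carol", "age": 29, "city": "Cape Town"},
]

# Filter, project, and aggregate in a handful of readable lines.
cape_town = [p for p in people if p["city"] == "Cape Town"]
print("Cape Town residents:", [p["name"] for p in cape_town])
print("Average age:", mean(p["age"] for p in cape_town))
```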