
Wednesday, February 2, 2022

BIG DATA ANALYTICS

 

NoSQL

NoSQL is a non-relational DBMS that does not require a fixed schema, avoids joins, and is easy to scale. NoSQL databases are used for distributed data stores with enormous data storage needs, such as Big Data and real-time web apps. For example, companies like Twitter, Facebook, and Google collect terabytes of user data every single day.

SQL

Structured Query Language (SQL), pronounced "S-Q-L" or sometimes "See-Quel", is the standard language for dealing with relational databases. A relational database defines relationships in the form of tables.

SQL programming can be effectively used to insert, search, update, and delete database records.
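As an illustration, here is a minimal sketch of those four operations issued from Java through JDBC. The connection URL, credentials, and the users table are hypothetical placeholders, and a matching JDBC driver must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SqlCrudExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; replace with your own database.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/demo", "user", "password")) {

            // Insert a record.
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO users (id, name) VALUES (?, ?)")) {
                ps.setInt(1, 1);
                ps.setString(2, "Alice");
                ps.executeUpdate();
            }

            // Search (select) records.
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT id, name FROM users WHERE name = ?")) {
                ps.setString(1, "Alice");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getInt("id") + " " + rs.getString("name"));
                    }
                }
            }

            // Update a record.
            try (PreparedStatement ps = conn.prepareStatement(
                    "UPDATE users SET name = ? WHERE id = ?")) {
                ps.setString(1, "Alicia");
                ps.setInt(2, 1);
                ps.executeUpdate();
            }

            // Delete a record.
            try (PreparedStatement ps = conn.prepareStatement(
                    "DELETE FROM users WHERE id = ?")) {
                ps.setInt(1, 1);
                ps.executeUpdate();
            }
        }
    }
}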

 

Comparison of SQL and NoSQL

· SQL databases use a fixed, predefined schema; NoSQL databases use dynamic schemas for unstructured data.

· SQL databases are table-based; NoSQL stores may be document, key-value, column-family, or graph based.

· SQL databases typically scale vertically (bigger servers); NoSQL databases scale horizontally (more servers).

· SQL emphasizes ACID transactions and joins; many NoSQL stores avoid joins and relax consistency for availability and scale.

· Examples: MySQL, Oracle, and PostgreSQL (SQL); MongoDB, Cassandra, and HBase (NoSQL).

Distributed Computing Challenges

Designing a distributed system is neither easy nor straightforward. A number of challenges need to be overcome in order to get an ideal system. The major challenges in distributed systems are listed below:



1. Heterogeneity:

The Internet enables users to access services and run applications over a heterogeneous collection of computers and networks. Heterogeneity (that is, variety and difference) applies to all of the following:

o Hardware devices: computers, tablets, mobile phones, embedded devices, etc.

o Operating systems: MS Windows, Linux, Mac OS, Unix, etc.

o Network: Local network, the Internet, wireless network, satellite links, etc.

o Programming languages: Java, C/C++, Python, PHP, etc.

o Different roles of software developers, designers, system managers

Different programming languages use different representations for characters and for data structures such as arrays and records. These differences must be addressed if programs written in different languages are to communicate with one another. Programs written by different developers cannot communicate with one another unless they use common standards, for example for network communication and for the representation of primitive data items and data structures in messages. For this to happen, standards need to be agreed and adopted, as the Internet protocols have been.

Middleware: The term middleware applies to a software layer that provides a programming abstraction as well as masking the heterogeneity of the underlying networks, hardware, operating systems and programming languages. Most middleware is implemented over the Internet protocols, which themselves mask the differences of the underlying networks, but all middleware deals with the differences in operating systems and hardware.
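To make the point concrete, below is a minimal sketch of encoding a simple record in an agreed-upon representation (big-endian integers in network order, UTF-8 text) so that a program written in another language can decode the same bytes. The record layout here is invented purely for illustration.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class MessageEncoder {
    // Encodes an (id, name) record as: 4-byte big-endian id,
    // 4-byte big-endian length, then the UTF-8 bytes of the name.
    public static byte[] encode(int id, String name) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(id);                                   // big-endian (network order)
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
        out.writeInt(utf8.length);                          // length prefix
        out.write(utf8);                                    // UTF-8 payload
        out.flush();
        return bytes.toByteArray();
    }
}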

Heterogeneity and mobile code: The term mobile code is used to refer to program code that can be transferred from one computer to another and run at the destination – Java applets are an example. Code suitable for running on one computer is not necessarily suitable for running on another because executable programs are normally specific both to the instruction set and to the host operating system.

 

2. Transparency:

Distribution transparency hides from users and application programmers the fact that resources are spread across multiple computers. The main forms of transparency are:

o Access: hide differences in data representation and in how a resource is accessed

o Location: hide where a resource is located

o Migration: hide that a resource may move to another location

o Relocation: hide that a resource may be moved to another location while in use

o Replication: hide that a resource may be copied in several places

o Concurrency: hide that a resource may be shared by several competitive users

o Failure: hide the failure and recovery of a resource

o Persistence: hide whether a (software) resource is in memory or on disk

 

3. Openness

The openness of a computer system is the characteristic that determines whether the system can be extended and re-implemented in various ways. The openness of distributed systems is determined primarily by the degree to which new resource-sharing services can be added and made available for use by a variety of client programs. If the well-defined interfaces for a system are published, it is easier for developers to add new features or replace sub-systems in the future. Example: Twitter and Facebook provide APIs that allow developers to build their own software that interacts with these services.

 

4. Concurrency

Both services and applications provide resources that can be shared by clients in a distributed system. There is therefore a possibility that several clients will attempt to access a shared resource at the same time. For example, a data structure that records bids for an auction may be accessed very frequently when it gets close to the deadline time. For an object to be safe in a concurrent environment, its operations must be synchronized in such a way that its data remains consistent. This can be achieved by standard techniques such as semaphores, which are used in most operating systems.
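The auction example can be sketched in Java using the language's built-in monitor locks in place of an explicit semaphore; the class and method names below are illustrative only.

public class AuctionBids {
    private int highestBid = 0;
    private String highestBidder = null;

    // synchronized ensures that concurrent clients cannot interleave updates
    // and leave the bid record in an inconsistent state.
    public synchronized boolean placeBid(String bidder, int amount) {
        if (amount > highestBid) {
            highestBid = amount;
            highestBidder = bidder;
            return true;
        }
        return false;
    }

    public synchronized String winner() {
        return highestBidder + " at " + highestBid;
    }
}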

 

5. Security

Many of the information resources that are made available and maintained in distributed systems have a high intrinsic value to their users, so their security is of considerable importance. Security for information resources has three components: confidentiality (protection against disclosure to unauthorized individuals), integrity (protection against alteration or corruption), and availability (protection against interference with the means to access the resources).

 

6. Scalability

Distributed systems must be scalable as the number of users increases. Scalability is defined by B. Clifford Neuman as follows:

"A system is said to be scalable if it can handle the addition of users and resources without suffering a noticeable loss of performance or increase in administrative complexity."

Scalability has three dimensions:

o Size: the number of users and resources to be processed; the associated problem is overloading.

o Geography: the distance between users and resources; the associated problem is communication reliability.

o Administration: as a distributed system grows, more and more of it has to be administered; the associated problem is an administrative mess.

 

7. Failure Handling

Computer systems sometimes fail. When faults occur in hardware or software, programs may produce incorrect results or may stop before they have completed the intended computation. The handling of failures is particularly difficult.

 

Hadoop Overview

Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

 

Hadoop Architecture

At its core, Hadoop has two major layers namely –

· Processing/Computation layer (MapReduce), and

·  Storage layer (Hadoop Distributed File System).

MapReduce

MapReduce is a parallel programming model for writing distributed applications, devised at Google for efficient processing of large amounts of data (multi-terabyte datasets) on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. The MapReduce program runs on Hadoop, which is an Apache open-source framework.
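As a concrete illustration, here is a minimal word-count job in the spirit of the well-known Hadoop example: the map phase emits (word, 1) pairs and the reduce phase sums the counts for each word. Input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: split each input line into words and emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}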

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system that is designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. It is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications having large datasets.
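A minimal sketch of storing a file in HDFS from Java is shown below. It assumes a cluster reachable at the illustrative address hdfs://localhost:9000 and a local file /tmp/sample.txt; in practice fs.defaultFS would come from core-site.xml rather than being set in code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative NameNode address; normally read from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path local = new Path("/tmp/sample.txt");        // local source (assumed to exist)
            Path remote = new Path("/user/demo/sample.txt"); // HDFS destination
            fs.copyFromLocalFile(local, remote);
            System.out.println("Block size: "
                    + fs.getFileStatus(remote).getBlockSize() + " bytes");
        }
    }
}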

Apart from the above-mentioned two core components, Hadoop framework also includes the following two modules –

· Hadoop Common − These are Java libraries and utilities required by other Hadoop modules.

· Hadoop YARN − This is a framework for job scheduling and cluster resource management.

How Does Hadoop Work?

It is quite expensive to build bigger servers with heavy configurations that can handle large-scale processing. As an alternative, you can tie together many commodity single-CPU computers into a single functional distributed system; practically, the clustered machines can read the dataset in parallel and provide much higher throughput. Moreover, this is cheaper than one high-end server. So the first motivation for using Hadoop is that it runs across clusters of low-cost machines.

Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs –

· Data is initially divided into directories and files. Files are divided into uniform-sized blocks of 128 MB or 64 MB (preferably 128 MB).

· These files are then distributed across various cluster nodes for further processing.

·  HDFS, being on top of the local file system, supervises the processing.

·  Blocks are replicated for handling hardware failure.

·  Checking that the code was executed successfully.

·  Performing the sort that takes place between the map and reduce stages.

·  Sending the sorted data to a certain computer.

·  Writing the debugging logs for each job.

 

Advantages of Hadoop

· The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines and, in turn, utilizes the underlying parallelism of the CPU cores.

· Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA); rather, the Hadoop library itself has been designed to detect and handle failures at the application layer.

· Servers can be added or removed from the cluster dynamically and Hadoop continues to operate without interruption.

· Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms since it is Java-based.

 

Processing Data with Hadoop - Managing Resources and Applications with Hadoop YARN

YARN splits resource management and job scheduling/monitoring into separate daemons: a global ResourceManager and a per-application ApplicationMaster. An application is either a single job or a DAG of jobs.

The ResourceManager has two components: the Scheduler and the ApplicationsManager.

The Scheduler is a pure scheduler, i.e. it does not track the status of running applications; it only allocates resources to the various competing applications. It also does not restart jobs that fail due to hardware or application errors. The Scheduler allocates resources based on the abstract notion of a container, which is simply a slice of resources such as CPU, memory, disk, and network.

The ApplicationsManager performs the following tasks:

· Accepts job submissions from clients.

· Negotiates the first container for a specific ApplicationMaster.

· Restarts the ApplicationMaster container after an application failure.

The ApplicationMaster has the following responsibilities:

· Negotiates containers from the Scheduler.

· Tracks container status and monitors progress.

YARN supports resource reservation via the ReservationSystem. A user can reserve a set of resources over time, with temporal constraints, for the execution of a particular job; the ReservationSystem ensures that those resources are available to the job until it completes, and it also performs admission control for reservations.

 

YARN can scale beyond a few thousand nodes via YARN Federation, which allows multiple sub-clusters to be wired together into a single massive cluster. We can use many independent clusters together for a single large job, achieving a very large-scale system.

 

Let us summarize how Hadoop works step by step:

· Input data is broken into blocks of 128 MB in size, and the blocks are then moved to different nodes.

·  Once all the blocks of the data are stored on DataNodes, the user can process the data.

·  The ResourceManager then schedules the program (submitted by the user) on individual nodes.

·  Once all the nodes process the data, the output is written back to HDFS.

 

Interacting with Hadoop Ecosystem

Hadoop has an ecosystem that has evolved from its three core components: processing, resource management, and storage. In this topic, you will learn the components of the Hadoop ecosystem and how they perform their roles during Big Data processing. The Hadoop ecosystem is continuously growing to meet the needs of Big Data. It comprises the following twelve components:

 

· HDFS(Hadoop Distributed file system)

·  HBase

·  Sqoop

·  Flume

·  Spark

·  Hadoop MapReduce

·  Pig

·  Impala

·  Hive

·  Cloudera Search

·  Oozie

·  Hue

Let us understand the role of each component of the Hadoop ecosystem.

Components of Hadoop Ecosystem

Let us start with the first component HDFS of Hadoop Ecosystem.

HDFS (HADOOP DISTRIBUTED FILE SYSTEM)

· HDFS is a storage layer for Hadoop.

·  HDFS is suitable for distributed storage and processing, that is, while the data is being stored, it first gets distributed and then it is processed.  

· HDFS provides Streaming access to file system data.

·  HDFS provides file permission and authentication.

·  HDFS uses a command line interface to interact with Hadoop.

 

So what stores data in HDFS? One such component is HBase, which keeps its data in HDFS.

 

HBase

· HBase is a NoSQL, or non-relational, database.

·  HBase is important and mainly used when you need random, real-time read or write access to your Big Data.

· It provides support to a high volume of data and high throughput.

·  In HBase, a table can have thousands of columns, as the client sketch below illustrates.
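Below is a minimal sketch of such random, real-time access using the HBase Java client API. It assumes a table named users with a column family info has already been created (for example from the HBase shell); those names are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row "user1", column info:name.
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the same cell back.
            Get get = new Get(Bytes.toBytes("user1"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}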

