Industries Needs: Data Lifecycle and Analytics in the AWS Cloud

Monday, February 21, 2022

Data Lifecycle and Analytics in the AWS Cloud

 

Introduction

Data within organizations is measured in petabytes, and it grows exponentially each year.

IT teams are under pressure to quickly orchestrate data storage, analytics, and visualization projects that get the most from their organizations’ number one asset: their data. They’re also tasked with ensuring customer privacy and meeting security and compliance mandates. These challenges are cost-effectively addressed with cloud-based IT resources, as an alternative to fixed, conventional IT infrastructure (e.g. owned data centers and computing hardware managed by internal IT departments). By modernizing their approach to data lifecycle management, and leveraging the latest cloud-native analytics tools, organizations reduce costs and gain operational efficiencies, while enabling data-driven decision-making.

 

What is Big Data?

A dataset too large or complex for traditional data processing mechanisms is called “big data”. Big data also encompasses a set of data management challenges resulting from an increase in volume, velocity, and variety of data. These challenges cannot be solved with conventional data storage, database, networking, compute, or analytics solutions. Big data includes structured, semi-structured, and unstructured data. “Small data,” on the other hand, refers to structured data that is manageable within existing databases. Whether your data is big or small, the lifecycle stages are universal. It’s the data management and IT tools that will differ in terms of scale and costs.

To support advanced analytics for big and small data projects, cloud services support a variety of use cases: descriptive analytics that address what happened and why (e.g. traditional queries, scorecards, and dashboards); predictive analytics that measure the probability of a given event in the future (e.g. early alert systems, fraud detection, preventive maintenance applications, and forecasting); and prescriptive analytics that answer, “What should I do if ‘x’ happens?” (e.g. recommendation engines). With AWS, it’s also technically and economically feasible to collect, store, and share larger datasets and analyze them to reveal actionable insights.

AWS offers a complete cloud platform designed for big data across data lakes or big data stores, data warehousing, distributed analytics, real-time streaming, machine learning, and business intelligence services. These cloud-based IT infrastructure building blocks – along with AWS Cloud capabilities that meet the strictest security requirements – can help address a wide range of analytics challenges.

 

What is the Data Lifecycle?

As data is generated, it moves from its raw form to a processed version, to outputs that end users need to make better decisions. All data goes through this data lifecycle. Organizations can use AWS Cloud services in each stage of the data lifecycle to quickly and cost-effectively prepare, process, and present data to derive more value from it. The five data lifecycle stages include: data ingestion, data staging, data cleansing, data analytics and visualization, and data archiving.

 


1. The first stage is data ingestion. Data ingestion is the movement of data from an external source to another location for analysis. Data can move from local or physical disks where value is locked (e.g. in an IT data center) to the cloud’s virtual disks, where it sits closer to end users and where machine learning and analytics tools can be applied. During data ingestion, high-value data sources are identified, validated, and imported, while data files are stored and backed up in the AWS Cloud. Data in the cloud is durable, resilient, secure, cost-effectively stored, and most importantly, accessible to a broad set of users. Common data sources include transaction files, large systems (e.g. CRM, ERP), user-generated data (e.g. clickstream data, log files), sensor data (e.g. from Internet-of-Things or mobile devices), and databases. AWS services available in this stage include Amazon Kinesis, AWS Direct Connect, AWS Snowball/Snowball Edge/Snowmobile, AWS DataSync, AWS Database Migration Service, and AWS Storage Gateway.

 

2. The second stage is data staging. Data staging involves performing housekeeping tasks prior to making data available to users. Organizations house data in multiple systems or locations, including data warehouses, spreadsheets, databases, and text files. Cloud-based tools make it easy to stage data or create a data lake in one location (e.g. Amazon S3), while avoiding disparate storage mechanisms. AWS services available in this stage include Amazon S3, Amazon Aurora, Amazon RDS, and Amazon DynamoDB.

 

3. The third stage is data cleansing. Before data is analyzed, data cleansing detects, corrects, and removes inaccurate data or corrupted records or files. It also identifies opportunities to append or modify dirty data to improve the accuracy of analytical outputs. In some cases, data cleansing involves translating files, turning speech files to text, digitizing audio and image files for processing, or adding metadata tags for easier search and classification. Ultimately, data cleansing transforms data so it’s optimized for analysis (e.g. through Extract, Transform, Load (ETL) processes). AWS services available in this stage include AWS Glue (ETL), AWS Glue Data Catalog, Amazon EMR, and Amazon SageMaker Ground Truth.

 

4. The fourth stage is data analytics and visualization. The real value of data can be extracted in this stage. Decision-makers use analytics and visualization tools to predict customer needs, improve operations, transform broken processes, and innovate to compete. The ability for mission owners and executives to rely on data reduces error-prone and costly guesswork. AWS services available in this stage include Amazon Athena, Amazon Redshift, Amazon QuickSight, Amazon SageMaker, Amazon Comprehend, Amazon Comprehend Medical, and AWS DeepLens.

5. The fifth stage is data archiving. The AWS Cloud facilitates data archiving, enabling IT departments to invest more time in other stages of the data lifecycle. These storage solutions meet numerous compliance standards, carry security certifications, and provide built-in encryption, enabling compliance from day one. AWS services in this stage include Amazon S3 Glacier, Amazon S3 Glacier Deep Archive, and AWS Storage Gateway.
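As a minimal sketch of how the archiving stage can be automated, the snippet below uses boto3 (the AWS SDK for Python) to add a lifecycle rule that transitions objects to Amazon S3 Glacier and then to S3 Glacier Deep Archive. The bucket name, prefix, and transition timings are illustrative assumptions, not a prescribed configuration.

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-analytics-bucket",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-processed-data",
                "Filter": {"Prefix": "processed/"},  # archive only this prefix
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"},        # to S3 Glacier after 90 days
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},  # to Deep Archive after a year
                ],
            }
        ]
    },
)

With a rule like this in place, archiving happens continuously in the background rather than as a separate project for the IT team.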

 

Common data management challenges to address in the data lifecycle

 

Many organizations are sitting on valuable data and performing little-to-no analysis on it. While some organizations recognize and capitalize on this value, others are hindered by concerns that building complex analytics projects will drain resources.

Organizations of all sizes face challenges as they seek to derive meaning and value from their data. At each stage of the data lifecycle, the “five Vs” of data management can help dictate which tools are required to address a particular problem. This includes the volume, velocity, variety, veracity, and value of your data. Challenges associated with the five Vs include a high volume of data, data in multiple systems and formats, increasingly diverse users with differing analytics needs, the requirement to support emerging predictive and real-time analytics, semi-structured or unstructured data, and a lack of in-house data science expertise.

Data, data, everywhere – a growing volume. Organizations are amassing and storing ever-increasing amounts of data, yet only a fraction of that data enters the analytics stage. Data is often housed in multiple systems or locations, including data warehouses, log files, and databases. As this volume of data grows, classifying it becomes critical. Is it qualitative or quantitative? What are the storage costs and can this storage mechanism scale? How much of this data is being used, and how much is being ignored? What are the common sources of this data? These questions help determine the cost-benefit analysis of employing various options for handling high volumes of data.

Velocity of data. In addition, data is being generated at a greater velocity. Data velocity impacts highly meaningful applications, including public safety and emergency alerts, cybersecurity or physical security breaches, customer service applications, IoT sensor data that triggers immediate responses, and early indicators that drive interventions. Is the data being generated in real-time, in batches at frequent intervals, or as a result of events? As you amass data at faster rates, it calls for a modern IT approach – one that simplifies analysis, reduces storage costs, and untethers data from conventional data centers for easier analysis. These challenges continue to outstrip conventional, in-house IT infrastructure resources.

Questionable data veracity or dirty data. Data veracity refers to the integrity, reliability, accuracy, and trustworthiness of data. Data that is unbiased, de-duplicated, consistent, and stable is ideal. It helps ensure that analytics outputs based on that data are also accurate. In the instance of a survey, is there a way to validate that users accurately entered their zip codes, and for the survey owner to remove or correct bad zip codes?
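As a small illustration of the survey example above, the sketch below validates user-entered ZIP codes before they enter the analytics stage. The field names and validation rule are illustrative assumptions rather than part of any AWS service.

import re

ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")  # five-digit ZIP with an optional +4 extension

def clean_zip(raw_value):
    """Return a normalized ZIP code, or None if the entry cannot be trusted."""
    candidate = str(raw_value).strip()
    return candidate if ZIP_PATTERN.match(candidate) else None

survey_rows = [{"zip": "20001"}, {"zip": "2o001"}, {"zip": "98109-1234"}]
valid_rows = [row for row in survey_rows if clean_zip(row["zip"])]         # keep trustworthy entries
rejected_rows = [row for row in survey_rows if not clean_zip(row["zip"])]  # route back for correction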

A broad variety of data. The variety of data in an organization spans structured, unstructured, or semi-structured data. These formats often dictate how data is stored, processed, and analyzed. For example, structured data is often housed in relational databases with defined rows and columns (e.g. social security numbers, birth dates, zip codes). Semi-structured data does not follow the formal arrangement of databases or data tables, but has tags to indicate a hierarchy of records and fields within the data. NoSQL databases and JSON documents are considered semi-structured and may be stored as flat files or in object stores.

Lastly, unstructured data may include audio files, images, videos, and other file types (e.g. meeting notes). Unstructured data requires the additional step of adding metadata tags for easier search and retrieval. Without a conscious effort to include the unstructured data in the analysis pipeline, an organization risks missing out on relevant insights. This unstructured data, when left untapped, is labeled ‘dark data.’
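One lightweight way to keep unstructured files out of the “dark data” category is to attach metadata and tags as they land in storage. The sketch below does this with boto3 against Amazon S3; the bucket, key, and tag values are illustrative assumptions.

import boto3

s3 = boto3.client("s3")

# Upload an unstructured file with descriptive metadata attached at ingest time.
with open("meeting-notes-2022-02-21.txt", "rb") as body:
    s3.put_object(
        Bucket="example-analytics-bucket",          # hypothetical bucket
        Key="raw/meeting-notes/2022-02-21.txt",
        Body=body,
        Metadata={"department": "operations", "topic": "quarterly-review"},
    )

# Object tags can be added (or updated later) to support search and lifecycle rules.
s3.put_object_tagging(
    Bucket="example-analytics-bucket",
    Key="raw/meeting-notes/2022-02-21.txt",
    Tagging={"TagSet": [{"Key": "data-classification", "Value": "unstructured"}]},
)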

The variable value of data. Not all data is of equal value. Assigning and ranking the value of data is required for prioritization, before embarking on a project. It’s important to consider what outcomes the data is driving, who uses it, whether it’s essential for mission-critical work, and how often multiple users will need it for security, compliance, or analysis.

Diversity of data users, stakeholders, and stewards. A multitude of organizational stakeholders can benefit from data to do their jobs well. To accomplish this, each may rely on a slightly different combination of data visualization tools and analytics packages. This begets the need to break down silos within a data management practice so that teams can share data, collaborate, and drive greater insights. Placing a business intelligence (BI) tool on top of a dataset to run reports is a common use case; however, data scientists and developers may conduct analyses against a variety of data sources, including through APIs.

Evolving from retrospective analytics to real-time and predictive analytics. Data analysis isn’t purely retrospective. Data can also be used to develop and deploy models for real-time predictions. More than ever before, organizations want to extract predictions from their data and use machine learning to detect patterns previously unseen in their historical data.

Lack of in-house data science expertise. Data scientists typically extract insights from data by applying engineering tools to a variety of data sources. They also prepare and process data, know “what” information is most valuable for organizational decision-making, and work with data engineers to answer the “why.” Still, many organizations lack the in-house data science expertise to support necessary large-scale data analytics projects.

 

The data lifecycle in detail

Let’s take a deeper look at the five stages of the data lifecycle involved in the preparation, processing, and presentation of data for decision-making:

 


Stage 1 – Data Ingestion

Data ingestion entails the movement of data from an external source into another location for further analysis. Generally, the destination for data is some form of storage or a database (we discuss storage further in the Data Staging section of this guide). For example, ingestion can involve moving data from an on-premises data center or physical disks to virtual disks in the cloud, accessed via an internet connection. Data ingestion also involves identifying the correct data sources, validating and importing data files from those sources, and sending the data to the desired destination. Data sources can include transactions, enterprise-scale systems such as Enterprise Resource Planning (ERP) systems, clickstream data, log files, device or sensor data, or disparate databases.

Key questions to consider: What is the volume and velocity of my data? For instance, ingesting website clickstream data, ERP data, or sensor data in an IoT scenario would warrant Amazon Kinesis Data Streams. However, ingesting a database would be best accomplished through the AWS Database Migration Service (DMS). What is the source and format of the data? For satellite data involving micro-batch processing, Amazon S3 Transfer Acceleration is the more suitable ingestion solution. Still, the selection of the AWS ingestion service will depend on the source, volume, velocity, and format of the data at hand.

A real-time streaming data service: Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so that you can get timely insights and react quickly to new information. Amazon Kinesis offers key capabilities to cost-effectively process streaming data at any scale, along with the flexibility to choose the tools that best suit the requirements of your application. With Amazon Kinesis, you can ingest real-time data such as video, audio, application logs, website clickstream data, and IoT telemetry data for machine learning, analytics, and other applications. Amazon Kinesis enables you to process and analyze data as it arrives and respond instantly instead of having to wait until all your data is collected before the processing can begin.

A video streaming service: Amazon Kinesis Video Streams makes it easy to securely stream video from connected devices to AWS for analytics, ML analysis, playback, and other processing. Kinesis Video Streams automatically provisions and elastically scales all the infrastructure needed to ingest streaming video data from millions of devices. It also durably stores, encrypts, and indexes video data in your streams, and allows you to access your data through easy-to-use APIs. Kinesis Video Streams enables you to play back video for live and on-demand viewing, and quickly build applications that take advantage of computer vision and video analytics through integration with Amazon Rekognition Video, and libraries for ML frameworks such as Apache MXNet, TensorFlow, and OpenCV.


A real-time data streaming service: Amazon Kinesis Data Streams (KDS) is a massively scalable and durable real-time data streaming service. KDS can continuously capture gigabytes of data per second from hundreds of thousands of sources, such as website clickstreams, database event streams, financial transactions, social media feeds, IT logs, and location-tracking events. The data collected is available in milliseconds to enable real-time analytics use cases such as real-time dashboards, real-time anomaly detection, dynamic pricing, and more.
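As a minimal sketch (not an official example), the snippet below writes a clickstream event into a Kinesis data stream with boto3. The stream name and event payload are illustrative assumptions, and the stream itself is assumed to already exist.

import json
import boto3

kinesis = boto3.client("kinesis")

event = {"user_id": "u-123", "page": "/checkout", "ts": "2022-02-21T10:15:00Z"}

kinesis.put_record(
    StreamName="clickstream-events",          # hypothetical, pre-created stream
    Data=json.dumps(event).encode("utf-8"),   # the payload must be bytes
    PartitionKey=event["user_id"],            # keeps one user's events on the same shard
)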

A capture-transform-load of streamed data: Amazon Kinesis Data Firehose is the easiest way to reliably load streaming data into data stores and analytics tools. It can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk, enabling near real-time analytics with existing business intelligence tools and dashboards you’re already using today. It is a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration. It can also batch, compress, transform, and encrypt the data before loading it, minimizing the amount of storage used at the destination and increasing security.
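Along the same lines, here is a hedged sketch of pushing a record through Kinesis Data Firehose, which then batches and delivers it to a configured destination such as Amazon S3. The delivery stream name is an illustrative assumption and must already be set up.

import json
import boto3

firehose = boto3.client("firehose")

reading = {"sensor_id": "s-42", "temperature_c": 21.7}

firehose.put_record(
    DeliveryStreamName="sensor-to-s3",                              # hypothetical delivery stream
    Record={"Data": (json.dumps(reading) + "\n").encode("utf-8")},  # newline-delimited records at the destination
)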

A dedicated network between AWS and on-premises IT: AWS Direct Connect is a cloud service solution that makes it easy to establish a dedicated network connection from your premises to AWS – bypassing your Internet service provider and removing network congestion. Transferring large data sets over the Internet can be time-consuming and expensive. Using AWS Direct Connect, you can establish private connectivity between AWS and your data center, office, or colocation environment, which, in many cases, can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections.

 


A storage device for data transport: AWS Snowball is a data transport hardware device designed to securely transfer large amounts of data into and out of the AWS Cloud. AWS Snowball addresses common challenges with large-scale data transfers, including high network costs, long transfer times, and security concerns. Customers use Snowball to migrate analytics data, genomics data, video libraries, image repositories, and backups, as well as to archive data as part of data center shutdowns, tape replacement, or application migration projects. Transferring data with Snowball is fast, secure, and can cost as little as one-fifth of the price of transferring data via high-speed Internet.

If you are running MapReduce jobs on premises and storing data in the Hadoop Distributed File System (HDFS), you can now copy that data directly from HDFS to an AWS Snowball without using an intermediary staging file. Because HDFS is often used for big data workloads, this can greatly simplify the process of importing large amounts of data to AWS for further processing.

 


A storage device for data transport with compute: AWS Snowball Edge is a data migration and edge computing device that comes in two options. Snowball Edge Storage Optimized provides 100 TB of capacity and 24 vCPUs and is well suited for local storage and large-scale data transfer. Snowball Edge Compute Optimized provides 52 vCPUs and an optional GPU for use cases such as advanced machine learning and full-motion video analysis in disconnected environments. Customers can use these two options for data collection, machine learning and processing, and storage in environments with intermittent connectivity (such as manufacturing, industrial, and transportation), or in extremely remote locations (such as military or maritime operations) before shipping it back to AWS. These devices may also be rack-mounted and clustered together to build larger, temporary installations.

When to use Snowball Edge? Besides storage, Snowball Edge includes compute capabilities. Snowball Edge supports specific Amazon EC2 instance types as well as AWS Lambda functions, so customers may develop and test in AWS then deploy applications on devices in remote locations to collect, pre-process, and return the data. Common use cases include data migration, data transport, image collation, IoT sensor stream capture, and machine learning.

A large scale storage device for data transport: AWS Snowmobile is a petabyte-scale data transfer service used to move extremely large amounts of data to AWS. You can transfer up to 100PB per Snowmobile, a 45-foot long ruggedized shipping container, pulled by a semi-trailer truck. Snowmobile makes it easy to move massive volumes of data to the cloud, including video libraries, image repositories, or even a complete data center migration. Transferring data with Snowmobile is secure, fast, and cost effective.

A long distance file transfer service: Amazon S3 Transfer Acceleration enables fast, easy, and secure transfers of files over long distances (100 or more miles) between your client and your Amazon S3 bucket. We will discuss Amazon S3 in greater detail in the next chapter on staging. During ingestion, Amazon S3 Transfer Acceleration leverages Amazon CloudFront’s globally distributed AWS edge locations. An edge location is where end users can access services located in the AWS Cloud in closer proximity and with reduced latency. AWS has 155 edge locations around the world, which are used in conjunction with the Amazon CloudFront content delivery network. As data arrives at an AWS edge location, it is routed to an Amazon S3 bucket over an optimized network path.
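A minimal sketch of this, assuming a hypothetical bucket and file, is shown below: it enables Transfer Acceleration on the bucket with boto3 and then uploads through the accelerate endpoint.

import boto3
from botocore.config import Config

s3 = boto3.client("s3")

# One-time switch: enable Transfer Acceleration on the bucket.
s3.put_bucket_accelerate_configuration(
    Bucket="example-analytics-bucket",              # hypothetical bucket
    AccelerateConfiguration={"Status": "Enabled"},
)

# Later transfers route through the nearest edge location when the client
# is configured to use the accelerate endpoint.
accelerated = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
accelerated.upload_file(
    "satellite-batch-001.bin",        # hypothetical local file
    "example-analytics-bucket",
    "ingest/satellite-batch-001.bin",
)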

A data center gateway service: AWS Storage Gateway is a hybrid-storage service that enables your on-premises applications to seamlessly use AWS Cloud storage. You can use the service for backup and archiving, disaster recovery, cloud data processing, storage tiering, and migration. The service helps you reduce and simplify storage infrastructure in your data center, branch, or remote offices.

Your applications connect to the service through a virtual machine or hardware gateway appliance using standard storage protocols, such as NFS, SMB and iSCSI. The gateway connects to AWS storage services, such as Amazon S3, Amazon S3 Glacier, Amazon S3 Glacier Deep Archive, Amazon EBS, and AWS Backup, providing storage for files, volumes, snapshots, and virtual tapes in AWS.

A database migration service: AWS Database Migration Service helps migrate databases to AWS quickly and securely. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database. The AWS Database Migration Service can migrate your data to and from the most widely used commercial and open-source databases. As of early 2019, more than 120,000 databases had been migrated using AWS Database Migration Service.

AWS Database Migration Service supports homogenous migrations, such as Oracle to Oracle, as well as heterogeneous migrations between different database platforms, such as Oracle or Microsoft SQL Server to Amazon Aurora. With AWS Database Migration Service, you can continuously replicate your data with high availability and consolidate databases into a petabyte-scale data warehouse by streaming data to Amazon Redshift and Amazon S3.
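To make the continuous-replication idea concrete, here is a hedged sketch of creating a full-load-plus-CDC replication task through the DMS API with boto3. All ARNs, the task name, and the table mapping are illustrative assumptions; the source endpoint, target endpoint, and replication instance must already exist in DMS.

import json
import boto3

dms = boto3.client("dms")

table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-sales-schema",
            "object-locator": {"schema-name": "sales", "table-name": "%"},  # hypothetical schema
            "rule-action": "include",
        }
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="oracle-to-aurora-sales",                            # hypothetical task name
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SRC",           # hypothetical ARN
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TGT",           # hypothetical ARN
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",      # hypothetical ARN
    MigrationType="full-load-and-cdc",   # full load first, then ongoing change data capture
    TableMappings=json.dumps(table_mappings),
)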

Moving data across databases: AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals. With AWS Data Pipeline, you can regularly access your data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon EMR.

A data transfer service between on-premises and AWS Cloud: AWS DataSync is a data transfer service that makes it easy for you to automate moving data between on-premises storage and Amazon S3 or Amazon Elastic File System (Amazon EFS). DataSync automatically handles many of the tasks related to data transfers that can slow down migrations or burden your IT operations, including running your own instances, handling encryption, managing scripts, network optimization, and data integrity validation. You can use DataSync to transfer data at speeds up to 10 times faster than open-source tools. DataSync uses an on-premises software agent to connect to your existing storage or file systems using the Network File System (NFS) protocol, so you don’t have to write scripts or modify your applications to work with AWS APIs. You can use DataSync to copy data over AWS Direct Connect or internet links to AWS. The service enables one-time data migrations, recurring data processing workflows, and automated replication for data protection and recovery. Getting started with DataSync is easy: deploy the DataSync agent on premises, connect it to a file system or storage array, select Amazon EFS or S3 as your AWS storage, and start moving data. You pay only for the data you copy.
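The flow just described can be sketched with boto3 as follows: register the on-premises NFS share exposed through a DataSync agent, register an S3 destination, then create and start a transfer task. The hostnames, ARNs, and IAM role below are illustrative assumptions.

import boto3

datasync = boto3.client("datasync")

# Source: an NFS export reachable through an already-activated DataSync agent.
source = datasync.create_location_nfs(
    ServerHostname="fileserver.example.local",    # hypothetical NFS server
    Subdirectory="/exports/research",
    OnPremConfig={"AgentArns": ["arn:aws:datasync:us-east-1:123456789012:agent/agent-EXAMPLE"]},
)

# Destination: an S3 bucket, accessed through an IAM role DataSync can assume.
destination = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::example-analytics-bucket",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::123456789012:role/DataSyncS3Role"},  # hypothetical role
)

# Create the transfer task and kick off an execution.
task = datasync.create_task(
    SourceLocationArn=source["LocationArn"],
    DestinationLocationArn=destination["LocationArn"],
    Name="nfs-to-s3-migration",
)
datasync.start_task_execution(TaskArn=task["TaskArn"])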

You can also transform and move AWS Cloud data into your data store using AWS Glue. AWS Glue is covered in more depth in Stage 3.


 

