Industries Needs: TOP 6 DATA SCIENCE AND ANALYTICS TRENDS

Tuesday, February 8, 2022

TOP 6 DATA SCIENCE AND ANALYTICS TRENDS


How the Data Cloud accelerates machine learning

INTRODUCTION

Data science has evolved dramatically over the last 10 years. However, very few organizations have experienced the full business impact or competitive advantage from their advanced analytics, despite significant investments in data science and machine learning (ML). The reason? Many of the tools needed to scale ML are too complicated, and necessary skill sets are in short supply. But change is now afoot. Technology advancements in 2021 will significantly impact the way in which data scientists and data analysts work. In 2021, six trends have the potential to accelerate ML and move organizations from descriptive and diagnostic analytics (explaining what happened and why) towards predictive and prescriptive analytics that forecast what will happen and also provide powerful pointers on how to change the future.

In this ebook, you will learn how:

• Easy-to-use ML tools and consolidated data platforms empower data analysts and bridge the gap between analytics and ML

• Snowflake’s Data Cloud can expand data access and data sharing through a secure ecosystem with access to ready-to-use third-party data

• Data engineering tools remove the burden of data prep for data scientists and make repurposing existing work easy

• New distributed training frameworks offer a superior alternative to Spark, delivering up to 2,000x faster performance

• Rapid advancements in ML libraries, tools, and frameworks demonstrate the need for a solution that future-proofs data science and ML investments

 

SHIFT TOWARDS PREDICTIVE AND PRESCRIPTIVE ANALYTICS CONTINUES

In 2021, the field of data science is poised to finally live up to the high expectations that many organizations have held for years. Over the last 10 years, huge investments have been made in data science and ML, guided by the hope that it would transform the way companies do business. However, many organizations continue to feel challenged to drive real business impact with analytics.

Recent studies lend credence to this common sentiment. A report from MIT Sloan Management Review and Boston Consulting Group found that only 10% of organizations are seeing significant financial benefits from their investments in AI,1 and VentureBeat AI reported that 87% of data science projects never make it into production.2 And in 2019, Gartner pointed to one of the bigger challenges organizations face: “Through 2020, 80% of AI projects will remain alchemy, run by wizards whose talents will not scale in the organization.”

Organizations invest in data science because it promises to bring competitive advantages, but many of the tools and skill sets needed to scale ML have been missing or in short supply. Data scientists continue to be a sought-after and expensive resource, and their valuable efforts tend to be relegated to time-consuming tasks such as data selection and data preparation. Conversely, data analysts are in abundant supply in most companies and already know how to address business problems directly, but they lack the technical background required to make the jump from analytics to data science to build their own ML models.

Remarkably, advancements made in 2020 point to six exciting trends for data science and ML in 2021. New tools and technologies have emerged—and continue to be released every month—that accelerate the work of data scientists while also empowering data analysts to move beyond descriptive analytics and conduct light data science and ML.

Underlying this acceleration is the cloud. Data scientists and data analysts benefit from cloud technologies that provide virtually unlimited amounts of compute resources. In addition, the cloud enables the elimination of data silos by consolidating data lakes, data warehouses, and data marts for fast, secure, and easy data sharing and analysis in a single location.

In short, data is transforming into an actionable asset, and new tools are using that reality to move the needle with ML. As a result, organizations are on the brink of mobilizing data to not only predict the future but to also increase the likelihood of certain outcomes through prescriptive analytics.

Here are six trends that will shape data science in 2021 and continue the evolution of analytics towards ML.

 

TREND #1: EASY-TO-USE ML TOOLS EMPOWER DATA ANALYSTS AND DATA SCIENTISTS

Most organizations employ an abundance of data analysts and a limited number of data scientists, due in large part to the limited supply and high costs associated with data scientists. Since analysts lack the data science expertise required to build ML models, data scientists have become the de facto bottleneck for broadening the use of ML.

However, new and improved ML tools are opening the floodgates on ML by automating the technical aspects of data science. Data analysts are empowered with access to powerful models without needing to manually build them. Specifically, automated machine learning (AutoML) and AI services via APIs are removing the need to manually prepare data and then build and train models.

 

AUTOML

AutoML tools are aptly named: They automate the tasks associated with developing and deploying ML models. Their development is game changing for both data scientists and data analysts. By automating tasks traditionally done by data scientists, AutoML tools enable data analysts to access models through an entirely graphical interface—without requiring a data scientist’s involvement or the need to write code.
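The core loop these tools automate, fitting several candidate models and scoring each to keep the best, can be sketched in a few lines of plain Python. This is a conceptual toy, not any vendor's product; real AutoML tools also automate data preparation, feature engineering, hyperparameter tuning, and deployment:

```python
# Toy sketch of the model-selection loop that AutoML tools automate:
# fit several candidate models, score each, keep the best.

def fit_mean(xs, ys):
    # Baseline model: always predict the mean of the training targets.
    mean = sum(ys) / len(ys)
    return lambda x: mean

def fit_linear(xs, ys):
    # Ordinary least squares for y = a*x + b.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return lambda x: a * x + b

def mse(model, xs, ys):
    # Mean squared error of a model's predictions.
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def auto_select(xs, ys, candidates):
    # Fit every candidate, then return the (name, model) with lowest error.
    models = {name: fit(xs, ys) for name, fit in candidates.items()}
    return min(models.items(), key=lambda kv: mse(kv[1], xs, ys))

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]   # roughly y = 2x
name, model = auto_select(xs, ys, {"mean": fit_mean, "linear": fit_linear})
print(name)   # the linear model wins on this data
```

A production AutoML tool applies the same selection idea across hundreds of model families and hyperparameter settings, which is exactly the busy work data scientists would otherwise do by hand.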

But AutoML isn’t just for analysts. Thanks to its power, AutoML is making a huge difference for data scientists by addressing the busy work that can take up to 80% of their time, according to Harvard Business Review: loading, selecting, preparing, and cleaning data.4 By eliminating these time-consuming data chores, AutoML increases data scientists’ productivity and provides more time to conduct analyses. Additionally, human errors found in manual modeling processes are eliminated, which improves accuracy.

AutoML's one historical flaw was that it was seen as a black box, but that challenge has largely been addressed: AutoML services now provide transparency and explanations for their models, which is key for auditing and detecting bias. For data scientists, AutoML transforms how quickly they can build and test multiple models simultaneously. In 2020, AutoML tools from providers such as DataRobot, Dataiku, and H2O saw significant advancements, and new solutions were introduced, such as Amazon SageMaker Autopilot.

 

AI SERVICES VIA AN API

Another approach growing in popularity is AI services: ready-made models available through APIs. Rather than using their own data to build and train models for common activities, organizations access pre-trained models that accomplish specific tasks. Whether an organization needs natural language processing (NLP), automatic speech recognition (ASR), or image recognition, AI services simply plug into an application through an API, which requires no involvement from a data scientist.

Amazon provides a variety of fully managed AI services, including Amazon Lex, Polly, Rekognition, Forecast, and Translate.5 To illustrate the value, Rekognition allows an image to be sent from an app through the API to Amazon; the AI service then returns a classification and description of what the image is. These types of utilities not only save time and effort, but they also free up data scientists to focus on building and training models that are highly customized to their business, rather than re-creating commonly used services.
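The plug-and-play pattern can be sketched as follows. A real deployment would send the request to a managed endpoint such as an image-recognition API; here a local stub stands in for the remote service, so the application-side pattern is visible without credentials or network calls:

```python
# Sketch of the "AI service via API" pattern: the application sends raw
# input to a pre-trained service and gets structured output back, with no
# model-building on the caller's side. `stub_image_service` is a local
# stand-in for a real managed endpoint, not an actual AWS API.

def stub_image_service(image_bytes: bytes) -> dict:
    # A real service would run a pre-trained vision model on the bytes;
    # the stub just returns a canned response in the same shape.
    return {"labels": [{"name": "Dog", "confidence": 0.97}]}

def describe_image(image_bytes: bytes, service=stub_image_service) -> str:
    """Application code: call the service, keep only confident labels."""
    response = service(image_bytes)
    good = [l["name"] for l in response["labels"] if l["confidence"] > 0.9]
    return ", ".join(good) if good else "unrecognized"

print(describe_image(b"\x89PNG..."))  # -> Dog
```

Swapping the stub for a real client call is the only change the application would need, which is why these services require no data science involvement on the consuming side.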

AutoML tools and AI services lower the barrier to entry for ML, so almost anyone can now access and use data science without requiring an academic background. However, the true power of these tools is unleashed when they are integrated seamlessly with your existing technologies. With Snowflake Partner Connect, organizations can receive faster insights from their data through pre-built integrations between Snowflake and technology partners’ products. Snowflake Partner Connect makes it fast and easy to try new ML tools and services and then adopt those that best meet your business needs.

 

TREND #2: A CONSOLIDATED PLATFORM CLOSES THE GAP BETWEEN ANALYTICS AND ML

Everyone knows data silos exist within and across organizations. However, few realize that these silos also take the form of “analytics silos,” particularly between data scientists and data analysts. These analytics silos have formed as a result of the different ways the two roles work and their respective skill sets. Data silos are just one part of the difference: Data scientists and data analysts use different data (raw versus processed), data sources (data lakes versus warehouses and marts), languages (Python and Spark versus SQL), and tools (ML versus BI).

Much like organizational silos, analytics silos thwart collaboration and integration opportunities between data scientists and data analysts. This situation results in organizations missing out on the combined power of these two teams, which is exponentially stronger than simply the sum of the two parts.

For example, data analysts leverage data to provide key business metrics and answer questions around why something happened. According to Sisu, data analysts’ superpower is speed, and they use it to analyze data sets quickly and work with business stakeholders to uncover potential insights.6 While their goals are to help companies monetize market opportunities and improve competitive advantage, most of the work data analysts do is backward-looking because they lack the data science skills necessary to build predictive ML models. Instead, data analysts rely on BI tools whose dashboards have built-in limitations. While they can use data to understand what has already happened, it’s challenging for data analysts to be proactive and explore data deeply to figure out what will happen and how to influence it.

Conversely, data scientists have the ability to build ML models that not only predict but also influence business outcomes. However, they are not as well versed in the dynamic and fluctuating business environment as data analysts are. Sisu describes data scientists as “narrow-and-deep workers,” and their focus frequently results in organizations trying to focus data science efforts at known problems (often uncovered by data analysts) to maximize their value and contributions rather than potentially wasting time and effort on the unknown.

Snowflake’s Data Cloud provides the tools to help deliver stronger outcomes and scale. Through Snowflake, analytics silos are eliminated. The same consistent, governed metrics and data are available for both analytics and ML through a shared feature store and reuse of data engineering pipelines. When data science insights are shared in Snowflake’s platform, data analysts can access and incorporate them into dashboards and analysis, thus broadening the scope of impact of the models the data science team builds.

 

TREND #3: SNOWFLAKE’S DATA CLOUD EXPANDS ACCESS TO NEW DATA

An IDC report sponsored by Seagate projects that by 2025, 175 zettabytes of data will be created worldwide each year,8 approximately four times the amount produced in 2020, according to the World Economic Forum.9 The explosion of data produced by technologies such as IoT, social media, and mobile devices is opening up vast opportunities for data-driven insights into every area of inquiry.

Of course, it’s virtually impossible for any organization to produce or collect all the data needed to uncover business and competitive trends. Increasingly, the ability to share and join data sets, both within and across organizations, is viewed as the best way to derive more value from data. That’s why data scientists and data analysts are continually on the hunt for more data to supplement their ML models and analyses with external data to improve the accuracy of results.

Snowflake enables secure, governed, compliant, and seamless access to third-party data in three ways:

1. Snowflake’s Data Cloud is an ecosystem where Snowflake customers, partners, data providers, and data service providers connect to their own data and seamlessly share and consume data and data services shared by other users. Underpinned by Snowflake’s platform, the Data Cloud eliminates barriers presented by siloed data and enables organizations to unify and connect to a single copy of data. In addition, the Data Cloud is a seamless way to derive value from rapidly growing commercialized data sets with fast, easy, and governed access.

2. Empowering the Data Cloud is Snowflake Secure Data Sharing, which removes traditional data transfer barriers. With Snowflake, data is generally never copied and transmitted. Instead, users can share live data from its original location. Those granted access simply reference data in a controlled and secure manner without latency or contention from concurrent users. Because changes to data are made to a single version, data remains up to date for all consumers, which ensures data models are always using the latest version.

3. Snowflake Secure Data Sharing is the technology foundation for Snowflake Data Marketplace, which serves as a single location to access live, ready-to-query data. Secure, governed data can be shared with, and received from, an ecosystem of business partners, suppliers, and customers, or from third-party data providers and data service providers. Snowflake Data Marketplace removes the arduous processes involved in locating the right data sets, signing contracts with vendors, and managing the data to make it compatible with internal data. Instead, data scientists and data analysts can source new data with ease. In addition to Snowflake Data Marketplace, organizations can use private data exchanges to share data with trusted partners, suppliers, vendors, and customers through Snowflake Secure Data Sharing.

External data is available and accessible to all Data Cloud users with just a few clicks. Once it’s in the Data Cloud, data is ready to be shared and consumed. There’s no sending of CSV files or manual version control. Data scientists can enrich models with seamless access to almost-unlimited data on any topic, including real-time and evolving circumstances. For example, in 2020, COVID data sets were universally in high demand across all sectors and industries because organizations needed to analyze the impact of the virus on an hour-by-hour basis.
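The "share a reference, not a copy" model described above can be illustrated with a short sketch. Consumers hold read-only views of one live table, so a provider's update is immediately visible to everyone; the class names here are purely illustrative, not Snowflake APIs:

```python
# Sketch of reference-based data sharing: consumers receive a view of the
# provider's single live copy rather than a duplicate, so updates
# propagate with no file transfer or re-sync.

class SharedTable:
    """Provider-owned data; the single source of truth."""
    def __init__(self, rows):
        self._rows = list(rows)

    def update(self, rows):
        self._rows = list(rows)   # one change, seen by every consumer

    def rows(self):
        return tuple(self._rows)  # read-only snapshot

class ConsumerView:
    """A consumer references the live table instead of copying it."""
    def __init__(self, table: SharedTable):
        self._table = table

    def query(self):
        return self._table.rows()

provider = SharedTable([("2021-01-01", 100)])
analyst_a = ConsumerView(provider)
analyst_b = ConsumerView(provider)

provider.update([("2021-01-01", 100), ("2021-01-02", 250)])
# Both consumers now see the new row; neither holds a stale copy.
print(analyst_a.query() == analyst_b.query())  # -> True
```

Contrast this with emailing CSV extracts, where each consumer's copy starts going stale the moment it is sent.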

 

TREND #4: DATA ENGINEERING TOOLS DECREASE DRAIN ON DATA SCIENTISTS’ TIME

Data engineering tasks require an inordinately large amount of a data scientist’s attention. The time commitment for data preparation varies. It can range from 45%, according to the Anaconda “2020 State of Data Science” survey reported by Datanami,10 to 80%, according to a survey conducted by CrowdFlower and reported by Forbes,11 but there is little disagreement among data scientists that collecting, organizing, and cleaning data are the least enjoyable tasks they undertake. Minimizing this burden is paramount not only for keeping data scientists productive and happy but also for broadening access to ML.

Finally, direct integrations between data engineering tools and ML tools are starting to bring these two worlds together. For example, DataRobot acquired Paxata to add data preparation tools alongside its AutoML offering,12 and Alteryx is shifting its focus from data prep towards an automated, assisted ML modeling offering.13 In addition, Amazon SageMaker launched two new services: Data Wrangler,14 to accelerate data prep for ML, and Pipelines,15 as a continuous integration and continuous delivery (CI/CD) service for ML.

Data scientists are also benefiting from “feature stores,” which make it easy to repurpose existing work. For example, once a data scientist has converted raw data into a metric (for example, “cost of goods sold”), this universal metric can be found quickly and used by everyone else for quick analysis against that data. Not only does this practice save data scientists time and effort, but it also reinforces BI metrics, maintains data governance, and ensures there are no discrepancies across the work done by data analysts and data scientists.
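The registry idea at the heart of a feature store can be sketched minimally: a metric such as "cost of goods sold" is defined once under a shared name, then looked up and reused by anyone, so analysts and scientists compute it identically. All names below are illustrative, and a real feature store adds versioning, lineage, and serving infrastructure:

```python
# Minimal sketch of a feature store registry: one agreed definition per
# metric, reused everywhere, so there are no discrepancies between the
# work of data analysts and data scientists.

FEATURES = {}

def register(name):
    """Decorator that records a feature's computation under a shared name."""
    def wrap(fn):
        FEATURES[name] = fn
        return fn
    return wrap

@register("cost_of_goods_sold")
def cogs(row):
    # Defined once; every team reuses this exact formula.
    return (row["beginning_inventory"] + row["purchases"]
            - row["ending_inventory"])

def compute(name, rows):
    fn = FEATURES[name]            # look the feature up by name
    return [fn(r) for r in rows]

rows = [{"beginning_inventory": 500, "purchases": 300,
         "ending_inventory": 200}]
print(compute("cost_of_goods_sold", rows))  # -> [600]
```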

Data engineering tools not only enable reuse of prepared data by anyone within an organization, but they can leverage the scalability and efficiency of Snowflake to provide the processing. Snowflake works with various partners that provide feature store solutions to ensure data scientists and data analysts can reuse consistent features.

 

TREND #5: NEW GENERATION OF DISTRIBUTED TRAINING FRAMEWORKS OFFERS COMPELLING ALTERNATIVE TO SPARK

Data scientists are always looking for strategic ways to inject efficiency into training and deploying models. Recently, a new generation of distributed training engines has surfaced that delivers on that goal by providing tremendous speed and performance gains over Apache Spark.

One approach that’s gaining attention is Dask, a distributed training framework built in Python.16 Dask is designed to enable data scientists to improve model accuracy faster. Data scientists can do everything in Python end to end, which means they no longer need to convert their code to execute in Spark. The result is reduced complexity and increased efficiency.

Another open source Python framework is RAPIDS, which is built on top of Dask.17 RAPIDS optimizes compute time and speed by providing data pipelines and executing data science code entirely on graphics processing units (GPUs) rather than CPUs. Saturn Cloud recently compared RAPIDS to Spark and discovered that model training with RAPIDS took one second on a 20-node GPU cluster, while Spark took 37 minutes on a similarly priced 20-node CPU cluster. Saturn Cloud concluded that RAPIDS enables 2,000x faster processing using GPUs while costing a fraction of the price.
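The quoted figure follows directly from the reported times: 37 minutes on the CPU cluster is 2,220 seconds, versus 1 second on the GPU cluster.

```python
# Arithmetic behind Saturn Cloud's "2,000x faster" claim.
cpu_seconds = 37 * 60   # 37 minutes on the 20-node CPU cluster
gpu_seconds = 1         # reported RAPIDS time on the 20-node GPU cluster
speedup = cpu_seconds / gpu_seconds
print(speedup)          # 2220.0, i.e., roughly 2,000x
```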

The impact of these distributed training frameworks is already being seen in the real world. Walmart uses RAPIDS with Dask and XGBoost (an ML algorithm) for its data analytics and ML, and NVIDIA reports that Walmart has found that “one GPU server requires only four percent of the time needed to run the same forecasting models vs a 20-node CPU server.”19 That translates to Walmart running models in four hours that previously took several weeks using CPUs.

While organizations are thinking strategically about training frameworks, some have run into barriers in the past. Today, new technologies are unlocking what’s possible and demonstrating how much faster things can be when everything is done directly with Python. By eliminating the need to convert models into Spark, organizations are reducing complexity and increasing efficiency. And it’s easy to try different distributed training frameworks on Snowflake’s platform to find what works best.

 

TREND #6: CONTINUOUS RELEASES PROVIDE NEW OPTIONS FOR ML LIBRARIES, TOOLS, AND FRAMEWORKS

The field of data science is evolving rapidly. Not only are new ML and AI developments released every month, but new startups, tools, and solutions emerge regularly. With the rapid pace of innovation occurring in this space, it’s imperative not to get locked into using a single tool.

That’s why it’s important to select a platform that is vendor-, framework-, and algorithm-agnostic. By choosing a future-proof platform, you ensure that upcoming ML tools will continue to work seamlessly with the platform you have. After all, the last thing you want to do is re-platform in order to use the next generation of tools.

What makes Snowflake’s platform unique is its modern architecture. Designed with separate, but logically integrated, compute and storage, Snowflake eliminates the manual cluster-building efforts that other systems must perform to make separate layers work together. As a result, Snowflake provides a multi-cluster, shared data architecture that provides nearly infinite scalability, instant elasticity, and extremely high levels of concurrency to power the Data Cloud.

In addition to the underlying architecture, Snowflake supports data science in a variety of ways.

• Snowflake’s External Functions allow any third-party, hosted, or custom ML service to be accessed easily using SQL.

• Recognizing that various teams may prefer languages other than SQL, Snowpark extends language support to Java, Scala, and, soon, Python. Snowpark allows data scientists to write code in their language of choice using familiar programming concepts, such as DataFrames, and then execute data preparation and workloads directly on Snowflake.

• Java user-defined functions (UDFs) are supported to enable trained models to run within Snowflake. That means models built and trained in an ML partner’s technology can be brought into and run directly on Snowflake resources.

 

ACCELERATE YOUR MACHINE LEARNING IN 2021

It’s remarkable how quickly data science has become mainstream. In the last 10 years, companies have shifted their focus from reporting and historical analysis to conducting data science with advanced mathematical models and ML. The cloud changed everything. With the ability to inexpensively collect and store more and more data came the need to build data models powered by ML.

Today, a modern platform is a necessity if you want to analyze and share data quickly and scalably with security and governance built in. Snowflake provides an architecture that enables data consolidation, efficient data preparation, and an extensive partner ecosystem. Your data is mobilized, which allows you to benefit immediately from new trends in data science and ML.

With Snowflake, limitations on data science are removed. Are you ready to accelerate your machine learning?

 
