
Wednesday, February 16, 2022

A Review on Data Science Tools


Abstract

Computer programming is an important part of data science. An individual with sufficient knowledge of programming logic, functions, and loops has a higher chance of succeeding in data science. Python and R are the main programming languages used in data science; they help one create machine learning models which can find patterns and make predictions based on the data at hand. However, not everyone has studied computer programming: many people want to venture into data science but do not have programming knowledge. There are, however, a number of software tools available for use in data science. They provide the user with a Graphical User Interface (GUI) through which they can follow a series of steps to create machine learning models even without sufficient knowledge of programming and algorithms. This paper gives a clear idea about the different software tools which are used in data science.

 

1. Introduction

Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. Data science is the same concept as data mining and big data: "use the most powerful hardware, the most powerful programming systems, and the most efficient algorithms to solve problems".

Data science is a "concept to unify statistics, data analysis, machine learning and their related methods" in order to "understand and analyse actual phenomena" with data. It employs techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, and information science. Turing award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge. In 2015, the American Statistical Association identified database management, statistics and machine learning, and distributed and parallel systems as the three emerging foundational professional communities.

 


In 2012, when Harvard Business Review called it "The Sexiest Job of the 21st Century", the term "data science" became a buzzword. It is now often used interchangeably with earlier concepts like business analytics, business intelligence, predictive modelling, and statistics.

 


Even the suggestion that data science is sexy was a paraphrase of Hans Rosling, featured in a 2011 BBC documentary with the quote, "Statistics is now the sexiest subject around" [7]. Nate Silver referred to data science as a sexed-up term for statistics. In many cases, earlier approaches and solutions are now simply rebranded as "data science" to be more attractive, which can dilute the term beyond usefulness. While many university programs now offer a data science degree, there exists no consensus on a definition or suitable curriculum contents. To its discredit, however, many data-science and big-data projects fail to deliver useful results, often as a result of poor management and utilization of resources.

 

2. Tools

Data scientists are inquisitive and often seek out new tools that help them find answers. They also need to be proficient in using the tools of the trade, even though there are dozens upon dozens of them [9]. Overall, data scientists should have a working knowledge of statistical programming languages for constructing data processing systems, databases, and visualization tools. Many in the field also deem a knowledge of programming an integral part of data science; however, not all aspiring data scientists study programming, so it is helpful to be aware of tools that circumvent programming and include a user-friendly graphical interface, so that a data scientist's knowledge of algorithms is enough to help them build predictive models [10].

 


With everything on a data scientist’s plate, you don’t have time to search for the tools of the trade that can help you do your work. That’s why I have rounded up tools that aid in data visualization, algorithms, statistical programming languages, and databases.

Data science tools fall into two broad types: those for users with programming knowledge and those for business users. Tools aimed at business users automate the analysis [11].

 

2.1 Classification of Data Science Software

 


1) RapidMiner

Price: A free trial is available for 30 days. RapidMiner Studio price starts at $2500 per user/month. RapidMiner Server price starts at $15000 per year. RapidMiner Radoop is free for a single user. Its enterprise plan is for $15000 per year.

 

RapidMiner is a tool for the complete life-cycle of prediction modelling. It has all the functionalities for data preparation, model building, validation, and deployment. It provides a GUI to connect the predefined blocks [12-13].

 

Features:  

· RapidMiner Studio is for data preparation, visualization, and statistical modelling.

·  RapidMiner Server provides central repositories.

·  RapidMiner Radoop is for implementing big-data analytics functionalities.

·  RapidMiner Cloud is a cloud-based repository.

 

2) DataRobot

Price: Contact the company for detailed pricing information.

 


DataRobot is a platform for automated machine learning. It can be used by data scientists, executives, software engineers, and IT professionals [8,11].

 

Features:  

· It provides an easy deployment process.

·  It has a Python SDK and APIs.

·  It allows parallel processing.

·  Model Optimization.

 

3) Apache Hadoop

Apache Hadoop is an open-source framework. Simple programming models created with Apache Hadoop can perform distributed processing of large data sets across clusters of computers (a minimal streaming sketch follows the feature list below).

 

Price: It is available for free.

Features:

· It is a scalable platform.

·  Failures can be detected and handled at the application layer.

·  It has many modules like Hadoop Common, HDFS, Hadoop MapReduce, Hadoop Ozone, and Hadoop YARN.
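As an illustration of the "simple programming models" mentioned above, below is a minimal word-count sketch for Hadoop Streaming written in Python. The script names (mapper.py, reducer.py) and any input paths are hypothetical; in practice the two parts are saved as separate scripts and submitted with the hadoop-streaming jar.

# mapper.py (hypothetical file name) -- Hadoop Streaming word-count mapper.
# Reads raw text lines on stdin and emits "word<TAB>1" pairs on stdout.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py (hypothetical file name) -- sums the counts for each word.
# Hadoop Streaming sorts mapper output by key before calling the reducer.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")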

 


4) Trifacta

Price: Trifacta has three pricing plans, i.e. Wrangler, Wrangler Pro, and Wrangler Enterprise. For the Wrangler plan, you can sign up for free [7,10]. You will have to contact the company to know more about the pricing details of the other two plans.

Trifacta provides three products for data wrangling and data preparation. It can be used by individuals, teams, and organizations.

 


Features:

· Trifacta Wrangler will help you explore, transform, clean, and join desktop files.

· Trifacta Wrangler Pro is an advanced self-service platform for data preparation.

·  Trifacta Wrangler Enterprise is for empowering the analyst team.

 

5) Alteryx

Price: Alteryx Designer is available for $5195 per user per year. Alteryx Server is for $58500 per year. For both plans, additional capabilities are available at an additional cost [13-14].

 



Alteryx provides a platform to discover, prep, and analyse the data. It will also help you to find deeper insights by deploying and sharing the analytics at scale.

 

Features:

· It provides the features to discover the data and collaborate across the organization.

·  It has functionalities to prepare and analyse the model.

·  The platform will allow you to centrally manage users, workflows, and data assets.

·  It will allow you to embed R, Python, and Alteryx models into your processes.

 

6) KNIME

Price: It is available for free.

 


KNIME helps data scientists blend tools and data types. It is an open-source platform. It allows you to use the tools of your choice and extend them with additional capabilities [15-17].

 

Features:

· It is very useful for automating repetitive and time-consuming tasks.

·  It can be extended to Apache Spark and big-data platforms.

·  It can work with many data sources and different types of platforms.

 

7) Excel

Price: Office 365 for personal use: $69.99 per year, Office 365 Home: $99.99 per year, Office Home & Student: $149.99 per year. Office 365 Business is for $8.25 per user per month. Office 365 Business Premium is at $12.50 per user per month. Office 365 Business Essentials is at $5 per user per month [16].

 


Excel can be used as a tool for data science. It is an easy-to-use tool for non-technical users and is good for analysing data.

 

Features:  

· It has good features for organizing and summarizing the data.

·  It will allow you to sort and filter the data.

·  It has conditional formatting features.

 

8) Matlab

Price: Matlab for an individual user is at $2150 for a perpetual license & $860 for an annual license. A free trial is available for this plan. It is also available for Students as well as for personal use [12].

Matlab is a numerical computing environment and programming language with built-in tools for data analysis, visualization, and machine learning.

Features:

· Matlab has interactive apps which will show you the working of different algorithms on your data.  

· It has the ability to scale.

·  Matlab algorithms can be directly converted to C/C++, HDL, and CUDA code.

 


9) Java

Price: Free

 


Java is an object-oriented programming language. Compiled Java code can be run on any Java-supported platform without recompiling. Java is simple, object-oriented, architecture-neutral, platform-independent, portable, multi-threaded, and secure [18-20].

 

Features:

The following features show why Java is used for data science:

· Java provides a good number of tools and libraries that are useful for machine learning and data science.  

· Java 8 with Lambdas: With this, you can develop large data science projects [2].

·  Scala, which runs on the Java Virtual Machine, also provides support for data science.

 

10) Python

Price: Free

 



Python is a high-level programming language and provides a large standard library. It supports object-oriented, functional, and procedural programming, dynamic typing, and automatic memory management [19-21]. A minimal workflow sketch follows the feature list below.

 

Features:

· It is used by data scientists as it provides a good number of useful packages to download for free.  

· Python is extensible.

·  It provides free data analysis libraries.
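As referenced above, here is a minimal sketch of a typical Python data-science workflow using the freely available pandas and scikit-learn packages. The CSV file name and column names are hypothetical placeholders, and the feature columns are assumed to be numeric.

# A toy churn-prediction workflow: load a table, split it, fit a model, score it.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("customers.csv")        # hypothetical input file
X = df.drop(columns=["churned"])         # feature columns (assumed numeric)
y = df["churned"]                        # label to predict

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))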

 

2.2 Additional Data Science Tools

11) R

R is a programming language and can be used on UNIX platforms, Windows, and Mac OS [20].

 

12) SQL

This domain-specific language is used for managing data held in relational database management systems (RDBMS) through programming.
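For illustration, the short sketch below runs standard SQL from Python against a local SQLite database using the built-in sqlite3 module; the database, table, and column names are invented for the example.

# Create a small table, insert rows, and aggregate with a SQL GROUP BY query.
import sqlite3

conn = sqlite3.connect("sales.db")       # hypothetical database file
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("north", 120.0), ("south", 75.5), ("north", 60.0)])
conn.commit()

for region, total in cur.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(region, total)
conn.close()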

 

13) Tableau

Tableau can be used by individuals as well as teams and organizations. It can work with any database. It is easy to use because of its drag-and-drop functionality.

 

14) Cloud Dataflow

Cloud Dataflow is a fully managed Google Cloud service for stream and batch processing of data. It can transform and enrich data in both stream and batch mode.
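Cloud Dataflow pipelines are typically written with the Apache Beam SDK. The sketch below shows a minimal batch word-count pipeline in the Beam Python SDK; the file names are placeholders, and running it on Dataflow rather than locally would additionally require the DataflowRunner and Google Cloud project options.

# A minimal Beam batch pipeline: read text, split into words, count, write results.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (pipeline
     | "Read" >> beam.io.ReadFromText("input.txt")        # placeholder input path
     | "Split" >> beam.FlatMap(lambda line: line.split())
     | "Pair" >> beam.Map(lambda word: (word, 1))
     | "Count" >> beam.CombinePerKey(sum)
     | "Format" >> beam.MapTuple(lambda word, count: f"{word}: {count}")
     | "Write" >> beam.io.WriteToText("counts"))          # placeholder output prefix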

 

15) Kubernetes

Kubernetes is an open-source tool used to automate the deployment, scaling, and management of containerized applications.

 

3. Applications

Data science is a subject that arose primarily from necessity, in the context of real-world applications instead of as a research domain. Over the years, it has evolved from being used in the relatively narrow field of statistics and analytics to being a universal presence in all areas of science and industry. In this section, we look at some of the principal areas of applications and research where data science is currently used and is at the forefront of innovation.

 

a. Business Analytics – Collecting data about the past and present performance of a business can provide insight into the functioning of the business and help drive decision-making processes and build predictive models to forecast future performance. Some scientists have argued that data science is nothing more than a new word for business analytics, which was a meteorically rising field a few years ago, only to be replaced by the new buzzword data science [21-22]. Whether or not the two fields can be considered to be mutually independent, there is no doubt that data science is in universal use in the field of business analytics.

 

b. Prediction – Large amounts of data collected and analysed can be used to identify patterns in data, which can in turn be used to build predictive models. This is the basis of the field of machine learning, where knowledge is discovered using induction algorithms and other algorithms that are said to "learn". Machine learning techniques are widely used to build predictive models in numerous fields.
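As a toy illustration of such induction, the sketch below fits a linear model to a handful of invented past observations with scikit-learn and uses it to forecast an unseen case.

# Learn a simple linear relationship from example data and predict a new point.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # e.g. advertising spend (made-up values)
y = np.array([1.9, 4.1, 6.2, 7.8, 10.1])  # e.g. observed sales (made-up values)

model = LinearRegression().fit(X, y)
print(model.predict([[6]]))               # forecast for an unseen input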

 

c. Security – Data collected from user logs is used to detect fraud using data science. Patterns detected in user activity can be used to isolate cases of fraud and malicious insiders. Banks and other financial institutions chiefly use data mining and machine learning algorithms to prevent cases of fraud [7,15].
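A common pattern here is unsupervised anomaly detection. The sketch below flags an unusually large transaction among invented amounts using scikit-learn's IsolationForest; the data and the contamination setting are illustrative only.

# Fit an anomaly detector on transaction amounts and flag likely outliers.
import numpy as np
from sklearn.ensemble import IsolationForest

amounts = np.array([[25.0], [30.5], [27.8], [22.1], [950.0], [29.3]])  # invented data
detector = IsolationForest(contamination=0.2, random_state=0).fit(amounts)
print(detector.predict(amounts))  # -1 marks likely outliers, 1 marks normal points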

 

d. Computer Vision – Data from image and video analysis is used to implement computer vision, which is the science of making computers “see”, using image data and learning algorithms to acquire and analyse images and take decisions accordingly [23-24]. This is used in robotics, autonomous vehicles and human-computer interaction applications.
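As a minimal example of acquiring and analysing image data, the sketch below loads an image with the OpenCV library, converts it to grayscale, and extracts edges; the file names are placeholders.

# Basic image-analysis pipeline: load, convert to grayscale, detect edges, save.
import cv2

image = cv2.imread("frame.jpg")                        # placeholder input image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, threshold1=100, threshold2=200)
cv2.imwrite("edges.jpg", edges)                        # placeholder output image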

 

e. Natural Language Processing – Modern NLP techniques use huge amounts of textual data from corpora of documents to statistically model linguistic data, and use these models to achieve tasks like machine translation, parsing, natural language generation and sentiment analysis [4,10].
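The sketch below illustrates the statistical flavour of such modelling with a tiny bag-of-words sentiment classifier built in scikit-learn; the corpus and labels are invented for the example.

# Train a bag-of-words sentiment classifier on a toy corpus and score a new text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["great product, loved it", "terrible service", "very happy", "awful experience"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["what a great experience"]))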

 


f. Bioinformatics – Bioinformatics is a rapidly growing area where computers and data are used to understand biological data, such as genetics and genomics. These are used to better understand the basis of diseases, desirable genetic properties and other biological properties. As pointed out by Michael Walker – “Next-generation genomic technologies allow data scientists to drastically increase the amount of genomic data collected on large study populations [9, 24]. When combined with new informatics approaches that integrate many kinds of data with genomic data in disease research, we will better understand the genetic bases of drug response and disease.”

 

g. Science and Research – Scientific experiments such as the well-known Large Hadron Collider project generate data from millions of sensors and their data have to be analysed to draw meaningful conclusions. Astronomical data from modern telescopes and climatic data stored by the NASA Centre for Climate Simulation are other examples of data science being used where the volume of data is so large that it tends towards the new field of Big Data.

 

h. Revenue Management - Real-time revenue management is also greatly aided by proficient data scientists. In the past, revenue management systems were hindered by a dearth of data points. Data science is also used in the retail and gaming industries. As Jian Wang defines it: "Revenue management is a methodology to maximize an enterprise's total revenue by selling the right product to the right customer at the right price at the right time through the right channel". Data scientists now have the ability to tap into a constant flow of real-time pricing data and adjust their offers accordingly. It is now possible to estimate the most beneficial type of business to nurture at a given time and how much profit can be expected within a certain time span [12, 20, 22].
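As a toy numeric illustration of choosing "the right price", the sketch below assumes a simple linear demand curve, demand(p) = 1000 - 20p, and sweeps candidate prices to find the revenue-maximising one; the demand model and price range are invented.

# Revenue under the assumed demand curve: revenue(p) = p * max(1000 - 20*p, 0).
prices = [5 + 0.5 * i for i in range(60)]          # candidate prices 5.0 .. 34.5
best_price, best_revenue = max(
    ((p, p * max(1000 - 20 * p, 0)) for p in prices),
    key=lambda pair: pair[1])
print(best_price, best_revenue)                    # peaks at p = 25.0 for this model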

 

i. Government - Data science is also used in governmental directorates to prevent waste, fraud and abuse, combat cyber-attacks and safeguard sensitive information, use business intelligence to make better financial decisions, improve defence systems and protect soldiers on the ground. In recent times most governments have acknowledged the fact that data science models have great utility for a variety of missions [1, 8].

 

The use of data science as a quantitative approach to turning information into something valuable has been trending for quite some time. The desire for "the statistician that can code" or "the programmer that knows stats" has arisen from the need to use data efficiently by grouping it according to relevance or importance and mining it for information.

 

4. Future Scope

The future will certainly be crowded with people trying to apply data science to every kind of problem, perhaps even overusing it. Still, we are likely to see some truly impressive applications of data science for ordinary users beyond the familiar online ones (recommendations, ad targeting, etc.). The skills needed for visualization, for client engagement, and for engineering scalable algorithms are all quite different. It would be ideal if one person could perform all of these at peak level; however, if demand is robust enough, companies will start accepting a diversification of roles and building teams with complementary skills rather than expecting one person to cover all bases.

 


Service customization can be achieved with data science; one can achieve person-level customization in almost any kind of service, such as healthcare, insurance, public services, and banking. It can also support policy making: with the availability of fine-grained, geography-level information on natural resources (water bodies, mineral deposits, land type and quality, and so on), on man-made assets (roads, rail lines, airports, public offices and infrastructure), and on citizens, their attributes, and their consumption patterns of products and services, governments can make their policy making extremely customized, efficient, smart, and responsive to change. Knowledge creates knowledge and makes future tasks easier: with more data, the process of analysis and implementation becomes more efficient. Fields such as materials science, drug discovery, quantum mechanics, neuroscience, and nanotechnology have greatly benefited from changes in the way studies are done, and data analytics has proved to be far more fruitful than many other approaches. The surge in big data, analytics, and cognitive computing methodologies will provide decision support and automation to people, and awareness and intelligence to machines. These advances can be used to make both people and things smarter.

 

5. Conclusion

The analysis of big data requires traditional tools like SQL, analytical workbenches, and data analysis and visualization languages like R. These tools can be used in the various fields where data analytics is required. Many more tools have been introduced in the market, and existing products are under constant improvement. The demand for better analytics tools is constantly increasing and will only grow further in the future.

 

References

[1] Dhar, V. (2013). "Data science and prediction". Communications of the ACM. 56(12): 64–73. doi:10.1145/2500499.

[2] Jeff Leek (12 December 2013). "The key word in "Data Science" is not Data, it is Science". Simply Statistics.

[3] Hayashi, Chikio (1 January 1998). "What is Data Science? Fundamental Concepts and a Heuristic Example". In Hayashi, Chikio; Yajima, Keiji; Bock, Hans-Hermann; Ohsumi, Noboru; Tanaka, Yutaka; Baba, Yasumasa (eds.). Data Science, Classification, and Related Methods. Studies in Classification, Data Analysis, and Knowledge Organization. Springer Japan. pp. 40–51. doi:10.1007/978-4-431-65950-1_3. ISBN 9784431702085.

[4] https://www.edureka.co/blog/what-is-data-science/

[5] Davenport, Thomas H. (January 1, 2006). "Competing on Analytics". Harvard Business Review

[6] Glossary of Terms. Machine Learning - Special issue on applications of machine learning and the knowledge discovery process archive. Volume 30 Issue 2-3, Feb/March, 1998. Pages 271-274

[7] Bolton, R. & Hand, D. (2002). Statistical Fraud Detection: A Review (With Discussion). Statistical Science 17(3): 235–255.

[8] Neural data mining for credit card fraud detection. Brause, R.; Langsdorf, T.; Hepp, M. Tools with Artificial Intelligence, 1999. Proceedings. 11th IEEE International Conference on. Publication Year: 1999, Pages: 103-106

[9] Reinhard Klette (2014). Concise Computer Vision. Springer. ISBN 978-1-4471-6320-6.

[10] Hutchins, W. John; Somers, Harold L. (1992). An Introduction to Machine Translation. London: Academic Press. ISBN 0-12-362830-X.

[11] Akshi Kumar, Teeja Mary Sebastian. Sentiment Analysis on Twitter. IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 4, No 3, July 2012. ISSN (Online): 1694-0814

[12] Raul Isea. The Present-Day Meaning of the Word Bioinformatics, Global Journal of Advanced Research, 2015. Vol-2, Issue-1 PP. 70-73. ISSN: 23945788.

[13] Brumfiel, Geoff (19 January 2011). "High-energy physics: Down the petabyte highway". Nature 469. pp. 282–83. doi:10.1038/469282a

[14] Matthew Francis. Future telescope array drives development of exabyte processing. http://arstechnica.com/science/2012/04/future-telescope-array-drives-development-of-exabyte-processing/

[15] "Supercomputing the Climate: NASA's Big Data Mission". CSC World. Computer Sciences Corporation. http://www.csc.com/cscworld/publications/81769/81773- supercomputing_the_climate_nasa_s_big_data_mission 30. Hype Cycle for Emerging Technologies, 2013

[16] https://www.researchgate.net/publication/281405146_Challenges_in_Data_Science_A_Comprehensive_Study_on_Application_and_Future_Trends_1_2

[17] http://ijarcsse.com/Before_August_2017/docs/papers/Volume_6/2_February2016/V6I2-0230.pdf

[18] http://ijsrcseit.com/paper/CSEIT1833744.pdf

[19] https://en.wikipedia.org/wiki/Data_science

[20] Sharma, G., & Kumar, A. (2017). Dynamic range normal bisector localization algorithm in wireless sensor networks. Wireless Personal Communications, 97(3), 4529- 4549.

[21] G. Sharma and A. Kumar, “Modified Energy-Efficient Range-Free Localization Using Teaching–Learning-Based Optimization for Wireless Sensor Networks,” IETE Journal of Research, vol. 64, no. 1, pp. 124–138, Jul. 2017.

[22] G. Sharma and A. Kumar, “Fuzzy logic-based 3D localization in wireless sensor networks using invasive weed and bacterial foraging optimization,” Telecommunication Systems, vol. 67, no. 2, pp. 149–162, May 2017.

[23] Jaswanth, G., Kaur, A., & Sharma, G. (2016). Design of Energy Efficient Cluster Head Routing Algorithm for Heterogeneous Wireless Sensor Networks. International Journal of Computer Applications, 148(3).

[24] Sharma, G., & Rajesh, A. (2018). Localization in Wireless Sensor Networks Using TLBO. i-Manager's Journal on Wireless Communication Networks, 7(3), 32.

 
