Industry Needs: DATA SCIENCE TECHNOLOGY STACK

VERMEULEN-KRENNWALLNER-HILLMAN-CLARK

• The Vermeulen-Krennwallner-Hillman-Clark Group (VKHCG) is a small international group of companies. It consists of four subcomponents: 1. Vermeulen PLC, 2. Krennwallner AG, 3. Hillman Ltd, 4. Clark Ltd.

1. Vermeulen PLC:

• Vermeulen PLC is a data processing company that processes all the data within the group's companies.

• This is the company for which most of the data engineers and data scientists are hired.

• The company supplies data science tools; networks, servers, and communication systems; internal and external websites; decision science; and process automation.

2. Krennwallner AG:

• Krennwallner AG is an advertising and media company that prepares the advertising and media content its customers require.

• Krennwallner supplies advertising on billboards, as well as advertising and content management for online delivery.

• Using the records available on the internet for media streams, it analyzes which media streams customers watch, how many times they watch them, and which content is watched most.

• Using surveys, it chooses content for the billboards and works out how many times customers visit each channel.
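The kind of analysis described above can be sketched in a few lines of Python. The viewing records below are made-up sample data and the stream names are hypothetical; the sketch only shows how counting views per stream identifies the most-watched content.

```python
from collections import Counter

# Hypothetical viewing records: one (customer, media stream) pair per view.
views = [
    ("customer_1", "news"),
    ("customer_2", "sports"),
    ("customer_1", "sports"),
    ("customer_3", "sports"),
    ("customer_2", "news"),
    ("customer_3", "movies"),
]

# Count how many times each media stream was watched.
views_per_stream = Counter(stream for _, stream in views)

# The most-watched stream and its view count.
most_watched, count = views_per_stream.most_common(1)[0]
print(most_watched, count)
```

With real data, the list of records would come from the collected media stream logs rather than being typed in by hand.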

3. Hillman Ltd:

• Hillman Ltd is a logistics and supply chain company that serves businesses worldwide.

• Its services include client warehousing, international shipping, and home-to-home logistics.

4. Clark Ltd:

• Clark Ltd is the financial company that processes all the financial data the group requires. Its services include support money, venture capital planning, and investing money in the share market.

Scala:

• Scala is a general-purpose programming language that supports functional programming and has a strong static type system.

• Many data science projects and frameworks are built with Scala, including Apache Spark, Kafka, and Akka, which are covered in this post.

• Scala combines the features of object-oriented and functional languages. It runs on the Java virtual machine (JVM) and can interoperate with Java code.

• The types and behavior of objects are described by classes, and a class can be extended by another class that inherits its properties.

• Scala supports higher-order functions: a function can take another function as an argument or return a function as its result.
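Class extension and higher-order functions are not unique to Scala. As a rough illustration, written in Python rather than Scala, the sketch below shows a class extending another class and a function that takes another function as an argument. The names `Shape`, `Square`, and `apply_twice` are invented for this example.

```python
class Shape:
    """Base class: describes what every shape can do."""

    def __init__(self, name):
        self.name = name

    def area(self):
        raise NotImplementedError


class Square(Shape):
    """Extends Shape, reusing its name attribute and supplying area()."""

    def __init__(self, side):
        super().__init__("square")
        self.side = side

    def area(self):
        return self.side * self.side


def apply_twice(func, value):
    """Higher-order function: receives another function and calls it twice."""
    return func(func(value))


square = Square(3)
doubled_twice = apply_twice(lambda x: x * 2, 5)
print(square.name, square.area(), doubled_twice)
```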

Apache Spark:

• Apache Spark is an open source cluster computing framework. Open source means that the source code is freely available: anyone can download it from the internet, use it, and modify it.

• Apache Spark was developed at the AMPLab of the University of California, Berkeley. The code base was later donated to the Apache Software Foundation, which continues to maintain and improve it so that it stays effective, reliable, and portable across platforms.

• Apache Spark provides an interface that lets programmers and developers work with a whole cluster and process data in parallel, which suits the work of data scientists and data engineers.

• Apache Spark can process many types of data held in repositories such as the Hadoop Distributed File System (HDFS) and NoSQL databases.

• IBM hires many data scientists and data engineers who know the Apache Spark project well, so that it can innovate more easily and deliver new features and changes.

Apache Mesos:

• Apache Mesos is an open source cluster manager that was developed at the University of California, Berkeley.

• It provides resource isolation and resource sharing across distributed applications.

• Mesos shares resources in a fine-grained manner, which improves how well the cluster is utilized.

• Mesosphere Enterprise DC/OS is the enterprise version of Mesos, and it is particularly suited to running Kafka, Cassandra, Spark, and Akka.

Akka:

• Akka is an actor-based, message-driven runtime for building concurrent, elastic, and resilient processes.

• An actor can be controlled and limited to perform only its intended task. Akka is an open source library, or toolkit.

• Akka is used to create distributed, fault-tolerant applications, and the library can be integrated into programs written in languages that run on the Java virtual machine (JVM).

• Akka is written in Scala and integrates with the Scala programming language. It spares developers from dealing with explicit locking and thread management.
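The actor idea can be illustrated without Akka itself. The following is a minimal sketch in plain Python, not Akka's API: the `CounterActor` class and its message names are hypothetical. The actor owns its state and a mailbox, and other code changes that state only by sending messages, so the caller never takes a lock.

```python
import queue
import threading


class CounterActor:
    """A toy actor: private state plus a mailbox processed by one thread."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0  # touched only by the actor's own thread
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message, reply_to=None):
        """Deliver a message; senders never touch the state directly."""
        self._mailbox.put((message, reply_to))

    def _run(self):
        # Messages are handled one at a time, in arrival order.
        while True:
            message, reply_to = self._mailbox.get()
            if message == "increment":
                self._count += 1
            elif message == "get":
                reply_to.put(self._count)
            elif message == "stop":
                break

    def join(self):
        self._thread.join()


actor = CounterActor()
for _ in range(100):
    actor.send("increment")

replies = queue.Queue()
actor.send("get", reply_to=replies)
total = replies.get()
actor.send("stop")
actor.join()
print(total)
```

Because the actor handles one message at a time, the counter needs no lock even though it lives on a different thread from the sender.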

Apache Cassandra:

• Apache Cassandra is an open source distributed database system designed for storing and managing large amounts of data across commodity servers.

• Cassandra can be used as a real-time operational data store for online transactional applications.

• Cassandra is designed as a peer-to-peer system in which all nodes are equal, instead of using master or name nodes, so that there is no single point of failure.

• Apache Cassandra is a highly scalable, high-performance distributed database designed to handle large amounts of data, and it is a type of NoSQL database.

• A NoSQL database provides a mechanism to store and retrieve data that differs from the tabular relations used in relational databases.

Kafka:

• Kafka is a high-throughput messaging backbone that enables communication between data processing entities. It is written in Java and Scala.

• Apache Kafka is a highly scalable, reliable, fast, and distributed system. It is suitable for both offline and online message consumption.

• Kafka messages are stored on disk and replicated within the cluster to prevent data loss.

• Kafka is distributed, partitioned, replicated, and fault tolerant, which makes it reliable.

• The Kafka messaging system scales easily without downtime. It has high throughput for both publishing and subscribing to messages, and it can store terabytes of data.
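To make "partitioned" and "replicated" concrete, here is a toy in-memory model in Python. It is not Kafka's client API: `ToyTopic`, its methods, the CRC32 key hashing, and the partition and replica counts are all assumptions made for illustration. Each message key maps to one partition, every partition is an append-only log copied to several replicas, and consumers read by offset.

```python
import zlib


class ToyTopic:
    """In-memory stand-in for a partitioned, replicated message log."""

    def __init__(self, partitions=3, replicas=2):
        # logs[r][p] is replica r's copy of partition p.
        self.logs = [[[] for _ in range(partitions)] for _ in range(replicas)]
        self.partitions = partitions

    def publish(self, key, value):
        """Append the message to every replica of the key's partition."""
        partition = zlib.crc32(key.encode("utf-8")) % self.partitions
        for replica in self.logs:
            replica[partition].append((key, value))
        offset = len(self.logs[0][partition]) - 1
        return partition, offset

    def read(self, partition, offset, replica=0):
        """Read messages from an offset; reading does not delete them."""
        return self.logs[replica][partition][offset:]


topic = ToyTopic()
p1, o1 = topic.publish("sensor-a", "21.5")
p2, o2 = topic.publish("sensor-a", "21.7")

# The same key always lands in the same partition, so order is preserved.
messages = topic.read(p1, 0)

# Losing replica 0 loses no data: replica 1 holds an identical copy.
backup = topic.read(p1, 0, replica=1)
print(messages == backup)
```

A real cluster spreads the replicas over different machines; here they are only separate lists in one process.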

Python:

• Python is a programming language that can be used on a server to create web applications.

• Python can be used for web development, mathematics, and software development, and it can connect to databases to create and modify data.

• Python can handle large amounts of data and perform complex tasks on that data.

• Python is reliable, portable, and flexible, and it works on different platforms such as Windows, macOS, and Linux.

• Because Python can be installed on all of these operating systems, you can learn it on whichever platform you already use.

You can gain more knowledge by installing Python and working with it on all three platforms for data science and data engineering.

• To install Python and the packaging tools needed for data science on Ubuntu, run the following command:

• sudo apt-get install python3 python3-pip python3-setuptools

• On Linux distributions that use yum, such as CentOS or Red Hat Enterprise Linux, run the following command:

• sudo yum install python3 python3-pip python3-setuptools

• On Windows, download the installer from the following page and run it:

• https://www.python.org/downloads/

Python Libraries:

• A Python library is a collection of functions and methods that lets you perform many actions without writing the code yourself.
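As a small illustration of that point, the sketch below computes an average twice: once with hand-written code and once with the `statistics` module from Python's standard library. The sample values are arbitrary.

```python
import statistics

values = [4, 8, 15, 16, 23, 42]

# Hand-written version: we implement the calculation ourselves.
total = 0
for value in values:
    total += value
manual_mean = total / len(values)

# Library version: the same result from a function someone else wrote.
library_mean = statistics.mean(values)

print(manual_mean, library_mean)
```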

Pandas:

• The name pandas is derived from "panel data". It is the core library for data manipulation and data analysis.

• It provides one-dimensional and multidimensional data structures for data analysis, such as the one-dimensional Series and the two-dimensional DataFrame.

• To install pandas on Ubuntu, use the following command:

• sudo apt-get install python3-pandas

• To install pandas on Linux distributions that use yum, use the following command:

• sudo yum install python3-pandas

• To install pandas on Windows, use the following command:

• pip install pandas
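Once pandas is installed, a short script can confirm that it works. This is a minimal sketch with made-up sales figures, and the column names are hypothetical. It builds a two-dimensional DataFrame, pulls out a one-dimensional Series, and summarizes the data by group.

```python
import pandas as pd

# A DataFrame is a two-dimensional table with labelled columns.
sales = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south"],
        "units": [10, 7, 5, 12],
    }
)

# Selecting a single column returns a one-dimensional Series.
units = sales["units"]

# Total units sold per region.
units_per_region = sales.groupby("region")["units"].sum()
print(units_per_region)
```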

Matplotlib:

• Matplotlib is used for data visualization and is one of the most important Python packages.

• Matplotlib displays and visualizes 2D data, and it is written in Python.

• It can be used in Python scripts, Jupyter notebooks, and web application servers.

• To install the Matplotlib library on Ubuntu, use the following command:

• sudo apt-get install python3-matplotlib

• To install the Matplotlib library on Linux distributions that use yum, use the following command:

• sudo yum install python3-matplotlib

• To install the Matplotlib library on Windows, use the following command:

• pip install matplotlib
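A quick plot is a simple way to check the installation. The sketch below is one possible example: the data points and the output file name `line_plot.png` are arbitrary, and the non-interactive `Agg` backend is chosen so that the script also runs on a server without a display.

```python
import matplotlib

matplotlib.use("Agg")  # render to a file instead of opening a window
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
y = [value ** 2 for value in x]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o")
ax.set_xlabel("x")
ax.set_ylabel("x squared")
ax.set_title("A simple 2D line plot")

# Save the figure as an image in the current directory.
fig.savefig("line_plot.png")
plt.close(fig)
```

On a desktop machine you can drop the `matplotlib.use("Agg")` line and call `plt.show()` instead of saving the file.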

NumPy:

• NumPy is the fundamental package for numerical computing in Python.

• NumPy is used together with the SciPy and Matplotlib packages, and it is freely available as open source.
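The following minimal sketch shows the style of numerical work NumPy is meant for. The temperature readings are invented sample values.

```python
import numpy as np

# A NumPy array holds numbers of one type in a compact block of memory.
temperatures_c = np.array([18.5, 21.0, 19.5, 23.0])

# Arithmetic applies to every element at once, with no explicit loop.
temperatures_f = temperatures_c * 9 / 5 + 32

mean_c = temperatures_c.mean()
max_f = temperatures_f.max()
print(mean_c, max_f)
```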

SymPy:

• SymPy is a Python library for symbolic mathematics, and it can work with complex algebraic formulas.
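As a brief sketch of what symbolic mathematics means in practice, the example below expands, solves, and differentiates a polynomial exactly, without numerical approximation. The polynomial itself is an arbitrary choice, and the script assumes SymPy has been installed (for example with `pip install sympy`).

```python
import sympy as sp

x = sp.symbols("x")

# Expand a product into a polynomial, symbolically.
expanded = sp.expand((x + 2) * (x - 3))

# Solve the polynomial equal to zero exactly; the roots come back as a list.
roots = sp.solve(expanded, x)

# Differentiate the polynomial with respect to x.
derivative = sp.diff(expanded, x)
print(expanded, roots, derivative)
```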

R:

• R is a programming language used for statistical computing and graphics.

• R is used by data engineers, data scientists, statisticians, and data miners for developing software and performing data analytics.

• Before learning R, you should understand the concept of libraries and packages and how to work with them, because much of the work in R depends on them.

• Related R packages include sqldf, forecast, dplyr, stringr, lubridate, ggplot2, and reshape.

Friday, February 11, 2022
