RETRIEVE SUPER STEP
4.0 OBJECTIVES
• The objective of this chapter is to explain in detail the core operations of the Retrieve superstep.
• This chapter explains important guidelines which, if followed, will prevent the data lake from turning into a data swamp.
• This chapter also explains an important example related to shipping terminology, called Incoterm.
• Finally, this chapter explains the different possible data sources to retrieve data from.
4.1 INTRODUCTION
• The Retrieve super step is a practical method for importing a data lake consisting of different external data sources completely into the processing ecosystem.
• The Retrieve super step is the first contact between your data science and the source systems.
• The successful retrieval of the data is a major stepping-stone to ensuring that you are performing good data science.
• Data lineage delivers the audit trail of the data elements at the lowest granular level, to ensure full data governance.
• Data governance supports metadata management for system guidelines, processing strategies, policies formulation, and implementation of processing.
• Data quality and master data management helps to enrich the data lineage with more business values, if you provide complete data source metadata.
• The Retrieve super step supports the edge of the ecosystem, where your data science makes direct contact with the outside data world. I will recommend a current set of data structures that you can use to handle the deluge of data you will need to process to uncover critical business knowledge.
4.2 DATA LAKES
• A company’s data lake covers all data that your business is authorized to process, to attain an improved profitability of your business’s core accomplishments.
• The data lake is the complete data world your company interacts with during its business life span.
• In simple terms, if you generate data or consume data to perform your business tasks, that data is in your company’s data lake.
• Just as a lake needs rivers and streams to feed it, the data lake will consume an unavoidable deluge of data sources from upstream and deliver it to downstream partners.
4.3 DATA SWAMPS
• Data swamps are simply data lakes that are not managed.
• They are not to be feared. They need to be tamed.
• Following are four critical steps to avoid a data swamp.
1. Start with Concrete Business Questions
2. Data Quality
3. Audit and Version Management
4. Data Governance
4.3.1 Start with Concrete Business Questions:
• Simply dumping a horde of data into a data lake, with no tangible purpose in mind, will result in a big business risk.
• The data lake must be enabled to collect the data required to answer your business questions.
• It is suggested to perform a comprehensive analysis of the entire set of data you have and then apply a metadata classification for the data, stating full data lineage for allowing it into the data lake.
4.3.2 Data Quality:
• More data points do not mean that data quality is less relevant.
• Data quality can cause the invalidation of a complete data set, if not dealt with correctly.
4.3.3 Audit and Version Management:
• You must always report the following:
• Who used the process?
• When was it used?
• Which version of code was used?
4.3.4 Data Governance:
• The role of data governance, data access, and data security does not go away with the volume of data in the data lake.
• If not managed, it simply compounds into a worse problem.
• Data Governance can be implemented in the following ways:
• Data Source Catalog
• Business Glossary
• Analytical Model Usage
4.3.4.1 Data Source Catalog:
• Metadata that link ingested data-to-data sources are a must-have for any data lake.
• Data processing should include the following rules (a minimal sketch of a complete catalog entry follows this list):
• Unique data catalog number
• Use the format YYYYMMDD/NNNNNN/NNN.
• E.g., 20171230/000000001/001 for data first registered into the metadata registers on December 30, 2017, as data source 1 of data type 1.
• This is a critical requirement.
• Short description (it should be under 100 characters)
• Example: country codes and country names (ISO 3166 defines country codes as per United Nations sources)
• Long description (it should be kept as complete as possible)
• Example: country codes and country names used by your organization as the standard for country entries
• Contact information for the external data source
• Example: ISO 3166-1:2013 code lists from www.iso.org/iso-3166-country-codes.html
• Expected frequency
• Example: irregular, i.e., no fixed frequency, also known as ad hoc. Other options are near-real-time, every 5 seconds, every minute, hourly, daily, weekly, monthly, or yearly.
• Internal business purpose
• Validate country codes and names.
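• Example (a minimal Python sketch, not from the book, of one catalog entry built with the rules above; the build_catalog_number() helper and the field values are illustrative assumptions):
from datetime import date

def build_catalog_number(registered_on, source_number, type_number):
    # Format the unique data catalog number, padded to match the worked
    # example 20171230/000000001/001 above.
    return f"{registered_on:%Y%m%d}/{source_number:09d}/{type_number:03d}"

catalog_entry = {
    "catalog_number": build_catalog_number(date(2017, 12, 30), 1, 1),
    "short_description": "Country codes and country names",  # under 100 characters
    "long_description": "Country codes and country names used by the organization "
                        "as the standard for country entries.",
    "contact": "ISO 3166-1:2013 code lists from www.iso.org/iso-3166-country-codes.html",
    "expected_frequency": "ad hoc",
    "internal_business_purpose": "Validate country codes and names.",
}

print(catalog_entry["catalog_number"])  # -> 20171230/000000001/001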
4.3.4.2 Business Glossary:
• The business glossary maps the data-source fields and classifies them into respective lines of business.
• This glossary is a must-have for any good data lake.
• The business glossary records the data sources ready for the retrieve processing to load the data.
• Create a data-mapping registry with the following information (a minimal sketch of one registry record follows this list):
• Unique data catalog number: use YYYYMMDD/NNNNNN/NNN.
• Unique data mapping number: use NNNNNNN/NNNNNNNNN. E.g., 0000001/000000001 for field 1 mapped to internal field 1.
• External data source field name: States the field as found in the raw data source
• External data source field type: Records the full set of the field’s data types when loading the data lake
• Internal data source field name: Records every internal data field name to use once loaded from the data lake
• Internal data source field type: Records the full set of the field’s types to use internally once loaded
• Timestamp of last verification of the data mapping: use YYYYMMDD-HHMMSS-SSS, which supports timestamps down to a thousandth of a second.
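• Example (a minimal Python sketch of one data-mapping registry record using the formats above; the structure, the mapping_timestamp() helper, and the field values are assumptions for illustration, not prescribed by the book):
from datetime import datetime

def mapping_timestamp(ts):
    # Format YYYYMMDD-HHMMSS-SSS, i.e. down to a thousandth of a second.
    return ts.strftime("%Y%m%d-%H%M%S-") + f"{ts.microsecond // 1000:03d}"

mapping_record = {
    "catalog_number": "20171230/000000001/001",
    "mapping_number": f"{1:07d}/{1:09d}",  # 0000001/000000001
    "external_field_name": "Country",
    "external_field_type": "string",
    "internal_field_name": "CountryName",
    "internal_field_type": "string",
    "last_verified": mapping_timestamp(datetime.now()),
}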
4.3.4.3 Analytical Model Usage:
• Data tagged in the respective analytical models defines the profile of the data that requires loading and guides the data scientist to what additional processing is required.
• The following data analytical models should be executed on every data set in the data lake by default.
• Data Field Name Verification
• Unique Identifier of Each Data Entry
• Data Type of Each Data Column
• Histograms of Each Column
• Minimum Value
• Maximum Value
• Mean
• Median
• Mode
• Range
• Quartiles
• Standard Deviation
• Skewness
• Missing or Unknown Values
• Data Pattern
• The models can be applied using R or Python; here, we will use R.
• The data set used to demonstrate the models is INPUT_DATA.csv.
• Data Field Name Verification
• This is used to validate and verify the data field names during the retrieve processing.
• Example
library(tibble)
set_tidy_names(INPUT_DATA, syntactic = TRUE, quiet = FALSE)
• Reveals field names that are not easy to use
• Unique Identifier of Each Data Entry
• Allocate a unique identifier within the system that is independent of the given file name.
• This ensures that the system can handle different files from different paths and keep track of all data entries in an effective manner.
• Then allocate a unique identifier for each record or data element in the files that are retrieved.
• Example: to add the unique identifier, run the following command
INPUT_DATA_with_ID = rowid_to_column(INPUT_DATA_FIX, var = "Row_ID")
• Data Type of Each Data Column
• Determine the best data type for each column, to assist you in completing the business glossary, to ensure that you record the correct import processing rules.
• Example: to find the data type of each column
sapply(INPUT_DATA_with_ID, typeof)
• Histograms of Each Column
• I always generate a histogram across every column, to determine the spread of the data value.
• Example: to compute histogram
library(data.table)
country_histogram = data.table(Country = unique(INPUT_DATA_with_ID[is.na(INPUT_DATA_with_ID['Country']) == 0, ]$Country))
• Minimum Value
• Determine the minimum value in a specific column.
• Example: find minimum value
min(country_histogram$Country)
or
sapply(country_histogram[,'Country'], min, na.rm=TRUE)
• Maximum Value
• Determine the maximum value in a specific column.
• Example: find maximum value
max(country_histogram$Country)
or
sapply(country_histogram[,'Country'], max, na.rm=TRUE)
• Mean
• If the column is numeric in nature, determine the average value in a specific column.
• Example: find mean of latitude
sapply(lattitue_histogram_with_id[,'Latitude'], mean, na.rm=TRUE)
• Median
• Determine the value that splits the data set into two parts in a specific column.
• Example:find median of latitude
sapply(lattitue_histogram_with_id[,'Latitude'], median, na.rm=TRUE)
• Mode
• Determine the value that appears most in a specific column.
• Example: Find mode for column country
INPUT_DATA_COUNTRY_FREQ = data.table(with(INPUT_DATA_with_ID, table(Country)))
• Range
• For numeric values, you determine the range of the values by taking the maximum value and subtracting the minimum value.
• Example: find range of latitude
sapply(lattitue_histogram_with_id[,'Latitude'], range, na.rm=TRUE)
• Quartiles
• These are the base values that divide a data set in quarters. This is done by sorting the data column first and then splitting it in groups of four equal parts.
• Example: find quartile of latitude
sapply(lattitue_histogram_with_id[,'Latitude'], quantile, na.rm=TRUE)
• Standard Deviation
• The standard deviation is a measure of the amount of variation or dispersion of a set of values.
• Example: find standard deviation of latitude
sapply(lattitue_histogram_with_id[,'Latitude'], sd, na.rm=TRUE)
• Skewness
• Skewness describes the shape or profile of the distribution of the data in the column.
• Example: find skewness of latitude
library(e1071)
skewness(lattitue_histogram_with_id$Latitude, na.rm = FALSE, type = 2)
• Missing or Unknown Values
• Identify if you have missing or unknown values in the data sets.
• Example: find missing values in the country column
missing_country = data.table(Country = unique(INPUT_DATA_with_ID[is.na(INPUT_DATA_with_ID['Country']) == 1, ]))
• Data Pattern
• I have used the following process for years, to determine a pattern of the data values themselves.
• Here is my standard version:
• Replace all alphabet values with an uppercase A, all numbers with an uppercase N, all spaces with a lowercase b, and all other unknown characters with a lowercase u.
• As a result, "Data Science 102" becomes "AAAAbAAAAAAAbNNN". This pattern creation is beneficial for designing any specific assessment rules.
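• Example (a minimal Python sketch of the pattern rule above; the book's own examples use R, and the value_pattern() helper is an assumed equivalent, not the book's code):
def value_pattern(value):
    # Replace letters with A, digits with N, spaces with b, anything else with u.
    out = []
    for ch in str(value):
        if ch.isalpha():
            out.append("A")
        elif ch.isdigit():
            out.append("N")
        elif ch == " ":
            out.append("b")
        else:
            out.append("u")
    return "".join(out)

print(value_pattern("Data Science 102"))  # -> AAAAbAAAAAAAbNNN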
4.4 TRAINING THE TRAINER MODEL
• To prevent a data swamp, it is essential that you also train your team; data science is a team effort.
• People, process, and technology are the three cornerstones to ensure that data is curated and protected.
• You are responsible for your people; share the knowledge you acquire from this book. The process I teach you, you need to teach them. Alone, you cannot achieve success.
• Technology requires that you invest time to understand it fully. We are only at the dawn of major developments in the field of data engineering and data science.
• Remember: A big part of this process is to ensure that business users and data scientists understand the need to start small, have concrete questions in mind, and realize that there is work to do with all data to achieve success.
4.5 SHIPPING TERMINOLOGIES
In this section, we discuss two things: shipping terms and Incoterm 2010.
4.5.1 Shipping Terms
• These determine the rules of the shipment and the conditions under which it is made. Normally, these are stated on the shipping manifest.
• Following are the terms used:
• Seller - The person/company sending the products on the shipping manifest is the seller. This is not a location but a legal entity sending the products.
• Carrier - The person/company that physically carries the products on the shipping manifest is the carrier. Note that this is not a location but a legal entity transporting the products.
• Port - A Port is any point from which you have to exit or enter a country. Normally, these are shipping ports or airports but can also include border crossings via road. Note that there are two ports in the complete process. This is important. There is a port of exit and a port of entry.
• Ship - Ship is the general term for the physical transport method used for the goods. This can refer to a cargo ship, airplane, truck, or even person, but it must be identified by a unique allocation number.
• Terminal - A terminal is the physical point at which the goods are handed off for the next phase of the physical shipping.
• Named Place - This is the location where the ownership is legally changed from seller to buyer. This is a specific location in the overall process. Remember this point, as it causes many legal disputes in the logistics industry.
• Buyer - The person/company receiving the products on the shipping manifest is the buyer. In our case, there will be warehouses, shops, and customers. Note that this is not a location but a legal entity receiving the products.
4.5.2 Incoterm 2010:
• Incoterm 2010 is a summary of the basic options, as determined and published by the International Chamber of Commerce (ICC).
• Each option specifies which party has an obligation to pay if something happens to the product being shipped (i.e., if the product is damaged or destroyed en route before it reaches the buyer).
• EXW—Ex Works
• Here the seller will make the product or goods available at his premises or at another named place. The term EXW puts the minimum obligations on the seller of the product/item and the maximum obligation on the buyer.
• Here is the data science version: If I were to buy an item at a local store and take it home, and the shop has shipped it EXW—Ex Works, the moment I pay at the register, the ownership is transferred to me. If anything happens to the item, I would have to pay to replace it.
• FCA—Free Carrier
• In this condition, the seller is expected to deliver the product or goods, that are cleared for export, at a named place.
• The data science version: If I were to buy an item at an overseas duty-free shop and then pick it up at the duty-free desk before taking it home, and the shop has shipped it FCA—Free Carrier—to the duty-free desk, the moment I pay at the register, the ownership is transferred to me, but if anything happens to the item between the shop and the duty-free desk, the shop will have to pay.
• It is only once I pick it up at the desk that I will have to pay if anything happens. So, the moment I take the item, the transaction becomes EXW, and I have to pay any necessary import duties on arrival in my home country.
• CPT—Carriage Paid To
• Under this term, the seller is expected to pay for the carriage of the product or goods up to the named place of destination.
• The moment the product or goods are delivered to the first carrier, they are considered to be delivered, and the risk gets transferred to the buyer.
• All costs, including origin costs, export clearance, and freight costs for carriage to the named place of destination, have to be paid by the seller. The named place of destination could be the final destination, such as the buyer's facility, or a port at the destination country. This has to be agreed upon by both seller and buyer in advance.
• The data science version: If I were to buy an item at an overseas store and then pick it up at the export desk before taking it home, and the shop shipped it CPT—Carriage Paid To—the duty desk for free, the moment I pay at the register, the ownership is transferred to me, but if anything happens to the item between the shop and the duty desk of the shop, I will have to pay.
• It is only once I have picked up the item at the desk that I have to pay if anything happens. So, the moment I take the item, the transaction becomes EXW, so I must pay any required export and import duties on arrival in my home country.
• CIP—Carriage and Insurance Paid To
• The seller has to obtain insurance for the goods while they are being shipped.
• The data science version: If I were to buy an item at an overseas store and then pick it up at the export desk before taking it home, and the shop has shipped it CIP—Carriage and Insurance Paid—to the duty desk for free, the moment I pay at the register, the ownership is transferred to me. However, if anything happens to the item between the shop and the duty desk at the shop, the insurance taken out for the shipment pays for the damage.
• It is only once I have picked it up at the desk that I have to pay if anything happens. So, the moment I take the item, it becomes EXW, so I have to pay any export and import duties on arrival in my home country. Note that the insurance only covers the portion of the transaction between the shop and the duty desk.
• DAT—Delivered at Terminal
• According to this term, the seller has to deliver and unload the goods at a named terminal. The seller assumes all risks till the delivery at the destination and has to pay all incurred costs of transport, including export fees, carriage, unloading from the main carrier at the destination port, and destination port charges.
• The terminal can be a port, airport, or inland freight interchange, but it must be a facility with the capability to receive the shipment. If the seller is not able to organize unloading, it should consider shipping under DAP terms instead. All charges after unloading (for example, import duty, taxes, customs and on-carriage costs) are to be borne by buyer.
• The data science version: If I were to buy an item at an overseas store and then pick it up at a local store before taking it home, and the overseas shop shipped it DAT—Delivered at Terminal (Local Shop)—the moment I pay at the register, the ownership is transferred to me.
• However, if anything happens to the item between the payment and the pickup, the local shop pays. It is only once it is picked up at the local shop that I have to pay if anything happens. So, the moment I take it, the transaction becomes EXW, so I have to pay any import duties on arrival in my home country.
• DAP—Delivered at Place
• Under this option, the seller delivers the goods at a given place of destination. Here, the risk will pass from seller to buyer at the destination point.
• Packaging cost at the origin has to be paid by the seller; also, all the legal formalities in the exporting country will be carried out by the seller at his own expense.
• Once the goods are delivered in the destination country the buyer has to pay for the customs clearance.
• Here is the data science version: If I were to buy 100 pieces of a particular item from an overseas web site and then pick up the copies at a local store before taking them home, and the shop shipped the copies DAP—Delivered at Place (Local Shop)—the moment I paid at the register, the ownership would be transferred to me. However, if anything happened to the items between the payment and the pickup, the web site owner pays. Once the 100 pieces are picked up at the local shop, I have to pay to unpack them at the store. So, the moment I take the copies, the transaction becomes EXW, so I will have to pay any costs after I take the copies.
• DDP—Delivered Duty Paid
• Here the seller is responsible for the delivery of the products or goods to an agreed destination place in the country of the buyer. The seller has to pay for all expenses like packing at origin, delivering the goods to the destination, import duties and taxes, clearing customs etc.
• The seller is not responsible for unloading. This term DDP will place the minimum obligations on the buyer and maximum obligations on the seller. Neither the risk nor responsibility is transferred to the buyer until delivery of the goods is completed at the named place of destination.
• Here is the data science version: If I were to buy 100 of an item at an overseas web site and then pick them up at a local store before taking them home, and the shop shipped DDP—Delivered Duty Paid (my home)—the moment I pay at the register, the ownership is transferred to me. However, if anything were to happen to the items between the payment and the delivery at my house, the store must replace the items, as the term covers delivery to my house.
4.6 OTHER DATA SOURCES / STORES
• While performing data retrieval, you may have to work with one of the following data stores:
• SQLite
• This requires a package named sqlite3.
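• Example (a minimal sketch; the database file and table names are illustrative assumptions):
import sqlite3
import pandas as pd

conn = sqlite3.connect("vermeulen.db")  # assumed local SQLite file
input_data = pd.read_sql_query("SELECT * FROM input_data", conn)
conn.close()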
• Microsoft SQL Server
• Microsoft SQL server is common in companies, and this connector supports your connection to the database. Via the direct connection, use
from sqlalchemy import create_engine
engine = create_engine('mssql+pymssql://scott:tiger@hostname:port/folder')
• Oracle
• Oracle is a common database storage option in bigger companies. It enables you to load data from the following data source with ease:
from sqlalchemy import create_engine
engine = create_engine('oracle://andre:vermeulen@127.0.0.1:1521/vermeulen')
• MySQL
• MySQL is widely used by lots of companies for storing data. This opens that data to your data science with the change of a simple connection string.
• There are two options. For direct connect to the database, use
from sqlalchemy import create_engine
engine = create_engine('mysql+mysqldb://scott:tiger@localhost/vermeulen')
• Apache Cassandra
• Cassandra is becoming a widely used distributed database engine in the corporate world.
• To access it, use the Python package cassandra.
from cassandra.cluster import Cluster
cluster = Cluster()
session = cluster.connect('vermeulen')
• Apache Hadoop
• Hadoop is one of the most successful data lake ecosystems in highly distributed data science.
• The pydoop package includes a Python MapReduce and HDFS API for Hadoop.
• Pydoop
• It is a Python interface to Hadoop that allows you to write MapReduce applications and interact with HDFS in pure Python.
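• Example (a minimal sketch using the pydoop HDFS API; the paths are illustrative and a reachable Hadoop cluster is assumed):
import pydoop.hdfs as hdfs

print(hdfs.ls("/data"))  # list files under an HDFS path
with hdfs.open("/data/INPUT_DATA.csv", "r") as f:
    first_line = f.readline()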
• Microsoft Excel
• Excel is common in the data sharing ecosystem, and it enables you to load files using this format with ease.
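• Example (a minimal sketch using pandas; the workbook name and sheet are illustrative assumptions):
import pandas as pd

input_data = pd.read_excel("INPUT_DATA.xlsx", sheet_name=0)
print(input_data.head())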
• Apache Spark
• Apache Spark is now becoming the next standard for distributed data processing. The universal acceptance and support of the processing ecosystem is starting to turn mastery of this technology into a must-have skill.
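• Example (a minimal PySpark sketch; it assumes a local Spark installation and reuses the INPUT_DATA.csv file referenced earlier):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("retrieve-step").getOrCreate()
input_df = spark.read.csv("INPUT_DATA.csv", header=True, inferSchema=True)
input_df.printSchema()
spark.stop()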
• Apache Hive
• Access to Hive opens its highly distributed ecosystem for use by data scientists.
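• Example (one common option, assumed here rather than named in the text, is the PyHive package; the host, port, and table are illustrative):
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()
cursor.execute("SELECT * FROM input_data LIMIT 10")
rows = cursor.fetchall()
conn.close()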
• Luigi
• Luigi provides a set of Python features that enable you to build complex pipelines of batch jobs. It handles dependency resolution and workflow management as part of the package.
• This will save you from performing complex programming while enabling good-quality processing.
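• Example (a minimal sketch of a Luigi batch task; the task and file names are illustrative assumptions):
import luigi

class RetrieveInputData(luigi.Task):
    def output(self):
        return luigi.LocalTarget("INPUT_DATA_retrieved.csv")

    def run(self):
        # Copy the raw file as a stand-in for real retrieve processing.
        with open("INPUT_DATA.csv") as src, self.output().open("w") as dst:
            dst.write(src.read())

if __name__ == "__main__":
    luigi.build([RetrieveInputData()], local_scheduler=True)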
• Amazon S3 Storage
• S3, or Amazon Simple Storage Service (Amazon S3), provides simple and practical methods to collect, store, and analyze data at massive scale, irrespective of format. I store most of my base data in S3, as it is cheaper than most other methods.
• Package s3 - Python's s3 module connects to Amazon's S3 REST API.
• Package Boto - The Boto package is another useful tool that connects to Amazon's S3 REST API.
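• Example (a minimal sketch reading an object from S3 with boto3; credentials are assumed to be configured in the environment, and the bucket and key names are illustrative):
import boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-data-lake", Key="raw/INPUT_DATA.csv")
raw_bytes = obj["Body"].read()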
• Amazon Redshift
• Amazon Redshift is a cloud service that provides a fully managed, petabyte-scale data warehouse.
• The Python package redshift-sqlalchemy is an Amazon Redshift dialect for SQLAlchemy that opens this data source to your data science.
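• Example (a minimal sketch using the Redshift SQLAlchemy dialect mentioned above; the cluster endpoint and credentials are placeholders):
from sqlalchemy import create_engine

engine = create_engine(
    'redshift+psycopg2://scott:tiger@examplecluster.abc123.us-west-2.redshift.amazonaws.com:5439/vermeulen'
)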
• Amazon Web Services
• The boto3 package is a Python library that provides interfaces to Amazon Web Services.
4.7 UNIT END QUESTIONS
1. Explain the Retrieve Superstep.
2. Explain Data Lakes and Data Swamps.
3. Explain the general rules for data source catalog.
4. State and explain the four critical steps to avoid data swamps.
5. Why is it necessary to train the data science team?
6. Explain the following shipping terms:
i. Seller
ii. Carrier
iii. Port
iv. Ship
v. Terminal
vi. Named Place
vii. Buyer
7. Explain the following shipping terms with example:
i. Ex Works
ii. Free Carrier
iii. Carriage Paid To
iv. Carriage and Insurance Paid To
v. Delivered at Terminal
vi. Delivered at Place
vii. Delivered Duty Paid
8. List and explain the different data stores used in data science.
4.8 REFERENCES
Books:
• Andreas François Vermeulen, “Practical Data Science - A Guide to Building the Technology Stack for Turning Data Lakes into Business Assets”
Websites:
• https://www.aitworldwide.com/incoterms
• Incoterm: https://www.ntrpco.com/what-is-incoterms-part2/