

Big Data Related Technologies, Challenges and Future Prospects

 

Open Issues and Outlook

Abstract In the previous chapters, we reviewed the background and state-of-the-art of big data. Fig. 7.1 illustrates all the key technologies of big data introduced in this book. In this chapter, we summarize the research hot spots and suggest possible research directions of big data. We also discuss potential development trends in this broad research and application area.

 

7.1 Open Issues

The analysis of big data is confronted with many challenges, and the current research is still in an early phase. Considerable research efforts are needed to improve the efficiency of data display, data storage, and data analysis.

 

7.1.1 Theoretical Research

Although big data is a hot research area in both academia and industry, many important problems remain to be solved, as discussed below.

• Fundamental Problems: There is a compelling need for a rigorous definition of big data, a structural model of big data, a formal description of big data, and a theoretical system of data science. At present, many discussions of big data look more like commercial speculation than scientific research, because big data is not formally and structurally defined and claims about it are not strictly verified.

• Standardization: An evaluation system of data quality and an evaluation standard of data computing efficiency should be developed. Many big data solutions claim that they improve data processing and analysis capacities in all aspects, but there is still no unified evaluation standard or benchmark to measure the computing efficiency of big data with rigorous mathematical methods. Performance can only be evaluated once a system is implemented and deployed, which makes it impossible to horizontally compare the advantages and disadvantages of different solutions, or to compare efficiencies before and after the adoption of big data. In addition, since data quality is an important basis of data preprocessing, simplification, and screening, effectively evaluating data quality is also an urgent problem.

• Evolution of Big Data Computing Modes: This includes the external storage mode, data flow mode, PRAM mode, and MR (MapReduce) mode, etc. The emergence of big data has triggered developments in algorithm design, which has transformed from a computing-intensive approach into a data-intensive approach. Data transfer has become a main bottleneck of big data computing. Therefore, many new computing models tailored for big data have emerged and more such models are on the horizon; the MR mode is sketched below.
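To make the MR (MapReduce) mode concrete, the following is a minimal, single-machine Python sketch of the map, shuffle, and reduce phases applied to word counting. The function names and the in-memory shuffle are illustrative assumptions; in a real framework these phases run distributed across many machines.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit (key, value) pairs; here, (word, 1) for every word."""
    for word in text.lower().split():
        yield word, 1

def shuffle(mapped_pairs):
    """Shuffle: group all values by key (done across the network in a real cluster)."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the values for one key; here, sum the counts."""
    return key, sum(values)

if __name__ == "__main__":
    documents = {
        "doc1": "big data needs new computing models",
        "doc2": "data transfer is the bottleneck of big data computing",
    }
    mapped = [pair for doc_id, text in documents.items()
              for pair in map_phase(doc_id, text)]
    grouped = shuffle(mapped)
    counts = dict(reduce_phase(k, v) for k, v in grouped.items())
    print(counts)  # e.g. {'big': 2, 'data': 3, ...}
```

The data-intensive character of the mode shows up in the shuffle step: in practice, moving the intermediate pairs between machines, not the arithmetic, dominates the cost.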

 

7.1.2 Technology Development

Big data technology is still in its infancy. Many key technical problems, such as cloud computing, grid computing, stream computing, parallel computing, big data architectures, big data models, and software systems supporting big data, should be fully investigated.

• Format Conversion: Due to the wide variety of data sources, heterogeneity is a persistent characteristic of big data, as well as a key factor restricting the efficiency of data format conversion. If such format conversion can be made more efficient, the application of big data may create more value.

• Big Data Transfer: Big data transfer involves big data generation, acquisition, transmission, storage, and other data transformations in the spatial domain. As discussed, big data transfer usually incurs high costs and is the bottleneck for big data computing. However, data transfer is inevitable in big data applications. Improving the transfer efficiency of big data is therefore a key factor for improving big data computing.

• Real-time Performance: The real-time performance of big data is also a core problem in many different application scenarios. How to define the life cycle of data, compute the depreciation rate of data, and build computing models for real-time and online applications will influence the value, analytical results, and feedback of big data.

As big data research advances, new problems in big data processing arise beyond traditional simple data analysis, including: (a) data re-utilization: since big data features high value but low density, more value may be mined from re-utilization of existing data as the data scale increases; (b) data re-organization: datasets in different businesses can be re-organized, and the total value of the re-organized data is larger than the total value of the individual datasets; (c) data exhaust: unstructured information or data that is a by-product of the online activities of Internet users. In big data, not only should correct data be utilized, but erroneous data can also be exploited to generate value. Collecting and analyzing data exhaust can provide valuable insight into the purchasing habits of consumers.

 

7.1.3 Practical Implications

Although there are already many successful big data applications, many practical problems remain to be solved:

• Big Data Management: The emergence of big data brings new challenges to traditional data management. At present, many research efforts are being devoted to big data oriented database and Internet technologies, management of storage models and databases for new hardware, integration of heterogeneous and multi-structured data, data management for mobile and pervasive computing, data management for SNS, and distributed data management.

• Searching, Mining, and Analysis of Big Data: Data processing is always a research hotspot in the big data field, e.g., searching and mining of SNS models, big data searching algorithms, distributed searching, P2P searching, visualized analysis of big data, massive recommendation systems, social media systems, real-time big data mining, image mining, text mining, semantic mining, multi-structured data mining, and machine learning, etc.

• Integration and Provenance of Big Data: As discussed, the value acquired from the comprehensive utilization of multiple datasets is higher than the total value of the individual datasets. Therefore, the integration of different data sources is a timely problem to be solved. Data integration combines datasets from different sources and is confronted with many challenges, such as differing data patterns and large amounts of redundant data. Data provenance describes the process of data generation and evolution over time. In the big data era, data provenance is mainly used to investigate multiple datasets rather than a single dataset. Therefore, how to integrate provenance information featuring different standards and coming from different datasets is worth studying.

• Big Data Application: At present, the application of big data is just beginning, and we shall explore more efficient ways to fully utilize big data. Therefore, big data applications in science, engineering, medicine, medical care, finance, business, law enforcement, education, transportation, retail, and telecommunication, big data applications in small and medium-sized businesses, big data applications in government departments, big data services, and human-computer interaction with big data, etc. are all important research problems.

 

7.1.4 Data Security

In IT, security and privacy have always been two key concerns. In the big data era, as data volume grows rapidly, security risks become more severe, while traditional data protection methods have been shown to be inapplicable to big data. In particular, big data is confronted with the following security related challenges.

• Big Data Privacy: In the big data era, data privacy includes two aspects: (a) the protection of personal privacy: as advances in data acquisition are made, users' interests, habits, and physical characteristics, etc. may be more easily acquired, possibly without the user being aware of it; (b) personal privacy data may also be leaked during storage, transmission, and usage, even if it was acquired with the permission of users. Facebook is currently deemed the big data company with the most SNS data. Organizations that own big data usually attempt to mine valuable information from the data with advanced algorithms, so privacy-preserving data technologies are of great importance. According to one report, Ron Bowes, a researcher at Skull Security, used an information acquisition tool to harvest data from the public pages of Facebook users who had failed to modify their privacy settings. Ron Bowes packaged such data into a 2.8 GB archive and created a BitTorrent seed for others to download. The analysis capacity of big data may lead to privacy mining from seemingly innocuous information. Therefore, privacy protection in the big data era will become a new and challenging problem.

• Data Quality: Data quality influences big data utilization. Low quality data wastes transmission and storage resources and may not be usable. Many factors may restrict data quality; for example, generation, acquisition, and transmission may all influence it. Data quality is mainly manifested in accuracy, completeness, redundancy, and consistency. Even though many measures have been taken to improve data quality, the related problems cannot be completely solved. Therefore, effective methods to automatically detect data quality and to repair damaged data need to be investigated.

• Big Data Security Mechanisms: Big data brings challenges to data encryption due to its large scale and high variety. The performance of existing encryption methods designed for small and medium-scale data cannot meet the demands of big data, so efficient big data cryptography approaches shall be developed. Effective schemes for security management, access control, and secure communication shall be investigated for structured, semi-structured, and unstructured data. In addition, under the multi-tenant mode, isolation, confidentiality, integrity, availability, controllability, and traceability of tenants' data should be enabled without sacrificing efficiency.

• Big Data Application in Information Security: Big data not only brings challenges to information security, but also offers new opportunities for the development of information security mechanisms. For example, potential security loopholes and APTs (Advanced Persistent Threats) may be discovered by analyzing big data in the form of log files of an Intrusion Detection System, as sketched below. In addition, virus characteristics, loophole characteristics, and attack characteristics, etc. may also be more easily identified through the analysis of big data.
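As a toy illustration of the log-analysis idea in the last bullet, the sketch below scans simplified intrusion-detection records and flags source addresses with an unusually high number of failed logins. The record format, field values, and threshold are assumptions made for illustration, not part of any real IDS.

```python
from collections import Counter

# Hypothetical, simplified IDS log records: (source_ip, event_type)
LOG = [
    ("10.0.0.5", "login_failed"), ("10.0.0.5", "login_failed"),
    ("10.0.0.5", "login_failed"), ("10.0.0.5", "login_failed"),
    ("10.0.0.7", "login_failed"), ("10.0.0.8", "login_ok"),
    ("10.0.0.9", "login_ok"),     ("10.0.0.7", "login_ok"),
]

def suspicious_sources(log, max_failures=3):
    """Flag sources with more failed logins than a fixed threshold in the log window."""
    failures = Counter(ip for ip, event in log if event == "login_failed")
    return [ip for ip, count in failures.items() if count > max_failures]

print(suspicious_sources(LOG))  # ['10.0.0.5'] with the sample records above
```

A real deployment would of course correlate many more signals over long time windows, which is precisely where big data storage and analysis techniques become necessary.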

To sum up, the security of big data has drawn great attention from researchers. However, there is only limited research on the representation of multi-source heterogeneous big data, measurement and semantic comprehension methods, modeling theories and computing models, distributed storage with energy-efficiency optimization, and the supporting hardware and software system architectures, etc. In particular, big data security, including big data credibility, big data backup and recovery technologies in various application fields, big data integrity maintenance technology, and big data security technology, should be further investigated.

 

7.2 Outlook

The emergence of big data opens great opportunities. In the IT era, the "T" (Technology) was the main concern, and technology drove the development of data. In the big data era, with the prominence of data value and advances in the "I" (Information), data will drive the progress of technologies. Big data will not only change social and economic life, but also influence everyone's ways of living and thinking; this change is just beginning. We cannot predict the future, but we may take precautions for possible future events.

• Data With a Larger Scale, More Variety, and More Complex Structures: Although technologies represented by Hadoop have achieved great success, such technologies are bound to fall behind and be replaced given the rapid development of big data. For example, the theoretical basis of Hadoop emerged as early as 2006. Many researchers have been exploring ways to better cope with larger-scale, more varied, and more complexly structured data. These efforts are represented by Google's globally distributed database Spanner and the fault-tolerant, scalable distributed relational database F1. In the future, the storage technology of big data will be based on distributed databases, support transaction mechanisms similar to those of relational databases, and effectively handle data through grammars similar to SQL.

• Data as a Resource: Since big data contains huge value, mastering big data means mastering resources. The analysis of the value chain of big data shows that its value comes from the data itself, technologies, and ideas, and that the core is the data resource. Without data resources, technologies and ideas cannot create value. The reorganization and integration of different datasets can create further value. From now on, enterprises that master big data resources may obtain huge benefits by renting out or assigning the rights to use their data.

• Big Data Promotes the Cross Fusion of Science: Big data not only promotes the comprehensive fusion of cloud computing, the Internet of Things, data centers, and mobile networks, etc., but also forces the cross fusion of many disciplines. The development of big data shall explore innovative technologies and methods for big data acquisition, storage, processing, mining, and information security, etc., based on information science, and examine the changes and impacts of big data on the production management, business operation, and decision making of modern enterprises from the management perspective. Moreover, applying big data to specific fields requires the participation of interdisciplinary talents.

• Visualization: In many human-computer interaction scenarios, the principle of What You See Is What You Get is followed, e.g., in text and image editors. In big data applications, mixed data may not be very useful for decision making. Only when analytical results are displayed in a user-friendly way may they be accepted and utilized by users. Reports, histograms, pie charts, and regression curves, etc., are frequently used to visualize results of data analysis. New presentation forms will appear in the future; e.g., Microsoft Renlifang, a social search engine, utilizes relational diagrams to express interpersonal relationships.

• Data-Oriented: It is well known that programs consist of data structures and algorithms. In the history of program design, the role of data has become increasingly significant. In the small-scale data era, in which logic was more complex than data, program design was mainly process-oriented. As business data became more complex, object-oriented design methods were developed. Nowadays, the complexity of business data has far surpassed business logic, and programs are gradually transforming from being algorithm-intensive to data-intensive. It is anticipated that data-oriented program design methods will emerge, which will have far-reaching influence on the development of IT in software engineering, architecture, and model design, among others.

• Big Data Causes a Revolution of Thinking: In the big data era, data collection, acquisition, and analysis are accomplished more rapidly, and the massive data will profoundly influence our ways of thinking. In [2], the authors summarize the thinking revolution caused by big data as follows:

– During data analysis, we will try to utilize all the data rather than only analyzing a small sample.

– Compared with accurate data, we are willing to accept numerous and complicated data.

– We shall pay greater attention to correlations between things rather than exploring causal relationships.

– The simple algorithms of big data are more effective than the complex algorithms of small data.

– Analytical results of big data will reduce hasty and subjective factors during decision making and data scientists will replace “experts.”

• Managing Large-Scale FlowTables for Software-Defined Networking with Big Data Techniques: In the past few years, software-defined networking (SDN) has been the buzz of the networking world. It was originally proposed, under the name OpenFlow, to accelerate networking innovation in legacy campus networks, which comprise a number of closed networking boxes with diverse functionalities (such as routing, switching, and firewalls). Plenty of emerging networking problems have appeared in the era when cloud computing meets big data applications, and SDN seems extremely suitable for solving them in terms of network efficiency, scalability, flexibility, agility, and operation and maintenance complexity. In the OpenFlow specification, one of the most important concepts is the FlowTable, which includes a large number of rules for processing network packets. Obviously, it is a challenge to manage large-scale FlowTables. A promising way is to implement SDN with big data techniques, to effectively store, process, and utilize FlowTables and to increase the speed of rule lookup (a simplified FlowTable lookup is sketched after this list).

• 5G Wireless Networks: Supporting Technology for Mobile Big Data: With the emergence of cloud computing as an important information technology in support of virtualized services, it becomes promising to design 5G wireless networks by exploiting recent advances in network function virtualization and by benefiting from the advanced virtualization techniques of cloud computing to build efficient and scalable networking infrastructures. Researchers have been designing new architectures for elastically composing and operating a virtual end-to-end network platform on demand on top of fragmented physical infrastructures provided by federated clouds. SDN techniques are seen as promising enablers for this vision of the carrier cloud, and they will likely play a crucial role in the design of 5G wireless networks.

Due to the huge explosion of mobile data in a hyperconnected society, "Can Big Data go Mobile?" has become a challenging problem to be addressed by 5G technologies. Though 5G wireless provides the possibility of enabling the mobility of big data, there are various research problems on the way to realizing this brand-new networking system, such as the 5G network architecture, SDN and network virtualization techniques for enabling 5G, resource allocation algorithms in 5G, and 5G-related control protocols and optimization techniques. New business models beyond IaaS, PaaS, and SaaS, such as Network as a Service (NaaS) and Knowledge as a Service (KaaS), are expected to emerge in an energy-efficient, flexible, connectivity-scalable, and secure manner. In particular, Big Data as a Service (BDaaS) or Big Data Analysis as a Service (BDAaaS) could emerge, facilitating the efficient storage and analysis of the exploding mobile data.
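Returning to the FlowTable item above: a FlowTable is essentially a prioritized set of rules matched against packet header fields. The sketch below is a simplified, purely illustrative model (the field names and wildcard convention are assumptions, not the OpenFlow specification); it shows the basic priority-based lookup that big data techniques would need to accelerate, e.g. by indexing rules, once tables grow to millions of entries.

```python
from dataclasses import dataclass
from typing import Dict, Optional

WILDCARD = "*"  # illustrative convention: '*' matches any field value

@dataclass
class FlowRule:
    priority: int
    match: Dict[str, str]   # e.g. {"dst_ip": "10.1.2.3", "tcp_dst": "80"}
    action: str             # e.g. "forward:port2" or "drop"

def matches(rule: FlowRule, packet: Dict[str, str]) -> bool:
    """A rule matches if every non-wildcard field equals the packet's field."""
    return all(v == WILDCARD or packet.get(k) == v for k, v in rule.match.items())

def lookup(table, packet) -> Optional[str]:
    """Return the action of the highest-priority matching rule (naive linear scan)."""
    best = max((r for r in table if matches(r, packet)),
               key=lambda r: r.priority, default=None)
    return best.action if best else None

table = [
    FlowRule(10, {"dst_ip": "10.1.2.3", "tcp_dst": "80"}, "forward:port2"),
    FlowRule(5,  {"dst_ip": "10.1.2.3", "tcp_dst": WILDCARD}, "forward:port1"),
    FlowRule(1,  {"dst_ip": WILDCARD, "tcp_dst": WILDCARD}, "drop"),
]
print(lookup(table, {"dst_ip": "10.1.2.3", "tcp_dst": "80"}))  # forward:port2
print(lookup(table, {"dst_ip": "10.9.9.9", "tcp_dst": "22"}))  # drop
```

The linear scan above is exactly what becomes untenable at scale; storing and indexing the rules with distributed big data techniques is one way to keep lookup fast.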

Throughout the history of human society, the demands and willingness of human beings have always been the driving forces of scientific and technological progress. In the big data era, big data may provide reference answers for human beings making decisions, through mining and analytical processing, but it cannot replace human thinking. It is human thinking that promotes the widespread utilization of big data. Big data is more like an extendable and expandable human brain than a substitute for the human brain. With the emergence of the Internet of Things, the development of mobile sensing technology, and the progress of data acquisition technology, people are not only users and consumers of big data, but also its producers and participants. Social relation sensing, crowdsourcing, analysis of big data in SNS, and other applications closely related to human activities and based on big data will receive increasing attention and will certainly cause enormous changes in social activities in the future society.


Big Data Related Technologies, Challenges and Future Prospects

 

Big Data Applications

Abstract In the previous chapter, we examined big data analysis, which is the final and most important phase of the value chain of big data. Big data analysis can provide useful value via judgments, recommendations, support, or decisions. However, data analysis involves a wide range of applications, which frequently change and are extremely complex. In this chapter, the evolution of data sources is reviewed. Then, six of the most important data analysis fields are examined, including structured data analysis, text analysis, website analysis, multimedia analysis, network analysis, and mobile analysis. The chapter concludes with a discussion of several key application fields of big data.

 

6.1 Application Evolution

Recently, big data and big data analysis have been proposed to describe the datasets and analytical technologies of large-scale complex programs, which need to be analyzed with advanced analytical methods. As a matter of fact, data-driven applications have emerged over the past decades. For example, as early as the 1990s, business intelligence became a prevailing technology for business applications, and network search engines based on massive data mining emerged in the early twenty-first century. Some potential and influential applications from different fields and their data and analysis characteristics are discussed below.

• Evolution of Commercial Applications: The earliest business data was generally structured data, collected by companies from legacy systems and stored in RDBMSs. The analytical technologies used in such systems, which prevailed in the 1990s, were intuitive and simple, e.g., reports, dashboards, ad-hoc queries, search-based business intelligence, online transaction processing, interactive visualization, scorecards, predictive modeling, and data mining. Since the beginning of the twenty-first century, networks and websites have provided organizations with a unique opportunity to maintain an online presence and interact directly with customers. Abundant product and customer information, including click-stream data logs and user behavior, can be acquired from websites. Product layout optimization, customer trade analysis, product suggestions, and market structure analysis can be conducted with text analysis and website mining technologies. As reported, the quantity of mobile phones and tablet PCs first surpassed that of laptops and PCs in 2011. Mobile phones and the Internet of Things based on sensors are opening a new generation of innovative applications that call for greater capacity for location sensing, people-oriented services, and context-aware operation.

• Evolution of Network Applications: The early Internet mainly provided email and webpage services. Text analysis, data mining, and webpage analysis technologies have been applied to mine email contents and build search engines. Nowadays, most applications are web-based, regardless of their application field and design goals. Network data accounts for a major percentage of the global data volume. The Web has become a common platform of interconnected pages, full of various kinds of data, such as text, images, videos, pictures, and interactive content. Therefore, plenty of advanced technologies for semi-structured or unstructured data emerged at the right moment. For example, image analysis can extract useful information from pictures, e.g., face recognition. Multimedia analysis technologies can be applied to automated video surveillance systems for business, law enforcement, and military applications. Since 2004, online social media, such as Internet forums, online communities, blogs, social networking services, and social multimedia websites, have provided users with great opportunities to create, upload, and share user-generated content. Different user groups may search for daily news and celebrity news, publish their social and political opinions, and provide different applications with timely feedback.

• Evolution of Scientific Applications: Scientific research in many fields, such as astrophysics, oceanography, genomics, and environmental research, is acquiring massive data with high-throughput sensors and instruments. The U.S. National Science Foundation (NSF) has recently announced the BIGDATA Research Initiative to promote research efforts to extract knowledge and insights from large and complex collections of digital data. Some scientific disciplines have developed massive data platforms and obtained useful outcomes. For example, in biology, iPlant applies network infrastructure, physical computing resources, a coordination environment, virtual machine resources, and interoperable analysis software and data services to assist researchers, educators, and students in enriching all plant sciences. The iPlant datasets are highly varied in form, including specification or reference data, experimental data, analog or model data, observational data, and other derived data.

 

6.2 Big Data Analysis Fields

Data analysis research can be divided into six key technical fields: structured data analysis, text data analysis, website data analysis, multimedia data analysis, network data analysis, and mobile data analysis. This classification aims to emphasize data characteristics, although some of the fields may utilize similar technologies. Since data analysis has a broad scope and is not easy to cover comprehensively, the following discussions focus on the key problems and technologies in each field.

 

6.2.1 Structured Data Analysis

Business applications and scientific research may generate massive structured data, of which the management and analysis rely on mature commercialized technologies, such as RDBMS, data warehouse, OLAP, and BPM (Business Process Management). Data analysis is mainly based on data mining and statistical analysis, both of which have been well studied over the past 30 years.

Data analysis is still a very active research field, and new application demands drive the development of new methods. Statistical machine learning, based on exact mathematical models and powerful algorithms, has been applied to anomaly detection and energy control. By exploiting data characteristics, time and space mining can extract knowledge structures hidden in high-speed data flows and sensor data. Driven by privacy protection in e-commerce, e-government, and health care applications, privacy-preserving data mining is an emerging research field. Over the past decade, benefiting from the substantial popularization of event data and from new process discovery and consistency checking technologies, process mining is becoming a new research field, especially for process analysis with event data.

 

6.2.2 Text Data Analysis

 

The most common format of information storage is text, e.g., email communication, business documents, web pages, and social media. Therefore, text analysis is deemed to have more business-oriented potential than structured data mining. Generally, text analysis, also called text mining, is a process of extracting useful information and knowledge from unstructured text. Text mining is an interdisciplinary problem, involving information retrieval, machine learning, statistics, computational linguistics, and data mining in particular. Most text mining systems are based on text representation and natural language processing (NLP), with more focus on the latter.

Document representation and query processing are the foundation for developing the vector space model, the Boolean retrieval model, and the probabilistic retrieval model, which in turn constitute the foundation of search engines. Since the early 1990s, search engines have evolved into mature business systems, which generally consist of rapid distributed crawling, effective inverted indexing, webpage ranking based on inlinks, and search log analysis; a minimal inverted-index and vector-space sketch follows.
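As a minimal sketch of the vector space model and inverted index mentioned above (not a production search engine), the following code builds an inverted index over a toy corpus, weights terms with a smoothed TF-IDF, and ranks documents by cosine similarity to a query. The corpus and weighting details are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

docs = {
    "d1": "big data needs new storage and analysis methods",
    "d2": "text mining extracts knowledge from unstructured text",
    "d3": "search engines rank documents for a text query",
}
tokenized = {d: text.lower().split() for d, text in docs.items()}

# Inverted index: term -> set of documents containing it
index = defaultdict(set)
for d, terms in tokenized.items():
    for t in terms:
        index[t].add(d)

def tfidf_vector(terms):
    """Vector space model: weight each term by tf * smoothed idf."""
    tf = Counter(terms)
    n_docs = len(docs)
    return {t: tf[t] * math.log(n_docs / (1 + len(index.get(t, ())))) for t in tf}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def search(query):
    qv = tfidf_vector(query.lower().split())
    # Use the inverted index to restrict scoring to candidate documents only.
    candidates = set().union(*(index.get(t, set()) for t in qv)) if qv else set()
    return sorted(((cosine(qv, tfidf_vector(tokenized[d])), d) for d in candidates),
                  reverse=True)

print(search("text mining"))  # d2 should rank first
```

The inverted index is what keeps query processing fast: only documents containing at least one query term are ever scored.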

NLP can enable computers to analyze, interpret, and even generate text. Some common NLP methods are lexical acquisition, word sense disambiguation, part-of-speech tagging, and probabilistic context-free grammars. Some NLP-based technologies have been applied to text mining, including information extraction, topic models, text summarization, classification, clustering, question answering, and opinion mining. Information extraction automatically extracts specific structured information from texts. Named entity recognition (NER), as a subtask of information extraction, aims to recognize atomic entities in texts belonging to predefined categories (e.g., persons, places, and organizations); it has recently been successfully applied to the development of news analysis and medical applications. Topic models are built on the idea that "documents are composed of topics, and topics are probability distributions over vocabulary." Topic models are generative models for documents, specifying a probabilistic procedure by which documents are generated.

Presently, various probabilistic topic models have been used to analyze document contents and lexical meanings. Text summarization generates a reduced summary or extract from one or several input text files. Text summarization may be classified into concrete (extractive) summarization and abstract (abstractive) summarization. Concrete summarization selects important sentences and paragraphs from the source documents and concatenates them into a shorter form; a simple scoring sketch follows. Abstract summarization interprets the source texts and, using linguistic methods, represents them with a few words and phrases.
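A minimal sketch of concrete (extractive) summarization under simple assumptions: sentences are scored by the corpus frequency of their content words, and the top-scoring sentences are kept in their original order. Real systems use far richer features; the stop-word list and example text are illustrative only.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "is", "in", "are", "for", "it"}

def extractive_summary(text, n_sentences=2):
    """Score each sentence by summed word frequencies; keep the top n in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower())
                   if w not in STOPWORDS)

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return " ".join(s for s in sentences if s in top)  # preserve original order

doc = ("Big data analysis creates value. Text summarization reduces long documents. "
       "Summarization selects the most informative sentences from documents. "
       "The weather was pleasant yesterday.")
print(extractive_summary(doc))
```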

Text classification recognizes the probable topics of documents by assigning documents to predefined topics. Text classification based on new graph representations and graph mining has recently attracted considerable interest. Text clustering is used to group documents with similar topics; unlike text classification, it does not rely on predefined topics, and documents may appear in multiple subtopics. Generally, clustering algorithms from data mining can be utilized to compute the similarities of documents; it has also been shown that structural relationship information, e.g., from Wikipedia, may be exploited to improve clustering performance. The question answering system is designed to search for the optimal answer to a given question. It involves technologies for question analysis, source retrieval, answer extraction, and answer presentation. Question answering systems may be applied in many fields, including education, websites, healthcare, and national defense. Opinion mining, similar to sentiment analysis, refers to computing technologies for identifying and extracting subjective information from news, assessments, comments, and other user-generated content. It provides opportunities to understand the opinions of the public and of customers on social events, political movements, business strategies, marketing activities, and product preferences.

 

6.2.3 Web Data Analysis

Over the past decade, we have witnessed the explosive growth of Internet information. Web analysis has emerged as an active research field. Web analysis aims to automatically retrieve, extract, and evaluate information from Web documents and services so as to discover useful knowledge. Web analysis is related to several research fields, including database, information retrieval, NLP, and text mining. According to the different parts of the Web to be mined, we classify Web analysis into three related fields: Web content mining, Web structure mining, and Web usage mining.

Web content mining is the process to discover useful knowledge in Web pages, which generally involve several types of data, such as text, image, audio, video, code, metadata, and hyperlink.

The research on image, audio, and video mining has recently been called multimedia analysis, which will be discussed in Sect. 6.2.4. Since most Web content data is unstructured text data, the research on Web data analysis mainly centers around text and hypertext. Text mining is discussed in Sect. 6.2.2, while Hypertext mining involves mining semi-structured HTML files that contain hyperlinks.

Supervised learning and classification play important roles in hyperlink mining, e.g., email, newsgroup management, and Web catalogue maintenance. Web content mining can be conducted with two methods: the information retrieval method and the database method. Information retrieval mainly assists in or improves information lookup, or filters user information according to deductions or configuration documents. The database method aims to simulate and integrate data in Web, so as to conduct more complex queries than searches based on key words.

Web structure mining involves models for discovering Web link structures. Here, the structure refers to the schematic diagram of links within a website or among multiple websites. Models are built on topological structures provided by hyperlinks, with or without link descriptions. Such models reveal the similarities and correlations among different websites and are used to classify website pages. PageRank and CLEVER make full use of such models to look up related website pages; a minimal PageRank sketch follows this paragraph. Topic-oriented crawling is another successful use of such models. A topic-oriented crawler selectively discovers pages related to a predefined set of topics: rather than collecting and indexing all accessible webpage files to answer all possible ad-hoc queries, it analyzes its crawling boundary to find the links most likely to be relevant and avoids irrelevant regions of the Web. In this way, a great quantity of hardware and network resources may be saved and the crawl can be kept up to date more easily.
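Since PageRank is mentioned above, here is a minimal power-iteration sketch over a toy link graph. The damping factor 0.85 is the commonly cited default, and the graph itself is invented for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to. Returns approximate PageRank per page."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if not targets:                     # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(sorted(pagerank(web).items(), key=lambda kv: -kv[1]))  # C ranks highest
```

Pages with many inlinks from well-ranked pages (here C) accumulate the most rank, which is exactly the structural signal Web structure mining exploits.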

Web usage mining aims to mine the auxiliary data generated by Web dialogues or behaviors, whereas Web content mining and Web structure mining use the primary Web data. Web usage data includes access logs at Web servers, logs at proxy servers, browsers' history records, user profiles, registration data, user sessions or transactions, caches, user queries, bookmark data, mouse clicks and scrolls, and any other data generated through interaction with the Web. As Web services and Web 2.0 become mature and popular, Web usage data is gaining ever higher variety. Web usage mining plays key roles in personalization, e-commerce, network privacy/security, and other emerging fields. For example, collaborative recommender systems can personalize e-commerce by exploiting the different preferences of users.

 

6.2.4 Multimedia Data Analysis

Multimedia data (mainly images, audio, and video) has been growing at an amazing speed. Multimedia content analysis aims to extract related knowledge and understand the semantics contained in multimedia data. Because multimedia data is heterogeneous and most of it contains richer information than simple structured data or text data, information extraction is confronted with the huge challenge of the semantic gap in multimedia data. Research on multimedia analysis covers many disciplines. Some recent research priorities include multimedia summarization, multimedia annotation, multimedia indexing and retrieval, multimedia recommendation, and multimedia event detection, etc.

Audio summarization can be accomplished by simply extracting the prominent words or phrases from metadata or by synthesizing a new representation. Video summarization interprets the most important or representative video content as a sequence, and it can be static or dynamic. Static video summarization methods utilize a key frame sequence or context-sensitive key frames to represent a video. Such methods are simple and have been applied in many business applications (e.g., by Yahoo!, AltaVista, and Google), but their playback performance is poor. Dynamic summarization methods use a series of video clips to represent a video, configure low-level video functions, and take other smoothing measures to make the final summarization look more natural. In one study, the authors proposed a topic-oriented multimedia summarization system (TOMS) that can automatically summarize the important information in a video belonging to a certain topic area, based on a given set of features extracted from the video.

Multimedia annotation inserts labels that describe the contents of images and videos at both the syntactic and semantic levels. With the assistance of such labels, the management, summarization, and retrieval of multimedia data can be easily implemented. Since manual annotation is both time- and labor-intensive, automatic multimedia annotation without human intervention becomes highly appealing. The main challenge for automatic multimedia annotation is the semantic gap, i.e., the difference between low-level features and annotations. Although much progress has been made, the performance of existing automatic annotation methods still needs to be improved. Currently, many efforts are being made to combine manual and automatic multimedia annotation.

Multimedia indexing and retrieval involve describing, storing, and organizing multimedia information and assisting users to conveniently and quickly look up multimedia resources. Generally, multimedia indexing and retrieval include five procedures: structural analysis, feature extraction, data mining, classification and annotation, and query and retrieval. Structural analysis aims to segment a video into several semantic structural elements, including shot boundary detection, key frame extraction, and scene segmentation. Based on the result of structural analysis, the second procedure, feature extraction, mainly mines the features of the necessary key frames, objects, text, and movements, which are the foundation of video indexing and retrieval. Data mining, classification, and annotation utilize the extracted features to find the patterns of video content and assign videos to predefined categories so as to generate video indexes. Upon receiving a query, the system uses a similarity measurement method to look up candidate videos, and the retrieval result is refined through relevance feedback.

Multimedia recommendation aims to recommend specific multimedia content according to users' preferences. It has proven to be an effective approach for providing high-quality personalized services. Most existing recommendation systems can be classified into content-based systems and collaborative-filtering-based systems. Content-based methods identify features of the users or of the content in which the users are interested, and recommend other content with similar features. These methods rely purely on content similarity measurement, but most of them are limited by shallow content analysis and over-specialization. Collaborative-filtering-based methods identify groups with similar interests and recommend content to group members according to their behaviors; a minimal sketch is given below. Presently, hybrid methods are being introduced that integrate the advantages of both types of methods to improve recommendation quality.
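To illustrate the collaborative-filtering idea (as opposed to the content-based one), the following is a minimal user-based sketch: it finds users with similar rating behavior and scores unseen items by the similarity-weighted ratings of those neighbors. The user names and ratings are invented for illustration.

```python
import math

# Hypothetical user -> {item: rating} data
ratings = {
    "alice": {"video1": 5, "video2": 3, "video3": 4},
    "bob":   {"video1": 4, "video2": 3, "video4": 5},
    "carol": {"video2": 1, "video3": 2, "video5": 5},
}

def cosine_sim(a, b):
    """Cosine similarity between two users' rating vectors (over common items)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def recommend(user, k=2):
    """Score unseen items by similarity-weighted ratings of the k most similar users."""
    sims = sorted(((cosine_sim(ratings[user], r), u)
                   for u, r in ratings.items() if u != user), reverse=True)[:k]
    scores = {}
    for sim, other in sims:
        for item, rating in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(recommend("alice"))  # 'video4' (liked by the similar user bob) comes first
```

Note that the scoring uses only who rated what, never the content itself, which is the defining property of collaborative filtering.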

The U.S. NIST initiated the TREC Video Retrieval Evaluation for detecting the occurrence of an event in video clips based on an Event Kit, which contains text descriptions related to concepts and video examples. Research on video event detection is still in its infancy. Existing research on event detection mainly focuses on sports or news events, running or abnormal events in surveillance videos, and other similar events with repetitive patterns. In one work, the author proposed a new algorithm for detecting special multimedia events with only a few positive training examples.

 

6.2.5 Network Data Analysis

Network analysis has evolved from the initial quantitative analysis and sociological network analysis into the emerging online social network analysis of the beginning of the twenty-first century. Many prevailing online social networking services, including Twitter, Facebook, and LinkedIn, have become increasingly popular over the years. Such services generally include massive linked data and content data. The linked data is mainly in the form of graph structures describing the communications between two entities. The content data contains text, images, and other network multimedia data. The rich content of such networks brings both unprecedented challenges and opportunities to data analysis. From a data-centered perspective, the existing research on social networking service contexts can be classified into two categories: link-based structural analysis and content-based analysis.

The research on link-based structural analysis has focused on link prediction, community discovery, social network evolution, and social influence analysis, etc. An SNS may be visualized as a graph, in which every vertex corresponds to a user and edges correspond to the correlations among users. Since SNS are dynamic networks, new vertexes and edges are continually added to the graph. Link prediction predicts the possibility of a future connection between two vertexes. Many techniques can be used for link prediction, e.g., feature-based classification, probabilistic methods, and linear algebra. Feature-based classification selects a group of features for a vertex pair and utilizes existing link information to train binary classifiers that predict future links; a minimal neighborhood-based sketch follows this paragraph. Probabilistic methods build models for the connection probabilities among vertexes in the SNS. Linear algebra methods compute the similarity between two vertexes according to singular similarity matrices. A community is represented by a sub-graph in which the edges connecting vertexes within the sub-graph have high density, while the edges between two sub-graphs have much lower density.
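As a concrete illustration of the simplest neighborhood features used for link prediction, the sketch below ranks currently unlinked vertex pairs in a toy friendship graph by their number of common neighbors (a Jaccard variant is included as well). The graph and names are invented for illustration.

```python
from itertools import combinations

# Toy undirected SNS graph: user -> set of friends (kept symmetric)
graph = {
    "ann":  {"bob", "cara", "dan"},
    "bob":  {"ann", "cara"},
    "cara": {"ann", "bob", "dan"},
    "dan":  {"ann", "cara", "eve"},
    "eve":  {"dan"},
}

def common_neighbors(u, v):
    """Number of shared friends: the simplest link-prediction score."""
    return len(graph[u] & graph[v])

def jaccard(u, v):
    """Common neighbors normalized by the size of the combined neighborhood."""
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0

def predict_links(score=common_neighbors):
    """Rank currently unlinked pairs by a neighborhood-based similarity score."""
    candidates = [(u, v) for u, v in combinations(sorted(graph), 2)
                  if v not in graph[u]]
    return sorted(((score(u, v), u, v) for u, v in candidates), reverse=True)

print(predict_links())  # ('bob', 'dan') shares two friends and ranks first
```

In practice such scores are used as features for the binary classifiers mentioned above rather than as standalone predictors.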

Many methods for community detection have been proposed and studied, most of which are topology-based and rely on objective functions that capture the notion of community structure. Du et al. utilized the property of overlapping communities in real life to propose a more effective large-scale SNS community detection method. The research on SNS evolution aims to find laws and deductive models that interpret network evolution. Some empirical studies have found that proximity bias, geographical limitations, and other factors play important roles in SNS evolution, and some generative methods have been proposed to assist network and system design.

Social influence refers to the case where individuals change their behavior under the influence of others. The strength of social influence depends on the relations among individuals, network distances, temporal effects, and the characteristics of networks and individuals, etc. Marketing, advertising, recommendation, and other applications can benefit from social influence by qualitatively and quantitatively measuring the influence of individuals on others. Generally, if the propagation of content across SNS is also considered, the performance of link-based structural analysis may be further improved.

Benefiting from the revolutionary progress of Web 2.0, user-generated content is growing explosively in SNS. Such content is produced with various technologies, including blogs, microblogs, opinion mining, photo and video sharing, social bookmarking, social networking sites, social news, and wikis. Content-based analysis in SNS is also known as social media analysis. Social media includes text, multimedia, positioning, and comments. Nearly all research topics related to structural analysis, text analysis, and multimedia analysis can be interpreted as social media analysis, but social media analysis faces unprecedented challenges. First, massive and continually growing social media data should be automatically analyzed within a reasonable time. Second, social media data contains much noise; for example, the blogosphere contains a large number of spam blogs, and Twitter contains many trivial tweets. Third, SNS are dynamic networks that are frequently and quickly changed and updated.

Since social media is closely related to SNS, social media analysis is inevitably influenced by SNS analysis, which covers the text analysis of SNS contexts, the characteristics of social and network structures, and multimedia analysis. The existing research on social media analysis is still in its infancy. Applications of SNS text analysis include keyword search, classification, clustering, and transfer learning in heterogeneous networks. Keyword search tries to use content and link behaviors jointly for search; the motivation is that text files containing similar keywords are generally connected to each other. In classification, assuming some nodes of the SNS are provided with labels, the remaining nodes are classified accordingly. In clustering, researchers aim to determine sets of nodes with similar content and group them together. Considering that SNS contain massive information about different interlinked objects, e.g., articles, labels, images, and videos, transfer learning in heterogeneous networks aims to transfer knowledge across different kinds of links.

Multimedia datasets in SNS are organized in a structured form, which brings rich information, e.g., semantic ontology, social interaction, community media, geographical maps, and multimedia opinions. Structural multimedia analysis in SNS is also called multimedia information network analysis. The link structure of a multimedia information network is mainly a logical structure, which is of vital importance to the multimedia in such networks. The logical connection structures in multimedia information networks can be classified into four types: semantic ontology, community media, personal photo albums, and geographical positions.

 

6.2.6 Mobile Traffic Analysis

With the rapid growth of mobile computing, mobile terminals and applications worldwide are growing rapidly. By April 2013, the Android app market provided more than 650,000 applications, covering nearly all categories. By the end of 2012, monthly mobile data traffic had reached 885 PB [51]. The massive data and abundant applications open a broad research field for mobile analysis but also bring a few challenges. As a whole, mobile data has unique characteristics, e.g., mobile sensing, movement flexibility, noise, and a large amount of redundancy. Recently, new research on mobile analysis has started in different fields. Because the research on mobile analysis is still far from mature, we only introduce some recent and representative analysis applications in this section.

With the growing number of mobile users and improved device performance, mobile phones are now useful for building and maintaining communities, such as communities based on geographical locations and communities based on shared cultures and interests, e.g., the latest WeChat. Traditional network communities or SNS communities lack online interaction among members, and such communities are active only when members are sitting in front of computers. By contrast, mobile phones can support rich interaction anytime and anywhere. Mobile communities are defined as groups of individuals with the same interests or concerns (e.g., health, safety, and entertainment) who gather on networks, agree on a common goal, decide on measures to achieve the goal through consultation, and start to implement their plan. In one work, the authors proposed a qualitative model of a mobile community. It is now widely believed that mobile community applications will greatly promote the development of the mobile industry.

RFID labels are used to identify, locate, track, and supervise physical objects in a cost-effective manner, and RFID is widely applied to inventory management and logistics. However, RFID brings many challenges to data analysis: (a) RFID data is very noisy and redundant; (b) RFID data is instantaneous streaming data with a huge volume and limited processing time. We can track objects and monitor system status by deducing original events from the semantics of RFID data, including location, cluster, and time. In addition, we may express application logic as complex events and then detect such complex events, so as to realize more advanced business applications. For example, one work discussed a shoplifting case as an advanced complex event; a minimal sketch of such detection follows.
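To illustrate the complex-event idea with the shoplifting example mentioned above, the sketch below scans a simplified RFID read stream and flags tags that reach the exit after being read at a shelf without an intervening checkout read. The reader locations, tag IDs, and stream format are assumptions made for illustration.

```python
# Hypothetical RFID read stream: (timestamp, tag_id, reader_location)
stream = [
    (1, "tag42", "shelf"),
    (2, "tag99", "shelf"),
    (3, "tag99", "checkout"),
    (4, "tag99", "exit"),
    (5, "tag42", "exit"),        # tag42 reaches the exit with no checkout read
]

def detect_shoplifting(reads):
    """Complex event: shelf -> exit with no checkout read in between for the same tag."""
    last_state = {}              # tag -> last meaningful location observed
    alerts = []
    for ts, tag, location in sorted(reads):   # process in timestamp order
        if location == "checkout":
            last_state[tag] = "checkout"
        elif location == "shelf":
            last_state.setdefault(tag, "shelf")
        elif location == "exit" and last_state.get(tag) == "shelf":
            alerts.append((ts, tag))
    return alerts

print(detect_shoplifting(stream))  # [(5, 'tag42')]
```

The pattern (a sequence of primitive reads composed into one higher-level event) is what complex event processing engines evaluate continuously over the noisy, high-volume RFID stream.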

Recently, progress in wireless sensors, mobile communication technology, and stream processing has enabled people to build body area networks for real-time monitoring of people's health. Generally, medical data from different sensors has different characteristics, e.g., heterogeneous attribute sets, different temporal and spatial relations, and different physiological relations. In addition, such datasets involve privacy and safety protection. Garg and others introduced a multi-modal analysis mechanism over raw transport data for real-time health monitoring. For the circumstance in which only highly aggregated health-related characteristics are available, Park et al. examined approaches to better utilize such aggregated information to strengthen the data at all levels: aggregate statistics over partitions of the data are used to recognize clusters and to impute characteristic values at a finer granularity, and the imputed characteristics are then used in predictive modeling to improve performance.

Researchers from Gjøvik University College in Norway and Derawi Biometrics jointly developed an application for smartphones which analyzes people's gait as they walk and uses it to unlock a security system. Meanwhile, Robert Delano and Brian Parise from the Georgia Institute of Technology developed an application called iTrem, which monitors body tremors with the built-in accelerometer of a mobile phone, so as to cope with Parkinson's disease and other nervous system diseases. Many other mobile applications aim to acquire information through mobile devices, regardless of how useful such information will turn out to be for future data analysis.

 

6.3 Key Applications

6.3.1 Application of Big Data in Enterprises

At present, big data mainly comes from and is used in enterprises, while BI and OLAP can be regarded as the predecessors of big data applications. The application of big data in enterprises can enhance their production efficiency and competitiveness in many aspects. In marketing, with correlation analysis of big data, enterprises can more accurately predict consumer behavior and discover new business modes. In sales planning, by comparing massive data, enterprises can optimize their commodity prices. In operations, enterprises can improve their operational efficiency and satisfaction, optimize the input of labor, accurately forecast personnel allocation requirements, avoid excess production capacity, and reduce labor costs. In supply chains, using big data, enterprises can conduct inventory optimization, logistics optimization, and supplier coordination, etc., to close the gap between supply and demand, control budgets, and improve services.

In finance, big data applications in enterprises have developed rapidly. For example, China Merchants Bank (CMB) utilizes data analysis to recognize that activities such as "multi-times score accumulation" and "score exchange in shops" are effective for attracting quality customers. By building a customer attrition early-warning model, the bank can sell high-yield financial products to the 20 % of customers most at risk of leaving, so as to retain them. As a result, the loss ratios of customers with Gold Cards and Sunflower Cards have been reduced by 15 % and 7 %, respectively. By analyzing customers' transaction records, potential small and micro corporate customers can be effectively identified. By utilizing remote banking and a cloud referral platform to implement cross-selling, considerable performance gains were achieved.

The most classic application is in e-commerce. Tens of thousands of transactions are conducted on Taobao every day, and the corresponding transaction times, commodity prices, and purchase quantities are recorded. More importantly, such information matches the age, gender, address, and even the hobbies and interests of buyers and sellers. Data Cube of Taobao is a big data application on the Taobao platform, through which merchants can be aware of the macroscopic industrial status of the Taobao platform, the market conditions of their brands, and consumers' behaviors, etc., and accordingly make production and inventory decisions. Meanwhile, more consumers can purchase their favorite commodities at more favorable prices.

The credit loan service of Alibaba automatically analyzes and judges whether to lend to enterprises based on the acquired enterprise transaction data by virtue of big data technologies, with no manual intervention in the entire process. It is disclosed that, so far, Alibaba has lent more than RMB 30 billion Yuan, with a bad loan rate of only about 0.3 %, which is much lower than those of other commercial banks.

6.3.2 Application of IoT Based Big Data

The Internet of Things is not only an important source of big data, but also one of the main markets for big data applications. In the Internet of Things, every object in the real world may be both a producer and a consumer of data and, because of the high variety of objects, the applications of the Internet of Things also evolve endlessly.

Logistics enterprises have perhaps the deepest experience with IoT-based big data applications. UPS trucks are installed with sensors, wireless adapters, and GPS, so headquarters can track truck positions and prevent engine failures. Meanwhile, this equipment also helps UPS supervise and manage its employees and optimize delivery routes. The optimal delivery routes specified by UPS for its trucks are derived from past driving experience. In 2011, UPS drivers drove nearly 48.28 million fewer kilometers.

The smart city is a hot research area based on the application of Internet of Things data, and the U.S. Miami-Dade County is an example. The smart city project, a cooperation between Miami-Dade County in Florida and IBM, closely connects 35 types of key county government departments and the City of Miami, and helps government leaders obtain better information support when making decisions about managing water resources, reducing traffic jams, and improving public safety. IBM provides Dade County with a smart dashboard application based on in-depth analysis in the cloud, so as to help county government departments with coordinated and visualized management. The smart city application brings benefits in many aspects for Dade County. For example, the Department of Park Management of Dade County saved one million USD in water bills in one year by promptly identifying and fixing water pipes that were running and leaking.

 

6.3.3 Application of Online Social Network-Oriented Big Data

An online SNS is a social structure constituted by social individuals and the connections among them, based on an information network. Big data of online SNS mainly comes from instant messaging, online social networking, microblogs, and shared spaces, etc. Since the big data of online SNS represents various user activities, the analysis of such data has received much attention. The analysis of big data of online SNS uses computational methods to understand relations in human society, drawing on theories and methods from mathematics, informatics, sociology, and management science, etc., along three dimensions: network structure, group interaction, and information spreading. Applications of big data of online SNS include network public opinion analysis, network intelligence collection and analysis, socialized marketing, government decision-making support, and online education, etc. Figure 6.1 illustrates the technical framework of the application of big data of online SNS. Classic applications are introduced in the following; they mainly mine and analyze content information and structural information to acquire value.

• Content-Based Applications: Language and text are the two most important forms of representation in SNS. Through the analysis of language and text, user preferences, emotions, interests, and demands, etc. may be revealed.

• Structure-Based Applications: In an SNS with users as nodes, social relations, interests, and hobbies, etc. aggregate users into clustered structures. Such a structure, with close relations among internal individuals but loose external relations, is also called a community. Community-based analysis is of vital importance for improving information propagation and for research on interpersonal relations.

The U.S. Santa Cruz Police Department experimented with applying data to predictive analysis. By analyzing SNS data, the police department can discover crime trends and crime patterns, and even predict the crime rates in major regions.

In April 2013, Wolfram Alpha, a U.S. computing and search engine company, studied the laws of users' social behavior by analyzing social data of more than one million American Facebook users. According to the analysis, most Facebook users fall in love in their early 20s, get engaged at about 27, get married at about 30, and experience only slow changes in their marriage relationships between the ages of 30 and 60. Such research results are highly consistent with U.S. demographic census data.

Global Pulse conducted a research project that revealed some laws of social and economic activities using SNS data. The project utilized publicly available Twitter messages in English, Japanese, and Indonesian from July 2010 to October 2011 to analyze topics related to food, fuel, housing, and loans, with the goal of better understanding public behavior and concerns. The project analyzed SNS big data from several aspects: predicting the occurrence of abnormal events by detecting sharp growth or drops in the volume of topics; observing weekly and monthly trends of dialogs on Twitter; developing models for how the level of attention on specific topics varies over time; understanding shifts in user behavior or interest by comparing ratios of different sub-topics; and predicting trends with external indicators involved in Twitter dialogs. As a classic example, by analyzing topics related to the rice price on Twitter, the project discovered that the rice price follows the food price inflation reported in Indonesia's official statistics (Fig. 6.2). A minimal sketch of the spike-detection idea follows.
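
As a rough illustration of the first aspect only, the Python sketch below flags a sharp rise in daily topic counts using a simple z-score rule; the counts and the threshold are hypothetical, and production systems would use far more robust seasonal models.

# Flag a suspicious spike in the latest daily count of a topic.
from statistics import mean, stdev

daily_counts = [220, 210, 230, 225, 215, 400]   # last value is a suspect spike (made-up data)

def is_spike(counts, threshold=3.0):
    history, latest = counts[:-1], counts[-1]
    mu, sigma = mean(history), stdev(history)
    # A spike is a latest count far above the historical mean, in standard deviations.
    return sigma > 0 and (latest - mu) / sigma > threshold

print(is_spike(daily_counts))   # True
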

Generally speaking, the application of online SNS big data can help us better understand people’s behavior and grasp the laws of social and economic activities from the following three aspects:

• Early Warning: to rapidly cope with crises by detecting abnormalities in the usage of electronic devices and services.

• Real-Time Monitoring: to provide accurate information for the formulation of policies and plans by monitoring the current behavior, emotions, and preferences of users.

• Real-Time Feedback: to acquire group feedback on social activities based on real-time monitoring.

The application of big data of online SNS involves three core technical problems:

• Data Model: Most traditional SNS data models are based on static modes and specific analytical algorithms, and are not amenable to efficient computation on data at the PB scale and beyond. On the other hand, SNS analysis usually involves multi-dimensional, complex correlation analysis on dynamic data. New theories and models need to be investigated to bridge this gap.

• Data Storage and Management: Existing Internet-based storage management methods mainly support big data storage and rapid query. However, they do not effectively support the analytical computation of online SNS big data, which features high correlation, dynamic variability, and multi-dimensional evolution, etc. Therefore, new storage and management methods need to be developed.

• Data Analysis: Existing analytical methods for SNS big data are mainly based on single-dimensional attributes, with insufficient accuracy. On the other hand, SNS analysis, such as topic evolution, group interaction, and public emotion drifting, etc., usually involves complex correlation analysis from the perspectives of structure, group, and information. Basic theories and methods are needed to support complex, correlated, multi-dimensional, large-scale, dynamic data.

 

6.3.4 Applications of Healthcare and Medical Big Data

Medical data is growing continuously and rapidly, and contains abundant and diverse information of value. Big data has great potential for effectively storing, processing, querying, and analyzing medical data. The application of medical big data will profoundly influence human health.

For example, Aetna Life Insurance Company selected 102 patients from a pool of 1,000 patients to complete an experiment aimed at predicting the recovery of patients with metabolic syndrome. In an independent experiment, it scanned 600,000 laboratory test results and 180,000 claims through a series of metabolic syndrome test results of patients over three consecutive years. In addition, it summarized the final result into a highly personalized treatment plan to assess the risk factors and main treatment options for patients. This way, doctors may reduce morbidity by 50 % in the next 10 years by prescribing statins and helping patients lose five pounds, or by suggesting that patients reduce the total triglycerides in their bodies if their sugar content is over 20 %.

The Mount Sinai Medical Center in the U.S. uses technologies from Ayasdi, a big data company, to analyze all the genetic sequences of Escherichia coli, including over one million DNA variants, to understand why bacterial strains resist antibiotics. Ayasdi’s technology uses topological data analysis, a relatively new mathematical research method, to understand data characteristics. HealthVault of Microsoft, launched in 2007, is an excellent application of medical big data. Its goal is to manage individual health information from individual and family medical devices. At present, health information can be entered and uploaded with mobile smart devices and imported into individual medical records by a third-party agency. In addition, it can be integrated with third-party applications through a software development kit (SDK) and open interfaces.

 

6.3.5 Collective Intelligence

With the rapid development of wireless communication and sensor technologies, mobile phones and tablet computers have integrated more and more sensors, with increasingly stronger computing and sensing capacities. As a result, crowd sensing is moving to center stage in mobile computing. In crowd sensing, a large number of ordinary users use mobile devices as basic sensing units and coordinate over mobile networks to distribute sensing tasks and to collect and utilize the sensed data, with the goal of completing large-scale and complex social sensing tasks. In crowd sensing, participants who complete complex sensing tasks do not need professional skills. Crowd sensing modes represented by Crowdsourcing have been successfully applied to geotagged photography, positioning and navigation, urban road traffic sensing, market forecasting, opinion mining, and other labor-intensive applications.

Crowdsourcing, a new approach to problem solving, takes a large number of ordinary users as its foundation and distributes tasks in a free and voluntary manner. Crowdsourcing is useful for labor-intensive applications such as picture tagging, language translation, and speech recognition. The main idea of Crowdsourcing is to distribute tasks to general users so as to complete tasks that individual users could not, or would not expect to, complete on their own. With no need to deliberately deploy sensing modules or employ professionals, Crowdsourcing can broaden the sensing scope of a sensing system to the city scale and even beyond.

As a matter of fact, Crowdsourcing had been applied by many companies before the emergence of big data. For example, P & G, BMW, and Audi improved their R & D and design capacities by means of Crowdsourcing. In the big data era, Spatial Crowdsourcing has become a hot topic. The operational framework of Spatial Crowdsourcing is as follows. A user requests services and resources related to a specified location. Mobile users who are willing to participate in the task then move to the specified location to acquire the related data (such as video, audio, or pictures). Finally, the acquired data is sent to the service requester. With the rapid growth in the use of mobile devices and the increasingly complex functions they provide, it can be forecast that Spatial Crowdsourcing will become more prevalent than traditional Crowdsourcing platforms, e.g., Amazon Turk and Crowdflower. A minimal task-assignment sketch is given below.
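
As a rough illustration of the assignment step in Spatial Crowdsourcing, the Python sketch below picks the nearest willing worker for a location-bound task; the coordinates and worker identifiers are made up, and real platforms also weigh incentives, reputation, and deadlines.

# Assign a spatial task to the nearest willing worker (toy example).
from math import hypot

workers = {                      # worker id -> (x, y) position, hypothetical
    "w1": (2.0, 3.0),
    "w2": (10.0, 1.0),
    "w3": (4.5, 4.0),
}

def assign(task_location, workers):
    # Choose the worker with the smallest straight-line distance to the task.
    return min(workers, key=lambda w: hypot(workers[w][0] - task_location[0],
                                            workers[w][1] - task_location[1]))

print(assign((5.0, 5.0), workers))   # 'w3'
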

 

6.3.6 Smart Grid

Smart Grid is the next-generation power grid, constituted by traditional energy networks integrated with computation, communication, and control for optimized generation, supply, and consumption of electric energy. Smart Grid related big data is generated from various sources, such as (a) power utilization habits of users, (b) phasor measurement data, measured by phasor measurement units (PMU) deployed nationwide, (c) energy consumption data measured by smart meters in the Advanced Metering Infrastructure (AMI), (d) energy market pricing and bidding data, and (e) management, control, and maintenance data for devices and equipment in the power generation, transmission, and distribution networks (such as circuit breaker monitors and transformers). Smart Grid brings about the following challenges for exploiting big data.

• Grid Planning: By analyzing data in the Smart Grid, regions with excessively high electrical load or frequent power outages can be identified, and even the transmission lines with a high failure probability can be predicted. Such analytical results may contribute to grid upgrading, transformation, and maintenance, etc. For example, researchers from the University of California, Los Angeles designed an “electric map” based on big data and made a California map by integrating census information and real-time power utilization information provided by electric power companies. The map takes a block as a unit and shows the power consumption of every block at the current moment. It can even compare a block’s power consumption with average income per capita and building types, so as to reveal more accurate power usage habits of different groups in the community. This map provides an effective, visual load forecast for city and power grid planning. As the map shows, priority upgrades of power grid facilities may be carried out in blocks with frequent power outages and serious overloads.

• Interaction Between Power Generation and Power Consumption: An ideal power grid should balance power generation and consumption. However, the traditional power grid is constructed on a one-directional approach of transmission, transformation, distribution, and consumption, which cannot adjust generation capacity according to the demand for power, thus leading to redundancy and waste of electric energy. To this end, smart electric meters have been developed to enable interaction between power consumption and power generation and to improve power supply efficiency. TXU Energy has widely deployed smart electric meters with great success. Power supply companies can now read power utilization data every 15 min rather than once a month as in the past. This saves the labor cost of meter reading and, because power utilization data (a source of big data) is acquired and analyzed frequently and rapidly, power supply companies can adjust the electricity price according to peak and off-peak periods of power consumption. TXU Energy uses this price lever to smooth the peaks and troughs of power consumption. As a matter of fact, the application of big data in the smart grid can help realize time-of-use dynamic pricing, a win-win for both energy suppliers and users; a minimal billing sketch follows.
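
As a rough illustration of time-of-use billing built on 15-minute readings, the Python sketch below applies hypothetical peak and off-peak prices to made-up meter readings.

# Toy time-of-use billing from 15-minute smart meter readings.
peak_hours = range(17, 21)                 # 5 pm - 9 pm counted as peak (assumed)
price = {"peak": 0.30, "offpeak": 0.12}    # currency per kWh, hypothetical

# Each reading: (hour_of_day, energy in kWh for that 15-minute interval).
readings = [(8, 0.4), (12, 0.5), (18, 1.2), (19, 1.1), (23, 0.3)]

bill = sum(kwh * price["peak" if hour in peak_hours else "offpeak"]
           for hour, kwh in readings)
print(round(bill, 2))   # 0.83 for the made-up readings above
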

• Access of Intermittent Renewable Energy: At present, many new energy resources, such as wind and solar energy, are also being connected to power grids. However, since the generation capacity of such new energy resources is closely tied to climate conditions, which are random and intermittent, it is challenging to integrate them into power grids. If the big data of power grids is effectively analyzed, such intermittent renewable energy sources can be effectively managed: the electricity they generate can be allocated to regions with electricity shortages, so that these sources complement traditional hydropower and thermal power generation.


Big Data Related Technologies, Challenges and Future Prospects

 

Big Data Analysis

Abstract In this chapter, we introduce the methods, architectures, and tools for big data analysis. The analysis of big data mainly involves analytical methods for traditional data and big data, analytical architectures for big data, and software used for mining and analysis of big data. Data analysis is the final and most important phase in the value chain of big data, with the purpose of extracting useful value and providing suggestions or decisions. Different levels of potential value can be generated through the analysis of datasets in different fields.

 

5.1 Traditional Data Analysis

Traditional data analysis means using proper statistical methods to analyze massive first-hand and second-hand data, to concentrate, extract, and refine the useful information hidden in a batch of chaotic data, and to identify the inherent laws of the subject matter, so as to exploit the data to the greatest extent and maximize its value. Data analysis plays an important guiding role in making national development plans, as well as in understanding customer demands and predicting market trends for enterprises.

Big data analysis can be deemed the analysis of a special kind of data. Therefore, many traditional data analysis methods may still be used for big data analysis. Several representative traditional data analysis methods are examined in the following, many of which come from statistics and computer science.

• Cluster Analysis: cluster analysis is a statistical method for grouping objects, specifically, classifying objects according to certain features. It is used to differentiate objects with certain features and divide them into categories (clusters) according to these features, such that objects in the same category have high homogeneity while objects in different categories have high heterogeneity. Cluster analysis is an unsupervised learning method that does not require training data; a minimal k-means sketch follows.
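
As a rough illustration only, the Python sketch below implements a bare-bones k-means; the data points, the function name kmeans, and the iteration count are made up for this example, and a library implementation would be used in practice.

# Bare-bones k-means clustering on small 2-D points.
import random

def kmeans(points, k, iterations=100):
    # Pick k initial centroids at random from the data.
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(col) / len(cluster) for col in zip(*cluster))
    return centroids, clusters

# Toy usage: two obvious groups in 2-D.
data = [(1, 1), (1.2, 0.8), (0.9, 1.1), (8, 8), (8.2, 7.9), (7.8, 8.1)]
centers, groups = kmeans(data, k=2)
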

• Factor Analysis: factor analysis aims to describe the relations among many indicators or elements with only a few factors, i.e., several closely related variables are grouped together and every group of variables becomes a factor (called a factor because it is unobservable, i.e., not a specific variable), and the few factors are then used to reveal the most valuable information in the original data.

• Correlation Analysis: correlation analysis is an analytical method for determining the law of correlations among observed phenomena and accordingly conducting forecasting and control. There are plenty of quantitative relations among observed phenomena, such as correlation, correlative dependence, and mutual restriction. Such relations may be classified into two types: (a) function, reflecting a strict dependence relationship among phenomena, also called a definitive dependence relationship, in which every value of a variable corresponds to one or several determined values; (b) correlation, under which some undetermined and inexact dependence relations exist, so that a value of one variable may correspond to several values of the other variable, and such values fluctuate regularly around their mean values. A classic example is that customers of many supermarkets purchase beer while they are buying diapers; a minimal correlation sketch follows.
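
The following Python sketch computes the Pearson correlation coefficient for two numeric series; the diaper and beer sales figures are invented solely to echo the supermarket example above.

# Pearson correlation between two equal-length numeric sequences.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)   # assumes non-zero variance in both series

# Hypothetical monthly sales figures.
diapers = [120, 135, 150, 160, 180]
beer    = [100, 110, 130, 135, 155]
print(pearson(diapers, beer))   # close to 1.0, i.e. strongly correlated
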

• Regression Analysis: regression analysis is a mathematical tool for revealing correlations between one variable and several other variables. Based on a group of experimental or observed data, regression analysis identifies dependence relationships among variables that are hidden by randomness, and can turn complex and undetermined correlations among variables into simple and regular ones. A minimal least-squares sketch follows.
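
A minimal least-squares sketch for simple linear regression with one explanatory variable, in Python; the advertising-spend and sales numbers are hypothetical.

# Fit y = intercept + slope * x by ordinary least squares.
def fit_line(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Slope = covariance(x, y) / variance(x); intercept follows from the means.
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    return slope, intercept

spend = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical advertising spend
sales = [2.1, 4.3, 6.2, 8.1, 9.9]    # hypothetical sales
slope, intercept = fit_line(spend, sales)
predicted = intercept + slope * 6.0  # forecast for a new spend level
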

• A/B Testing: also called bucket testing, it is a technique for identifying changes that improve target variables by comparing a tested group against a control group. Big data requires a large number of tests to be executed and analyzed to ensure groups of sufficient size for detecting significant differences between the control group and the treatment group; a minimal significance-test sketch follows.
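
A minimal sketch of the significance check behind an A/B test, assuming a two-proportion z-test and hypothetical conversion counts; real experiments would also plan sample sizes and account for multiple testing.

# Two-proportion z-test for comparing conversion rates of groups A and B.
from math import sqrt

def z_score(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                 # pooled conversion rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))   # standard error
    return (p_b - p_a) / se

# Control group A vs. treatment group B (made-up counts).
z = z_score(conv_a=480, n_a=10_000, conv_b=540, n_b=10_000)
significant = abs(z) > 1.96   # roughly the 5 % significance threshold
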

• Statistical Analysis: Statistical analysis is based on statistical theory, a branch of applied mathematics in which randomness and uncertainty are modeled with probability theory. Statistical analysis can provide description and inference for large-scale datasets: descriptive statistical analysis summarizes and describes datasets, while inferential statistical analysis draws conclusions from data subject to random variation. Analytical technologies based on complex multi-variate statistical analysis include regression analysis, factor analysis, clustering, and discriminant analysis, etc. Statistical analysis is widely applied in the economic and medical care fields [1].

• Data Mining: Data mining is the process of extracting hidden, previously unknown, but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy, and random data. Related terms include knowledge discovery from databases, data analysis, data fusion, and decision support.

Data mining is mainly used to complete the following six different tasks, with corresponding analytical methods: Classification, Estimation, Prediction, Affinity grouping or association rules, Clustering, and Description and Visualization. Original data is regarded as the source from which knowledge is formed, and data mining is a process of discovering knowledge from the original data. Original data may be structured, e.g., data in relational databases; semi-structured, e.g., text, graphical, and image data; or even heterogeneous data distributed across the network. Methods to discover knowledge may be mathematical or non-mathematical, and deductive or inductive. Discovered knowledge may be used for information management, query optimization, decision support, and process control, as well as data maintenance.

Mining methods are generally divided into machine learning methods, neural network methods, and database methods. Machine learning may be further divided into inductive learning, example-based learning, and genetic algorithms, etc. Neural network methods may be divided into feedforward neural networks and self-organizing neural networks, etc. Database methods mainly include multi-dimensional data analysis or OLAP (On-Line Analytical Processing), as well as attribute-oriented induction.

Various data mining algorithms have been developed in the artificial intelligence, machine learning, pattern recognition, statistics, and database communities, etc. In 2006, the IEEE International Conference on Data Mining (ICDM) identified the ten most influential data mining algorithms through a strict selection procedure [2], including C4.5, k-means, SVM, Apriori, EM, Naive Bayes, and CART, etc. These ten algorithms cover classification, clustering, regression, statistical learning, association analysis, and link mining, which are among the most important problems in data mining research. In addition, other advanced algorithms such as neural networks and genetic algorithms can also be applied to data mining in different applications. Prominent application areas include gaming, business, science, engineering, and surveillance, etc. The sketch below illustrates the frequent-itemset counting idea behind one of these algorithms, Apriori.
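
To make one of the listed algorithms concrete, the Python sketch below counts frequent item pairs in the spirit of Apriori; the transactions and the support threshold are invented, and a full Apriori implementation would iterate from frequent pairs to larger itemsets.

# Count item pairs that appear in at least min_support transactions.
from collections import Counter
from itertools import combinations

transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "beer", "bread"},
]
min_support = 2   # a pair must appear in at least two transactions

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

frequent_pairs = {pair: c for pair, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)   # {('beer', 'diapers'): 3}
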

 

5.2 Big Data Analytic Methods

 

At the dawn of the big data era, people are concerned with how to rapidly extract key information from massive data so as to create value for enterprises and individuals. The main processing methods for big data are described below.

• Bloom Filter: a Bloom Filter consists of a bit array and a series of hash functions. Its principle is to store hash values of the data, rather than the data itself, in a bit array, which is in essence a bitmap index that uses hash functions to conduct lossy compression of the data. It has advantages such as high space efficiency and high query speed, but also disadvantages such as a certain false-positive rate and difficulty in deleting elements. The Bloom Filter applies to big data applications that can tolerate a certain false-positive rate; a minimal sketch follows.
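
A minimal Bloom Filter sketch in Python, assuming salted SHA-1 digests as the hash functions; real deployments size the bit array and the number of hash functions from the expected data volume and the acceptable false-positive rate.

# Toy Bloom filter: membership test with possible false positives, no false negatives.
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hash_count=3):
        self.size = size
        self.hash_count = hash_count
        self.bits = [0] * size

    def _positions(self, item):
        # Derive several bit positions from salted SHA-1 digests.
        for i in range(self.hash_count):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("user_42")
print(bf.might_contain("user_42"))   # True
print(bf.might_contain("user_99"))   # almost always False
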

• Hashing: a method that transforms data into shorter, fixed-length numerical or index values. Hashing has advantages such as rapid reading and writing and high query speed, but a sound hash function can be hard to find; a minimal bucketing sketch follows.
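
A minimal sketch of hashing arbitrary keys to short, fixed-range bucket indices, e.g., for spreading records across storage nodes; the MD5-based scheme and the record names are illustrative assumptions, not a prescribed design.

# Map arbitrary-length keys to a small, fixed range of bucket numbers.
import hashlib

def bucket_of(key, num_buckets=16):
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_buckets

records = ["order-1001", "order-1002", "user-alice", "user-bob"]
for r in records:
    print(r, "->", bucket_of(r))   # same key always lands in the same bucket
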

• Index: indexing has always been an effective method for reducing the cost of disk reading and writing and for improving insertion, deletion, modification, and query speeds, in both traditional relational databases that manage structured data and technologies that manage semi-structured and unstructured data. However, an index has the disadvantage of the additional cost of storing index files, and the index files must be maintained dynamically as the data is updated.

• Trie: also called a trie tree, a variant of the hash tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of the trie is to exploit common prefixes of character strings to reduce string comparisons to the greatest extent, so as to improve query efficiency; a minimal sketch follows.
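
A minimal trie sketch for word-frequency counting in Python; the class and function names are invented for this example.

# A trie stores shared prefixes once; word counts live at the end nodes.
class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> child node
        self.count = 0       # how many times a word ends here

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.count += 1

def frequency(root, word):
    node = root
    for ch in word:
        if ch not in node.children:
            return 0
        node = node.children[ch]
    return node.count

root = TrieNode()
for w in ["data", "data", "database", "big"]:
    insert(root, w)
print(frequency(root, "data"))      # 2
print(frequency(root, "database"))  # 1
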

• Parallel Computing: compared to traditional serial computing, parallel computing refers to using several computing resources simultaneously to complete a computation task. Its basic idea is to decompose a problem into parts and assign them to several independent processes to be completed separately, so as to achieve co-processing. Presently, classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad. A qualitative comparison of the three models is presented in Table 5.1, and a minimal map/reduce-style sketch follows.
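
As a rough illustration of the decompose-and-combine idea, the Python sketch below runs a map/reduce-style word count with a multiprocessing pool standing in for the parallel workers; at scale one would use MPI, Hadoop MapReduce, or Dryad rather than this toy.

# Map phase in parallel processes, reduce phase merges the partial counts.
from collections import Counter
from multiprocessing import Pool

def map_count(chunk):
    # Map phase: count words inside one independent chunk of text.
    return Counter(chunk.split())

if __name__ == "__main__":
    chunks = [
        "big data needs parallel computing",
        "parallel computing splits big problems",
        "big data big value",
    ]
    with Pool(processes=3) as pool:
        partial_counts = pool.map(map_count, chunks)   # chunks processed in parallel
    # Reduce phase: merge the partial results into one global count.
    total = sum(partial_counts, Counter())
    print(total.most_common(3))
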

Although parallel computing systems and tools such as MapReduce or Dryad are useful for big data analysis, they are low-level tools with a steep learning curve. Therefore, some high-level parallel programming tools and languages have been developed on top of these systems. Such high-level languages include Sawzall, Pig, and Hive for MapReduce, and Scope and DryadLINQ for Dryad.

 

5.3 Architecture for Big Data Analysis

Due to the wide range of sources, the variety of structures, and the broad application fields of big data, different analytical architectures should be considered for big data with different application requirements.

 

5.3.1 Real-Time vs. Offline Analysis

Big data analysis can be classified into real-time analysis and offline analysis according to the real-time requirements. Real-time analysis is mainly used in E-commerce and finance. Since the data changes constantly, rapid analysis is needed and results must be returned with very short delay. The main existing architectures for real-time analysis include (a) parallel processing clusters using traditional relational databases, and (b) memory-based computing platforms. For example, Greenplum from EMC and HANA from SAP are both real-time analysis architectures.

Offline analysis is usually used for applications without stringent requirements on response time, e.g., machine learning, statistical analysis, and recommendation algorithms. Offline analysis generally imports big data, such as logs, into a dedicated platform through data acquisition tools. Under the big data setting, many Internet enterprises use offline analysis architectures based on Hadoop to reduce the cost of data format conversion and improve the efficiency of data acquisition. Examples of such acquisition tools include Facebook’s open source tool Scribe, LinkedIn’s open source tool Kafka, Taobao’s open source tool Timetunnel, and Chukwa of Hadoop, etc. These tools can meet the demands of data acquisition and transmission at hundreds of MB per second.

 

5.3.2 Analysis at Different Levels

Big data analysis can also be classified into memory-level analysis, Business Intelligence (BI) level analysis, and massive-level analysis, which are examined in the following.

• Memory-Level: Memory-level analysis applies when the total data volume fits within the memory capacity of a cluster. The memory of current server clusters exceeds hundreds of GB, and the TB level is becoming common. Therefore, in-memory database technology may be used, with hot data kept in memory to improve analytical efficiency. Memory-level analysis is extremely suitable for real-time analysis, and MongoDB is a representative memory-level analytical architecture. With the development of SSDs (Solid-State Drives), the capacity and performance of memory-level data analysis have been further improved and widely applied.

• BI: BI analysis applies when the data scale surpasses the memory level but can still be imported into a BI analysis environment. Currently, mainstream BI products provide data analysis plans that support data beyond the TB level.

• Massive: Massive analysis applies when the data scale has completely surpassed the capacities of BI products and traditional relational databases. At present, most massive analysis uses HDFS of Hadoop to store data and MapReduce for data analysis. Most massive analysis belongs to the offline analysis category.

 

5.3.3 Analysis with Different Complexity

The time and space complexity of data analysis algorithms differ greatly according to the kind of data and the application demands. For example, for applications amenable to parallel processing, a distributed algorithm may be designed and a parallel processing model may be used for data analysis.

 

5.4 Tools for Big Data Mining and Analysis

Many tools for big data mining and analysis are available, including professional and amateur software, expensive commercial software, and free open source software. In this section, we briefly review the five most widely used tools, according to the survey “What Analytics, Data mining, Big Data software have you used in the past 12 months for a real project?” of 798 professionals conducted by KDnuggets in 2012.

 

• R (30.7 %): R, an open source programming language and software environment, is designed for data mining/analysis and visualization. When compute-intensive tasks are executed, code written in C, C++, and Fortran may be run under the R environment, and skilled users can directly call R objects from C. R is an implementation of the S language, an interpreted language developed by AT&T Bell Labs for data exploration, statistical analysis, and plotting. Initially, S was mainly implemented in S-PLUS, which is commercial software. Compared to S, R is more popular because it is open source, and it ranked first in the KDnuggets 2012 survey. Furthermore, in the 2012 survey “Design languages you have used for data mining/analysis in the past year”, R was also in first place, ahead of SQL and Java. Due to the popularity of R, database manufacturers such as Teradata and Oracle have both released products supporting R.

• Excel (29.8 %): Excel, a core component of Microsoft Office, provides powerful data processing and statistical analysis capabilities and aids decision making. When Excel is installed, some advanced plug-ins with powerful data analysis functions, such as Analysis ToolPak and Solver Add-in, are also integrated, but they can be used only after users enable them. Excel is the only commercial software among the top five.

• Rapid-I RapidMiner (26.7 %): RapidMiner is open source software used for data mining, machine learning, and predictive analysis. In a KDnuggets investigation in 2011, it was used more frequently than R (and ranked first). Data mining and machine learning functions provided by RapidMiner include Extract, Transform and Load (ETL), data pre-processing and visualization, modeling, evaluation, and deployment. The data mining flow is described in XML and displayed through a graphical user interface (GUI). RapidMiner is written in Java; it integrates the learners and evaluation methods of Weka and works with R. Functions of RapidMiner are implemented by connecting processes of operators. The entire flow can be viewed as a factory production line, with raw data as input and model results as output. The operators can be regarded as specific functions with different input and output characteristics.

• KNIME (21.8 %): KNIME (Konstanz Information Miner) is a user-friendly, intelligent, open-source platform for data integration, data processing, data analysis, and data mining. It allows users to create data flows or data channels in a visualized manner, to selectively run some or all analytical procedures, and to obtain analytical results, models, and interactive views. KNIME is written in Java and, based on Eclipse, provides additional functions as plug-ins. Through plug-in files, users can add processing modules for files, pictures, and time series, and integrate them with various open source projects, e.g., R and Weka. KNIME covers data integration, cleansing, conversion, filtering, statistics, mining, and finally data visualization, and the entire development process is conducted in a visualized environment. KNIME is designed as a module-based, extensible framework. There is no dependence between its processing units and data containers, making them adaptable to distributed environments and independent development. In addition, it is easy to extend KNIME: developers can effortlessly add new nodes and views.

• Weka/Pentaho (14.8 %): Weka, short for Waikato Environment for Knowledge Analysis, is free and open-source machine learning and data mining software written in Java. Weka provides functions such as data processing, feature selection, classification, regression, clustering, association rules, and visualization, etc. Pentaho is one of the most popular open-source business intelligence suites. It is a BI kit based on the Java platform, including a web server platform and several tools to support reporting, analysis, charting, data integration, and data mining, etc., covering all aspects of BI. Weka’s data processing algorithms are also integrated into Pentaho and can be called directly.

