Big Data solutions for SQL Server

Big data is data whose scale or structure makes it difficult or impossible for a traditional relational database to manage. The term has seen increasing use over the past few years. In this chapter, we review the various ways that big data is described and how Hadoop has developed as a technology commonly used to process it. In addition, we introduce Microsoft HDInsight, an implementation of Hadoop available as a Windows Azure service. Then we explore Microsoft PolyBase, an on-premises solution that integrates relational data stored in Microsoft SQL Server Parallel Data Warehouse (PDW) with non-relational data stored in a Hadoop Distributed File System (HDFS).
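To make the PolyBase scenario concrete, here is a minimal T-SQL sketch of exposing HDFS files as an external table and joining them to relational data. All object names are invented, and the syntax follows the PolyBase dialect that later shipped in SQL Server 2016; PDW's original dialect differs in some details, so treat this as illustrative rather than PDW-exact.

```sql
-- Point PolyBase at the Hadoop cluster (hypothetical namenode address).
CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (TYPE = HADOOP, LOCATION = 'hdfs://namenode:8020');

-- Describe the layout of the files (pipe-delimited text here).
CREATE EXTERNAL FILE FORMAT PipeDelimited
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));

-- Expose an HDFS directory as a queryable table.
CREATE EXTERNAL TABLE dbo.WebClickstream (
    CustomerId INT,
    PageUrl    NVARCHAR(400),
    ClickedAt  DATETIME2
)
WITH (LOCATION = '/logs/clickstream/',
      DATA_SOURCE = HadoopCluster,
      FILE_FORMAT = PipeDelimited);

-- The external table now participates in ordinary joins:
-- non-relational HDFS data meets a relational dimension table.
SELECT c.CustomerName, COUNT(*) AS PageViews
FROM dbo.WebClickstream AS w
JOIN dbo.Customer AS c ON c.CustomerId = w.CustomerId
GROUP BY c.CustomerName;
```

The appeal of this design is that analysts keep writing plain SQL: the engine decides what to read from HDFS, so Hadoop-resident data needs no upfront import into the warehouse.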

Describing Big Data

The point at which data becomes big data is still the subject of much debate among data-management professionals. One approach to describing big data is known as the 3Vs: volume, velocity, and variety. This model, introduced by Gartner analyst Doug Laney in 2001, has since been extended with a fourth V, variability. However, disagreement continues, with some people arguing that the fourth V should instead be veracity.

Although it seems reasonable to associate volume with big data, how is a large volume different from the very large databases (VLDBs) and extreme workloads that some industries routinely manage? Examples of data sources that fall into this category include airline reservation systems, point-of-sale terminals, financial-trading systems, and cellular-phone networks. As machine-generated data outpaces human-generated data, the volume of data available for analysis is growing rapidly. Many techniques, as well as software and hardware solutions such as PDW, already exist to address high volumes of data. Therefore, many people argue that some other characteristic must distinguish big data from other classes of data that are routinely managed.

Some people suggest that this additional characteristic is velocity, the speed at which the data is generated. As an example, consider the data generated by the Large Hadron Collider experiments, which is produced at a rate of 1 gigabyte (GB) per second and must subsequently be processed and filtered to provide 30 petabytes (PB) of data to physicists around the world. Most organizations are not generating data at this volume or pace, but data sources such as manufacturing sensors, scientific instruments, and web-application servers are nonetheless generating data so fast that complex event-processing applications are required to handle high-volume, high-speed throughput. Microsoft StreamInsight is a platform that supports this type of data management and analysis.
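As a rough sanity check on those figures: 1 GB per second sustained is 86,400 GB (about 86 TB) per day, or roughly 31 PB over a year of continuous operation, so the 30 PB total is consistent with approximately one year of data at that rate, if the figure is read as an annual amount.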

Data does not necessarily require both volume and velocity to be categorized as big. Instead, a high volume of data with a lot of variety can constitute big data. Variety refers to the different ways that data might be stored: structured, semistructured, or unstructured. On the one hand, data-warehousing techniques exist to integrate structured data (often in relational form) with semistructured data (such as XML documents). On the other hand, unstructured data is more challenging, if not impossible, to analyze by using traditional methods. This type of data includes documents in PDF or Word format, images, and audio or video files, to name a few examples. Not only is unstructured data problematic for analytical solutions, but it is also growing more quickly than the file system on a single server can usually accommodate.
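As a small illustration of why the semistructured case is tractable, the T-SQL sketch below (table and element names are invented) stores XML alongside ordinary columns and shreds it back into relational form with the xml type's value() method; there is no comparable built-in path for truly unstructured files such as images or PDFs.

```sql
-- A relational table holding semistructured XML next to structured columns.
CREATE TABLE dbo.ProductFeedback (
    FeedbackId INT IDENTITY PRIMARY KEY,
    ProductId  INT NOT NULL,
    Details    XML   -- free-form; schema may vary by source
);

INSERT INTO dbo.ProductFeedback (ProductId, Details)
VALUES (42, '<feedback><rating>4</rating><comment>Solid drive</comment></feedback>');

-- Shred the XML back into columns so it can be joined and aggregated
-- like any other relational data.
SELECT f.ProductId,
       f.Details.value('(/feedback/rating)[1]',  'INT')           AS Rating,
       f.Details.value('(/feedback/comment)[1]', 'NVARCHAR(200)') AS Comment
FROM dbo.ProductFeedback AS f;
```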

Big data as a branch of data management is still difficult to define with precision, given that many competing views exist and no clear standards or methodologies have been established. Data that looks big to one organization by any of the definitions we’ve described might look small to another organization that has evolved solutions for managing specific types of data. Perhaps the best definition of big data at present is also the most general. For the purpose of this chapter, we take the position that big data describes a class of data that currently available relational database systems cannot effectively support and that consequently requires a different architectural approach, such as one built for append-only workloads instead of updates.

A very good, friendly introduction to big data. Mindmajix should look into offering affordable, value-added Ignition training and project-engineering support services. Many SCADA SIs can’t afford to recruit full-time Java, Python, JavaScript, and database experts, but most IT experts do not know SCADA either. You could take the initiative to bridge that gap.

Launch Ignition courses on Udemy in partnership with instructors like Maximilian Schwarzmüller, Siraj Raval, and others. In fact, I suggest Mindmajix talk to Ignition sales, buy licenses in bulk, and become a VAR. Your company would hit the jackpot. 🙂

Hint: if you look through the technical discussions, you will notice that most of them are about scripting and databases. Add Python ML, React programming, mobile apps, and ERP integration on top of that, and the opportunity is clear.

A huge volume of relational-database data is lying unused in every SCADA control room. How do you think big data solutions can bring value to these companies?