
Realtime and Near-Realtime Data Sources and Data Integrity

Matt Pluster  |  
Dec 17, 2019
 
In this blog series, Matt Pluster, Director of Skyline’s Data Analytics Consulting Practice, explores data sources and processing options. For a full overview on this topic, check out the Realtime vs Near-Realtime webinar.
 
In previous blogs in this series, I dug into advantages, disadvantages, and best practices related to Realtime vs Near-Realtime data processing and availability. In this final installment, I want to offer a few quick notes on data sources and data integrity related to these two solution types.
 

Realtime Dataset Options

Let’s start with a couple of quick notes on Realtime datasets. Power BI supports three types of Realtime datasets: push datasets, streaming datasets, and what's known as a PubNub streaming dataset.
 

Push Dataset

With a push dataset, data is pushed directly into the Power BI service. We can think of this as archiving the data in the back end of Power BI: the service creates a new underlying database to store that data.
 
Because the data is persisted, we can also look back at what's happened over time. Once a report has been created using a push dataset, any visuals from that dataset can be pinned to a dashboard. Those visuals will update in Realtime whenever the data is updated. It works like a trigger: within the service, the dashboard refreshes that tile whenever new data is received.
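As a sketch of how rows reach a push dataset, the Power BI REST API exposes an "add rows" endpoint for push datasets. The code below is illustrative, not a production client: the dataset ID, table name, and row fields are placeholders, and a valid Azure AD access token is assumed.

```python
import json


def build_rows_payload(rows):
    """Shape a list of row dicts into the JSON body expected by the
    Power BI 'add rows' endpoint: {"rows": [...]}."""
    return {"rows": rows}


def push_rows(dataset_id, table_name, rows, token):
    """POST rows to a push dataset. Requires an Azure AD token with
    dataset write permission; dataset_id and table_name come from the
    push dataset you created in the Power BI service."""
    import requests  # third-party: pip install requests

    url = (f"https://api.powerbi.com/v1.0/myorg/datasets/"
           f"{dataset_id}/tables/{table_name}/rows")
    resp = requests.post(
        url,
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        data=json.dumps(build_rows_payload(rows)),
    )
    resp.raise_for_status()
```

In practice you would call `push_rows("your-dataset-id", "SensorReadings", [{"temp": 21.5}], token)` from whatever process captures the data; the dashboard tiles pinned from that dataset then refresh on their own.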
 

Streaming Dataset

Data is also pushed into the Power BI service with a streaming dataset, but with one important difference: Power BI stores the data in a temporary cache, which quickly expires. That temporary cache is used to display visuals, which have only a transient sense of history. We may be able to see a little trending information (like a line chart with a one-hour time window), but once the cache expires, we can't look back over longer periods of time.
 
With a streaming dataset, there is no underlying database, so you can't build report visuals from the data that flows in over the stream. There is no filtering, custom visuals, or other report functionality that we would typically have in Power BI. The only way to visualize data from a streaming dataset is to add a tile to your dashboard that uses the streaming dataset as its data source. Those custom streaming tiles are optimized for quickly displaying Realtime data.
 
The result is that there's very little latency between the time when the data's being pushed into the Power BI service, and when that visual is being updated.
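To make that concrete: a streaming dataset created with the API option exposes a push URL (with an embedded key) that accepts a JSON array of rows. The sketch below assumes a hypothetical sensor schema and a push URL copied from the dataset's API info in the Power BI service; the field names must match whatever schema you defined when creating the streaming dataset.

```python
import json
from datetime import datetime, timezone


def make_reading(sensor_id, value):
    """Build one row for the streaming dataset. The field names here
    (sensorId, value, timestamp) are illustrative and must match the
    schema defined on the streaming dataset."""
    return {
        "sensorId": sensor_id,
        "value": value,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def stream_rows(push_url, rows):
    """POST a JSON array of rows to the streaming dataset's push URL.
    The URL already embeds an auth key, so no separate token is needed."""
    import requests  # third-party: pip install requests

    resp = requests.post(push_url, data=json.dumps(rows),
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()
```

Because nothing is persisted, each call only feeds the dashboard tiles; once the cache window passes, those rows are gone.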
 

PubNub Streaming Dataset

The third scenario is PubNub. With a PubNub streaming dataset, the Power BI web client uses the PubNub software development kit to read an existing PubNub data stream. I won’t go into the details of building a PubNub data stream in this blog, but these datasets behave like traditional streaming datasets: they can only be visualized by adding a tile to the dashboard.
 
Each of these three types of Realtime datasets offers different advantages and update rates, and one of them may best fit your ultimate Power BI use case.
 

Storage Approaches for High-Volume/High-Velocity Data

Of course, there are different concerns if you need to store high-volume or high-velocity data, rather than just display it or temporarily cache it in Power BI.
 
Persisting data in an Azure SQL DB instance is a solid and highly scalable option. Microsoft has invested heavily in scalability, and the ability to turn up the performance of a single Azure SQL DB instance is rather amazing. In many cases, that instance will suffice.
 
However, if we've got high-velocity, high-volume data, we will want to think about Azure Data Lake Storage or Azure Blob Storage. I prefer Azure Data Lake, especially the Gen2 capabilities. Azure Data Lake lets us store big data at a very low cost, and its hierarchical, structured layout feels familiar to those used to working in Azure SQL DB or a traditional SQL Server instance. We have many clients with IoT or IIoT Realtime scenarios who are persisting data in an Azure Data Lake Storage environment.
 

Data Integrity with Cloud-Based Data Solutions

As a last comment, one of the questions we’ve received when we talk about data solutions is, “Have you seen any issues with data integrity when it comes to cloud-based solutions?” Users can be concerned with keeping transactions in sync. In a transactional system, locking and blocking issues can occur when, as a record is being updated, we don't have the ability to also load that record into another data store. We can engineer around potential data integrity issues, but doing so requires an awareness of the capabilities and limitations of the data technologies being used.
 
One of the other keys to maintaining that integrity is implementing a framework that identifies where a potential data integrity issue has occurred – like where a transaction hasn't loaded or where we've had a failure in processing. At Skyline, we implement different logging and auditing capabilities so we can watch for those issues and potentially quarantine the data if something comes up.
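A minimal sketch of that kind of audit check, with illustrative field names rather than Skyline's actual framework: compare source and loaded row counts per batch, and flag any mismatch for quarantine instead of letting it pass silently.

```python
def audit_batch(batch_id, source_count, loaded_count):
    """Compare row counts for one load batch. A mismatch marks the
    batch 'quarantine' so it can be held back and reprocessed rather
    than silently losing (or duplicating) records."""
    status = "ok" if source_count == loaded_count else "quarantine"
    return {
        "batch_id": batch_id,
        "source_count": source_count,
        "loaded_count": loaded_count,
        "missing": source_count - loaded_count,
        "status": status,
    }
```

In a real pipeline the result would be written to an audit table and alerting would key off the `status` column; the point is simply that integrity failures become visible, logged events.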
 

Conclusion

While this is the last installment of my blog series on Realtime vs Near-Realtime data, I’m happy to talk through any Realtime, Near-Realtime, or other data and analytics questions that your organization may be facing.
 
Good luck with your Realtime and Near-Realtime data processing use cases!
 