LUC #21: Understanding Data Streams — The Solution to Handling Continuous Flows of Big Data
Plus, how the TCP handshake works, using Big O effectively in interviews, and the operation behind data processing systems
Welcome back to another edition of Level Up Coding’s newsletter.
In today’s issue:
Demystifying Data Streams
How Does The TCP Handshake Work? (Recap)
The Operation Behind Data Processing Systems (Recap)
How to Effectively Use Big O in Technical Interviews (Recap)
Read time: 6 minutes
Demystifying Data Streams
We're living in a time where everything's online and interconnected. From social media posts and sensor readings to real-time transaction logs, modern systems are inundated with information at a scale and speed previously unimagined. The challenge lies in managing this mountain of information. Enter the world of data streams: a brilliant solution to exactly this problem.
What Exactly Is a Data Stream?
A data stream is a sequence of data that is generated continuously, often at high velocity. Rather than processing data as a static batch, stream processing handles the data in real time (or near real time), enabling applications to react swiftly to incoming information.
Types of Data Streams
The realm of data streams is quite diverse. Some streams are never-ending (continuous data streams), always supplying applications with fresh data; others have a clear beginning and end (bounded data streams), often originating from specific datasets. There are also differences in how organized the data is: some streams are structured and may follow a schema, similar to database tables (structured data streams), while others are more free-form, stemming from sources like text files or media content (unstructured data streams). Lastly, there's the factor of plurality. Single-source streams come from a single data source, whilst multi-source streams mix and merge data from multiple sources. As you've probably guessed, multi-source streams are more complex to work with but provide much richer insights.
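The continuous-versus-bounded distinction can be sketched with Python generators. This is a hypothetical, minimal illustration (the sensor payload and value range are invented for the example), not a real streaming framework: a bounded stream is exhausted once consumed, while an unbounded one must be read incrementally, for example one window at a time.

```python
import itertools
import random

def bounded_stream(records):
    """A bounded stream: a clear beginning and end (e.g., a fixed dataset)."""
    for record in records:
        yield record

def continuous_stream():
    """A continuous (unbounded) stream: keeps producing fresh readings forever."""
    while True:
        yield {"sensor": "temp", "value": round(random.uniform(18.0, 25.0), 2)}

# A bounded stream can be drained completely...
readings = list(bounded_stream(["r1", "r2", "r3"]))

# ...while a continuous stream never ends, so we take a finite window of it.
window = list(itertools.islice(continuous_stream(), 5))
```

Real stream processors (Kafka consumers, Flink jobs, and the like) follow the same consumption model as the second case: you never "finish" the stream, you process it window by window or event by event.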
Implementing a Data Stream
To implement a data stream, it's key to understand the volume, velocity, and variability of the data. These three Vs determine the demands placed on the streaming system. Volume refers to the amount of data generated over a specific timeframe, dictating the storage and processing capacity needed. Velocity is the speed at which data is produced and ingested into the system, affecting real-time processing capabilities. Variability, on the other hand, covers the inconsistencies or fluctuations in the data rate, which can pose challenges for predictability and resource allocation. With these factors in mind, you can select an appropriate stream processing framework and supporting tools such as data storage solutions.
There are several other factors to keep in mind when building a streaming system. First and foremost, the integrity of the data is vital; it should be consistent, accurate, and reliable. Given the continuous flow of streams, inconsistencies can easily emerge, so proactive measures to prevent them are essential. Additionally, security is paramount — don’t forget to implement security measures to ensure that the data remains protected from unauthorized access.
Data streams, like many other areas of tech, come with a set of guidelines to keep in mind. First off, as the landscape of data continues to expand, it's vital to ensure your setup can scale horizontally. To ensure no data is lost, implement fault tolerance. Lastly, given how dynamic data can be, staying flexible is key. Data streams might evolve due to new sources, differing formats, or varying data volumes. Anticipating these shifts helps systems stay robust and responsive.
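One common way to get fault tolerance is offset checkpointing: the consumer records the position of the last event it processed, so that after a crash it resumes where it left off instead of losing or re-processing data. The sketch below is an assumed, simplified design (the in-memory `checkpoint` dict stands in for durable storage such as Kafka's committed offsets):

```python
def process_with_checkpoint(events, checkpoint):
    """Process events, resuming from the last checkpointed offset."""
    start = checkpoint.get("offset", 0)
    processed = []
    for offset, event in enumerate(events):
        if offset < start:
            continue                       # already handled before the restart
        processed.append(event.upper())    # stand-in for real processing work
        checkpoint["offset"] = offset + 1  # persist progress after each event
    return processed

checkpoint = {"offset": 0}
first_run = process_with_checkpoint(["a", "b", "c", "d"], checkpoint)

# Simulate a restart: replaying the same events skips everything already done.
second_run = process_with_checkpoint(["a", "b", "c", "d"], checkpoint)
```

In a real system the checkpoint would be written to durable storage, and how often you commit it determines whether you get at-least-once or exactly-once semantics.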
As modern applications attempt to deal with a growing amount of data, strategies like data streams have become standout solutions. With their ability to manage real-time data efficiently, adapt to changing conditions, and provide invaluable insights, utilizing data streams has become, and will continue to be, essential for crafting robust and streamlined digital systems.
How Does The TCP Handshake Work? (Recap)
Transmission Control Protocol (TCP) is a transport protocol used on top of the Internet Protocol (IP) to ensure reliable transmission of packets. Essentially, it ensures that all the data you send over the internet reaches its destination correctly and in order.
For devices on a network to exchange data, a connection must first be established. That's where the TCP handshake comes in.
The TCP handshake follows a three-step process to establish a connection:
1) SYN (Synchronize), 2) SYN-ACK (Synchronize-Acknowledge), 3) ACK (Acknowledge)
The TCP handshake uses a flag and sequence number at each step. The flag informs the receiving device of the segment's contents. The sequence number indicates the order of sent data, allowing the receiving end to reassemble data in the correct order.
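Application code never sends SYN or ACK segments itself; the operating system performs the three-way handshake when a client calls `connect()` and the server's `accept()` completes. A minimal sketch with Python's standard `socket` module on loopback (port numbers and the "hello" payload are arbitrary choices for the example):

```python
import socket
import threading

# The OS kernel exchanges SYN -> SYN-ACK -> ACK under the hood; the
# application only sees the result: an established, reliable connection.

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))       # port 0: let the OS pick a free port
server.listen(1)
host, port = server.getsockname()

def accept_one():
    conn, _addr = server.accept()   # returns once the handshake completes
    conn.sendall(b"hello")
    conn.close()

threading.Thread(target=accept_one).start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect((host, port))        # this call triggers the 3-way handshake

greeting = b""
while len(greeting) < 5:            # TCP is a byte stream; read until complete
    chunk = client.recv(5 - len(greeting))
    if not chunk:
        break
    greeting += chunk

client.close()
server.close()
```

Note how the sequence-number bookkeeping described above is entirely invisible here: once `connect()` returns, both sides have agreed on initial sequence numbers and can exchange ordered, reliable data.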
The Operation Behind Data Processing Systems (Recap)
🔸 Personalized suggestions from Netflix, Amazon, and your favorite online stores or apps are powered by data processing systems.
🔸 The workflow behind data processing systems can be broken down into four stages (in order): 1) data collection, 2) data processing, 3) data storage, 4) usage.
🔸 Data collection is primarily from the following sources: User interactions, server logs, database records, tracking pixels.
🔸 Data processing involves using real-time processing or batch processing systems to clean, transform, and validate data.
🔸 Data storage is often done with distributed file systems, columnar databases, or data warehouses.
🔸 Usage is generally seen in the form of personalization and targeted ads, business intelligence, and predictive analytics.
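The four stages above can be sketched end to end in a few lines. This is a toy illustration with invented event records, where a plain dict of column lists stands in for a real columnar store, and a "bought items" lookup stands in for a personalization signal:

```python
# 1) Collection: raw user-interaction events (e.g., from server logs).
events = [
    {"user": "u1", "item": "book", "action": "view"},
    {"user": "u1", "item": "book", "action": "buy"},
    {"user": "u2", "item": "lamp", "action": "view"},
    {"user": "u2", "item": None,   "action": "view"},  # malformed record
]

# 2) Processing: clean (drop malformed records) and validate.
cleaned = [e for e in events if e["item"] is not None]

# 3) Storage: columnar layout (one list per field), as a columnar DB would use.
store = {
    "user":   [e["user"] for e in cleaned],
    "item":   [e["item"] for e in cleaned],
    "action": [e["action"] for e in cleaned],
}

# 4) Usage: derive a personalization signal -- what each user has bought.
purchases = {
    user: item
    for user, item, action in zip(store["user"], store["item"], store["action"])
    if action == "buy"
}
```

In production each stage is a separate system (log shippers, Spark/Flink jobs, a warehouse, a recommendation service), but the data flow between them follows this same shape.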
How to Effectively Use Big O in Technical Interviews (Recap)
A few scenarios where Big O can be used: Live coding challenges, code walk-throughs, discussions about projects/solutions you've built, and discussions about your approach to programming & problem-solving.
When these scenarios come up, be sure to mention the Big O of your solution and how it compares to alternative approaches. Think out loud.
When comparing solutions, pay attention to the problem’s requirements. For example, linear time complexity may be completely fine when the input can never be too large. But if you’re dealing with big data, you’ll want to opt for something more efficient.
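A classic interview-ready example of this trade-off is membership testing: scanning a list is O(n) per lookup, while a hash-based set answers in O(1) on average. Both functions below return the same answer; only the cost differs as the input grows (the data size of 100,000 is an arbitrary choice for illustration):

```python
# Checking membership in a list scans elements one by one: O(n) per lookup.
# A set uses hashing, giving average O(1) lookups -- same answer, lower cost.

data_list = list(range(100_000))
data_set = set(data_list)

def contains_linear(items, target):
    for item in items:          # worst case: walks all n elements
        if item == target:
            return True
    return False

def contains_hashed(items, target):
    return target in items      # average O(1) when items is a set

found_linear = contains_linear(data_list, 99_999)
found_hashed = contains_hashed(data_set, 99_999)
```

In an interview, saying "I'd use a set here to turn each O(n) lookup into O(1), which matters because we do n lookups" is exactly the kind of thinking-out-loud comparison this section describes.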
Of course, the goal is to get the correct Big O notation that applies to your solution. But don't worry about getting it wrong. The point is to show that you are thinking about the efficiency and performance of your solution. Do this and you’ll be able to showcase an important trait that technical hiring managers look for: The ability to consider a solution's viability beyond whether it works or not. This shows maturity in your decision-making and approach to programming.
That wraps up this week’s issue of Level Up Coding’s newsletter!
Join us again next week where we’ll explore API architectural styles, tokenization, network protocols, and event-driven architecture.