Discover how we're achieving our vision of unifying data in real-time by exploring our team's thoughts, ideas, and experiences.
Windowing queries in stream processing play a pivotal role in handling time-series data. This post shows how to use streaming-friendly window functions in queries with nothing but ANSI SQL, emphasizing the importance of ordering for achieving optimal results on streaming datasets.
The Sliding Window Hash Join (SWHJ) algorithm joins potentially infinite streams while preserving order. It builds hash tables incrementally and stores only the build-side rows that fall within a sliding window, so streams can be processed efficiently without materializing all the data.
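As a rough illustration of the idea, the following Python sketch joins two timestamp-ordered streams on a key. It is a simplified model under stated assumptions, not the implementation described in the post: the function name `sliding_window_hash_join`, the `(ts, key, value)` row layout, and the symmetric `window` bound are all hypothetical choices made for this example.

```python
from collections import defaultdict, deque


def sliding_window_hash_join(build, probe, window):
    """Join two streams of (ts, key, value) rows, each sorted by ts.

    Yields (probe_row, build_row) pairs with equal keys whose timestamps
    differ by at most `window`. Output follows the probe side's order.
    """
    table = defaultdict(deque)  # key -> buffered build rows, oldest first
    arrival = deque()           # all buffered build rows, oldest first
    build_iter = iter(build)
    pending = next(build_iter, None)

    for p in probe:
        p_ts, p_key, _ = p

        # Incrementally ingest build rows that could match this probe row.
        while pending is not None and pending[0] <= p_ts + window:
            table[pending[1]].append(pending)
            arrival.append(pending)
            pending = next(build_iter, None)

        # Evict build rows that fell out of the window; because the probe
        # side is ordered, they can never match a later probe row either.
        while arrival and arrival[0][0] < p_ts - window:
            old = arrival.popleft()
            bucket = table[old[1]]
            bucket.popleft()  # the oldest row for this key is `old`
            if not bucket:
                del table[old[1]]

        # Probe the hash table: only in-window rows are still stored.
        for b in table.get(p_key, ()):
            yield (p, b)


build_rows = [(1, "a", "b1"), (5, "a", "b5"), (20, "a", "b20")]
probe_rows = [(4, "a", "p4"), (18, "a", "p18")]
joined = list(sliding_window_hash_join(build_rows, probe_rows, window=3))
```

Because both inputs arrive in timestamp order, eviction only ever removes rows from the front of the buffers, which is what keeps memory bounded by the window rather than by the stream length.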
The Count-Min Sketch uses hash functions to map streamed items into a 2D counter array. As the stream is processed, each item is hashed to one counter per row and those counters are incremented; an item's frequency is then estimated by taking the minimum count across the rows for that item's hashes.
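A minimal Python sketch of this structure might look as follows. The class name, the default `width` and `depth`, and the use of salted BLAKE2b as the per-row hash are assumptions made for illustration; any family of independent hash functions would serve.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width=1024, depth=4):
        self.width = width    # counters per row
        self.depth = depth    # number of rows, one hash function each
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive a different hash per row by salting with the row number.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate counters, so the minimum across rows
        # is the tightest available upper bound on the true frequency.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


sketch = CountMinSketch()
for word in ["a", "b", "a", "a"]:
    sketch.add(word)
estimate_a = sketch.estimate("a")
```

The estimate never falls below the true count; it can exceed it only when another item collides with the queried item in every row, which becomes less likely as `width` and `depth` grow.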
Our CEO Ozan recently joined an episode of the Streaming Caffeine podcast (Streaming Caffeine E10: Ozan from Synnada, about Arrow Datafusion, Rust, Databases, SQL, AI) to discuss our perspective on DataFusion and the future of data infrastructure.
This post explores how pioneering teams at Airbnb, Uber, and Apache Arrow overcame the data chasm, followed by an introduction to the Lean Data Stack paradigm as a way to build durable, economical, and flexible data systems.
The data ecosystem is rapidly expanding and fragmenting, posing integration challenges industry-wide. Many companies fall into a "data chasm": they abruptly need to grow their toolset from 2-4 tools to 15-20, which exacerbates complexity. Some organizations have pioneered methodologies to cross this chasm and extract value. How can others navigate it?
This blog post explores the AI/ML landscape, comparing it to a gold rush where the focus is on providing "specialized electricity" in the form of computing, storage, and networking resources.