Glossary
Technical terms and concepts in data engineering, machine learning, and AI
Algorithms/Data Structures
B-tree
A B-tree is a self-balancing tree data structure with high fan-out that keeps keys sorted, optimized for fast indexed key lookups and writes on disk storage.
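As a minimal illustration, the Python sketch below searches a hand-built two-level B-tree; the node layout and key values are made up, and insertion/rebalancing are omitted:

```python
from dataclasses import dataclass, field

@dataclass
class BTreeNode:
    keys: list                                     # sorted keys in this node
    children: list = field(default_factory=list)  # empty for leaf nodes

def btree_search(node: BTreeNode, key) -> bool:
    # Find the first slot whose key is >= the search key.
    i = 0
    while i < len(node.keys) and key > node.keys[i]:
        i += 1
    if i < len(node.keys) and node.keys[i] == key:
        return True
    if not node.children:   # reached a leaf without a match
        return False
    return btree_search(node.children[i], key)

# Root separates two leaves at key 20.
root = BTreeNode([20], [BTreeNode([5, 10]), BTreeNode([30, 40])])
print(btree_search(root, 30), btree_search(root, 7))  # True False
```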
Bloom Filter
A Bloom filter is a probabilistic data structure used to test set membership that is space-efficient compared to storing the full set.
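A minimal Python sketch of the idea, using salted SHA-256 digests as the k hash functions; the bit count and hash count here are arbitrary:

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        # Derive k bit positions from k salted digests of the item.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str):
        # All k bits set: "possibly present"; any bit clear: definitely absent.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("apache-arrow")
print("apache-arrow" in bf)  # True
print("missing-key" in bf)   # False, with high probability
```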
CAP Theorem
The CAP theorem states that a distributed data system can guarantee at most two of the three properties: consistency, availability, and partition tolerance. In practice, when a network partition occurs, the system must choose between consistency and availability.
Collision Resistance
Collision resistance is the property of a cryptographic hash function that makes it computationally infeasible to find two different inputs that map to the same output hash, so collisions cannot be caused intentionally.
Columnar Memory Format
Columnar memory format stores data in columns rather than rows, allowing for compression and reads optimized for analytics queries.
Consistent Hashing
Consistent hashing is a distributed hash technique that minimizes redistribution of keys when servers are added or removed, used in systems needing scalability and high availability.
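A minimal Python sketch of a consistent-hash ring with virtual nodes; the node names and counts are made up. Adding or removing a node only remaps the keys that fall in its ring segments:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each physical node is placed on the ring at many virtual points.
        self._ring = sorted((_hash(f"{n}#{i}"), n)
                            for n in nodes for i in range(vnodes))
        self._keys = [h for h, _ in self._ring]

    def node_for(self, key: str) -> str:
        # A key belongs to the first virtual node clockwise from its hash.
        idx = bisect.bisect(self._keys, _hash(key)) % len(self._keys)
        return self._ring[idx][1]

ring = HashRing(["db-1", "db-2", "db-3"])
print(ring.node_for("user:42"))
```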
Count Min Sketch
A Count-Min Sketch is a probabilistic data structure used to estimate item frequencies in data streams using sublinear space, at the cost of possible over-counting.
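A minimal Python sketch, using salted SHA-256 digests as the per-row hash functions; the width and depth are arbitrary:

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row: int, item: str) -> int:
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item: str) -> int:
        # Never under-estimates: collisions can only inflate a counter,
        # so the minimum across rows is the tightest available bound.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))

cms = CountMinSketch()
for word in ["a", "b", "a", "a"]:
    cms.add(word)
print(cms.estimate("a"))  # 3 (or slightly more under collisions)
```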
Data Cardinality
Data cardinality refers to the uniqueness of data values in a particular column or dataset, which has significant impacts on data storage, processing and querying.
Data Pruning
Data pruning refers to database techniques that eliminate irrelevant data during query processing to minimize resource usage and improve performance.
Distributed Hash Table
A distributed hash table (DHT) is a decentralized distributed system that partitions a key space across nodes and uses hash functions to assign ownership and locate data.
FNV Hash
The FNV (Fowler–Noll–Vo) hash is a fast, simple non-cryptographic hash function that mixes input bytes with multiply and XOR operations to achieve good distribution.
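A minimal Python sketch of 64-bit FNV-1a, the commonly used variant; the offset basis and prime are the published FNV constants:

```python
FNV_OFFSET_64 = 0xcbf29ce484222325
FNV_PRIME_64 = 0x100000001b3

def fnv1a_64(data: bytes) -> int:
    h = FNV_OFFSET_64
    for byte in data:
        h ^= byte                                  # fold in the next byte
        h = (h * FNV_PRIME_64) & 0xFFFFFFFFFFFFFFFF  # multiply mod 2^64
    return h

print(hex(fnv1a_64(b"hello")))
```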
Hash Functions
Hash functions are algorithms that map data of arbitrary size to fixed-size values called hashes in a deterministic, one-way manner for purposes like data integrity and database lookup.
Interval Arithmetic
Interval arithmetic is a method of computing with ranges of numbers rather than single values, propagating uncertainty through calculations and bounding rounding errors.
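A minimal Python sketch of interval addition and multiplication; endpoints here are plain floats, so the outward rounding a production implementation would apply is omitted:

```python
from dataclasses import dataclass

@dataclass
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        # Signs can flip bounds, so take the extremes of all endpoint products.
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

a = Interval(1.0, 2.0)
b = Interval(-0.5, 0.5)
print(a + b)  # Interval(lo=0.5, hi=2.5)
print(a * b)  # Interval(lo=-1.0, hi=1.0)
```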
MurmurHash
MurmurHash is a series of fast non-cryptographic hash functions optimized for hash tables and CPU cache performance.
Probabilistic Data Structures
Probabilistic data structures are space- and time-efficient data structures that use randomized algorithms to provide approximate answers to queries with bounded, provable error guarantees.
Skip List
A skip list is a probabilistic data structure that provides fast search and insertion over an ordered sequence by using a hierarchy of linked lists to skip over elements.
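A minimal Python sketch of skip-list search and insertion; the level cap and the p = 0.5 promotion probability are conventional choices:

```python
import random

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level  # one successor pointer per level

class SkipList:
    MAX_LEVEL = 16

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)
        self.level = 1

    def _random_level(self):
        # Promote each new node with probability 1/2 per level.
        lvl = 1
        while random.random() < 0.5 and lvl < self.MAX_LEVEL:
            lvl += 1
        return lvl

    def search(self, key) -> bool:
        node = self.head
        for i in range(self.level - 1, -1, -1):  # descend level by level
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key

    def insert(self, key):
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node        # last node before the new key, per level
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = Node(key, lvl)
        for i in range(lvl):        # splice in at every chosen level
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

sl = SkipList()
for k in [3, 1, 7, 5]:
    sl.insert(k)
print(sl.search(5), sl.search(4))  # True False
```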
xxHash
xxHash is an extremely fast non-cryptographic hash algorithm focused on speed and efficiency for checksums and hash tables.
Core Tech
Apache Arrow
Apache Arrow is a cross-language development platform for in-memory data, specifying a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations.
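For illustration, a small example with the pyarrow bindings (assuming pyarrow is installed); the column names and values are made up. Each column is stored as a contiguous Arrow array, which is what the columnar memory format entry above describes:

```python
import pyarrow as pa

table = pa.table({
    "city": ["Oslo", "Lima", "Oslo"],
    "temp_c": [3.1, 19.4, 2.8],
})
print(table.schema)            # language-independent columnar schema
print(table.column("temp_c"))  # a single contiguous column
```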
Apache Arrow DataFusion
Apache DataFusion is an extensible, high-performance data processing framework written in Rust, designed to efficiently execute analytical queries on large datasets. It uses the Apache Arrow in-memory data format.
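A quick sketch using the DataFusion Python bindings (assuming the datafusion package is installed; events.csv is a hypothetical file):

```python
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_csv("events", "events.csv")          # hypothetical source file
df = ctx.sql("SELECT count(*) FROM events")       # plans and optimizes the query
print(df.collect())                                # executes over Arrow batches
```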
Data Processing
Batch Processing
Batch processing is the execution of a series of programs or jobs on a set of data in batches without user interaction for efficiently processing high volumes of data.
DataFrame
A DataFrame is a two-dimensional tabular data structure with labeled columns and rows, used for data manipulation and analysis in data science and machine learning workflows.
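A small example using pandas, one common DataFrame implementation; the column names and values are made up:

```python
import pandas as pd

df = pd.DataFrame({"user": ["a", "b", "a"],
                   "spend": [10.0, 4.5, 7.25]})
print(df.groupby("user")["spend"].sum())  # labeled, column-wise aggregation
```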
Distributed Tracing
Distributed tracing is a method used to profile and monitor complex distributed systems by instrumenting apps to log timing data across components, letting operators analyze bottlenecks and failures.
ETL Data Processing
ETL (Extract, Transform, Load) data processing refers to the steps used to collect data from various sources, cleanse and transform it, and load it into a destination system or database.
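A toy end-to-end sketch in Python using only the standard library; orders.csv, its columns, and warehouse.db are hypothetical:

```python
import csv
import sqlite3

# Extract: read raw rows from a source file.
with open("orders.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: cleanse and normalize field values.
cleaned = [(r["id"], r["country"].strip().upper(), float(r["amount"]))
           for r in rows]

# Load: write the transformed rows into the destination database.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (id TEXT, country TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", cleaned)
conn.commit()
```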
Incremental Processing
Incremental processing involves continuously processing and updating results as new data arrives, avoiding having to recompute results from scratch each time.
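A minimal Python sketch that keeps per-key running aggregates up to date as records arrive, instead of re-scanning all data on each update:

```python
state = {}  # key -> (count, running_total)

def process(record):
    key, value = record
    count, total = state.get(key, (0, 0.0))
    state[key] = (count + 1, total + value)  # update in place, O(1) per record

for rec in [("a", 2.0), ("b", 1.0), ("a", 3.0)]:
    process(rec)
print(state)  # {'a': (2, 5.0), 'b': (1, 1.0)}
```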
Kappa Architecture
Kappa architecture is a big data processing pattern that uses stream processing for both real-time and historical analytics, avoiding the complexity of hybrid stream and batch processing.
Lambda Architecture
Lambda architecture is a big data processing pattern which combines both batch and real-time stream processing to get the benefits of high throughput and low latency querying.
Online Analytical Processing (OLAP)
Online analytical processing (OLAP) refers to the technology that enables complex multidimensional analytical queries on aggregated, historical data for business intelligence and reporting.
Unified Processing
Unified processing refers to data pipeline architectures that handle batch and real-time processing using a single processing engine, avoiding the complexities of hybrid systems.
Data Storage and Sources
Data Lake
A data lake is a scalable data repository that stores vast amounts of raw data in its native formats until needed.
Data Orchestrator
A data orchestrator is a middleware tool that facilitates the automation of data flows between diverse systems such as data storage systems (e.g. databases), data processing engines (e.g. analytics engines) and APIs (e.g. SaaS platforms for data enrichment).
Data Processing Engine
A data processing engine is a distributed software system designed for high-performance data transformation, analytics, and machine learning workloads on large volumes of data.
Data Warehouse
A data warehouse is a centralized data management system designed to enable business reporting, analytics, and data insights.
Document Store
A document store is a database that manages collections of JSON, XML, or other hierarchical document formats, providing querying and indexing on document contents.
Graph Database
A graph database stores data in a graph structure with nodes, edges and properties to represent and query relationships between connected data entities.
Key-value Store
A key-value store is a type of NoSQL database optimized for storing, retrieving and managing associative arrays of key-value pairs.
Message Broker
A message broker is a software system that facilitates communications between distributed applications and services by transferring messages in a reliable and scalable manner.
RDF Store
An RDF store is a graph database optimized for storing and querying RDF triple data to represent facts and relationships.
Relational Database
A relational database is a type of database that stores data in tables of rows and columns and provides access to it based on defined relationships between those tables.
Search Engine (Database)
A search engine database is designed to store, index, and query full text content to enable fast text search and retrieval.
Spatial Database
A spatial database is a database optimized to store, query and manipulate geographic information system (GIS) data like location coordinates, topology, and associated attributes.
SQL Compatibility
SQL compatibility refers to the degree to which a database or analytics system supports the SQL query language standard, enabling the use of standard SQL syntax and features.
Time-series Database ("TSDB")
A time-series database (TSDB) is a database engineered and optimized for handling time-series data, where each data point contains a timestamp.
Vector Database
A vector database is designed to efficiently store and query vector representations of data for applications like search, recommendations, and AI.
Query Execution
Distributed Execution
Distributed execution refers to techniques to execute database queries efficiently across clustered servers or nodes, dividing work to utilize parallel resources.
Execution Framework
An execution framework is a distributed system that automates and manages aspects like resource allocation, scheduling, fault tolerance and execution of large-scale computational jobs.
Inner Joins
An inner join is a type of join operation used in relational databases to combine rows from two tables based on a common column between them.
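A minimal Python sketch of the semantics; the tables and keys are made up:

```python
employees = [(1, "alice"), (2, "bob"), (3, "carol")]  # (dept_id, name)
dept_names = {1: "engineering", 3: "sales"}           # dept_id -> department

# Inner join on dept_id: only rows with a match on both sides survive.
inner = [(dept_id, name, dept_names[dept_id])
         for dept_id, name in employees
         if dept_id in dept_names]
print(inner)  # bob (dept 2) is dropped: no matching department
```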
Memory Management
Memory management refers to the allocation, deallocation and organization of computer memory resources for running programs and processes efficiently.
Outer Joins
An outer join returns all rows from one or both tables in a join operation, including those without matching rows in the other table. It preserves rows even when no related matches exist.
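A minimal Python sketch of a left outer join over the same made-up tables as the inner-join example above; unmatched left rows are kept and padded with None:

```python
employees = [(1, "alice"), (2, "bob"), (3, "carol")]  # (dept_id, name)
dept_names = {1: "engineering", 3: "sales"}           # dept_id -> department

# Left outer join: every employee row is preserved.
left_outer = [(dept_id, name, dept_names.get(dept_id))
              for dept_id, name in employees]
print(left_outer)  # bob keeps his row, with department None
```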
Parallel Execution
Parallel execution refers to techniques for speeding up database query processing by leveraging multiple CPUs, servers, or resources concurrently.
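A minimal Python sketch that splits an aggregation across worker processes; the chunk and worker counts are arbitrary:

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    return sum(chunk)  # each worker aggregates its own slice

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]  # divide the work four ways
    with ProcessPoolExecutor(max_workers=4) as pool:
        total = sum(pool.map(partial_sum, chunks))  # combine partial results
    print(total)
```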
Partitioning
Database partitioning refers to splitting large tables into smaller, independent pieces called partitions stored across different filegroups, drives or nodes.
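A minimal Python sketch of hash partitioning, one common scheme for assigning rows to partitions; the keys and partition count are made up:

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    # Hash the key so rows spread evenly across partitions.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

for key in ["user-1", "user-2", "user-3"]:
    print(key, "-> partition", partition_for(key, 4))
```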
Query Execution
Query execution is the process of carrying out the actual steps to retrieve results for a database query as per the generated execution plan.
Query Optimization
Query optimization involves rewriting and transforming database queries to execute more efficiently by performing cost analysis to find faster query plans.
User Defined Functions (UDF)
A user-defined function (UDF) is a programming construct that allows developers to create custom functions in a database, query language or programming framework to extend built-in functionality.
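A small example registering a scalar UDF with Python's built-in sqlite3 module; the function name and table are made up:

```python
import sqlite3

def reverse_text(s):
    # Custom logic the SQL dialect does not provide out of the box.
    return s[::-1] if s is not None else None

conn = sqlite3.connect(":memory:")
conn.create_function("reverse_text", 1, reverse_text)  # name, arity, callable
conn.execute("CREATE TABLE t (name TEXT)")
conn.execute("INSERT INTO t VALUES ('datafusion')")
print(conn.execute("SELECT reverse_text(name) FROM t").fetchone())
```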