Storage Format

| Account number | Last name | First name | Purchase (in dollars) |
|----------------|-----------|------------|-----------------------|
| 1001           | Green     | Rachel     | 20.12                 |
| 1002           | Geller    | Ross       | 12.25                 |
| 1003           | Bing      | Chandler   | 45.25                 |

Row-Oriented Storage

In a row-oriented DBMS, the data would be stored as

1001,Green,Rachel,20.12;1002,Geller,Ross,12.25;1003,Bing,Chandler,45.25

Best suited for OLTP (Online Transaction Processing) workloads, which read and write whole records at a time.

Column-Oriented Storage

In a column-oriented DBMS, the same data would be stored as

1001,1002,1003;Green,Geller,Bing;Rachel,Ross,Chandler;20.12,12.25,45.25

Best suited for OLAP (Online Analytical Processing) workloads, which scan and aggregate a few columns across many rows.

Compression: Since the data in a column tends to be of the same type (e.g., all integers, all strings), and often similar values, it can be compressed much more effectively than row-based data.
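For intuition, here is a small illustrative Rust sketch of run-length encoding, one of the simple schemes that columnar layouts make effective: within a single column, repeated values sit next to each other, so long runs collapse into (value, count) pairs. The function and data are made up for illustration.

```rust
// Run-length encode a column of values. Columnar layouts expose this kind
// of redundancy (e.g., a "country" column where values repeat), whereas a
// row layout interleaves unrelated fields and breaks up the runs.
fn rle(values: &[&str]) -> Vec<(String, usize)> {
    let mut runs: Vec<(String, usize)> = Vec::new();
    for &v in values {
        match runs.last_mut() {
            // Extend the current run if the value repeats...
            Some((last, count)) if last.as_str() == v => *count += 1,
            // ...otherwise start a new run.
            _ => runs.push((v.to_string(), 1)),
        }
    }
    runs
}

fn main() {
    let country_column = ["US", "US", "US", "US", "UK", "UK", "DE"];
    // Seven stored values collapse to three (value, run-length) pairs.
    println!("{:?}", rle(&country_column));
}
```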

Query Performance: Queries that only access a subset of columns can read just the data they need, reducing disk I/O and significantly speeding up query execution.

Analytic Processing: Columnar storage is well-suited for analytical queries and data warehousing, which often involve complex calculations over large amounts of data. Since these queries often only affect a subset of the columns in a table, columnar storage can lead to significant performance improvements.
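To make the two layouts concrete, here is a minimal Rust sketch (illustrative types, not a real DBMS) contrasting row-oriented storage, an array of record structs, with column-oriented storage, a struct of per-column vectors:

```rust
#![allow(dead_code)] // fields unused by main() are kept to mirror the table

// Row-oriented: each record's fields are stored together.
struct AccountRow {
    account_number: u32,
    last_name: String,
    first_name: String,
    purchase: f64,
}

// Column-oriented: each field lives in its own contiguous buffer.
struct AccountColumns {
    account_numbers: Vec<u32>,
    last_names: Vec<String>,
    first_names: Vec<String>,
    purchases: Vec<f64>,
}

fn main() {
    let rows = vec![
        AccountRow { account_number: 1001, last_name: "Green".into(), first_name: "Rachel".into(), purchase: 20.12 },
        AccountRow { account_number: 1002, last_name: "Geller".into(), first_name: "Ross".into(), purchase: 12.25 },
        AccountRow { account_number: 1003, last_name: "Bing".into(), first_name: "Chandler".into(), purchase: 45.25 },
    ];

    let cols = AccountColumns {
        account_numbers: vec![1001, 1002, 1003],
        last_names: vec!["Green".into(), "Geller".into(), "Bing".into()],
        first_names: vec!["Rachel".into(), "Ross".into(), "Chandler".into()],
        purchases: vec![20.12, 12.25, 45.25],
    };

    // OLTP-style access: fetch one whole record (row layout is ideal).
    println!("{} spent {}", rows[1].first_name, rows[1].purchase);

    // OLAP-style access: aggregate one column. Only `purchases` is read;
    // the name columns are never touched.
    let total: f64 = cols.purchases.iter().sum();
    println!("total purchases: {total:.2}");
}
```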

(Image source: https://mariadb.com/resources/blog/why-is-columnstore-important/)


CSV/TSV/Parquet

  • CSV: Comma-Separated Values
  • TSV: Tab-Separated Values

Pros

  • Tabular, row-oriented storage.
  • Human-readable and easy to edit manually.
  • Simple schema.
  • Easy to implement and parse (see the parsing sketch after the cons list).

Cons

  • No standard way to represent binary data.
  • No complex data types.
  • Large file sizes, since values are stored as uncompressed text.
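As a sketch of how easy CSV is to implement and parse (as noted in the pros), here is a minimal reader using the community `csv` crate; the dependency (`csv = "1"`) and the file name `sales.csv` are assumptions. Note that every field arrives as text, which is exactly the data-type limitation listed above.

```rust
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Open a CSV file with a header row (hypothetical file).
    let mut reader = csv::Reader::from_path("sales.csv")?;

    // The header gives column names, but no types -- CSV has no schema.
    println!("columns: {:?}", reader.headers()?);

    // Every field is a string; parsing "20.12" into f64 is up to the caller.
    for result in reader.records() {
        let record = result?;
        println!("{:?}", record);
    }
    Ok(())
}
```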

Parquet

Parquet is a columnar storage file format optimized for use with Apache Hadoop and related big data processing frameworks. Twitter and Cloudera developed it to provide a compact and efficient way of storing large, flat datasets.

Best suited for WORM (Write Once, Read Many) workloads.

The key features of Parquet are:

Columnar Storage: Parquet is optimized for columnar storage, unlike row-based files like CSV or TSV. This allows it to compress and encode data efficiently, making it a good fit for storing data frames.

Schema Evolution: Parquet supports complex nested data structures, and the schema can be modified over time. This provides flexibility when dealing with data that evolves.

Compression and Encoding: Because columnar storage groups similar values together, Parquet can apply highly efficient compression and encoding schemes (such as dictionary and run-length encoding), which can lead to significant storage savings.

Language Agnostic: Parquet is designed from the ground up to be language independent. Official libraries for reading and writing Parquet files exist for Java, C++, Python, and more.

Integration: Parquet is designed to integrate well with various big data frameworks. It has deep support in Apache Hadoop, Apache Spark, and Apache Hive and works well with other data processing frameworks.

In short, Parquet is a powerful tool in the big data ecosystem due to its efficiency, flexibility, and compatibility with a wide range of tools and languages.
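As a hedged sketch of these features in practice, the following writes the sample accounts table to a Parquet file with an explicit schema and Snappy compression, using the `arrow` and `parquet` crates from crates.io (the schema, file name, and codec choice are illustrative):

```rust
use std::error::Error;
use std::fs::File;
use std::sync::Arc;

use arrow::array::{ArrayRef, Float64Array, StringArray, UInt32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

fn main() -> Result<(), Box<dyn Error>> {
    // Typed schema -- Parquet stores this in the file's metadata.
    let schema = Arc::new(Schema::new(vec![
        Field::new("account_number", DataType::UInt32, false),
        Field::new("last_name", DataType::Utf8, false),
        Field::new("purchase", DataType::Float64, false),
    ]));

    // One in-memory batch; each column is its own typed array.
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(UInt32Array::from(vec![1001, 1002, 1003])) as ArrayRef,
            Arc::new(StringArray::from(vec!["Green", "Geller", "Bing"])) as ArrayRef,
            Arc::new(Float64Array::from(vec![20.12, 12.25, 45.25])) as ArrayRef,
        ],
    )?;

    // Write with Snappy, one of Parquet's built-in compression codecs.
    let props = WriterProperties::builder()
        .set_compression(Compression::SNAPPY)
        .build();
    let file = File::create("accounts.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, Some(props))?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```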

CSV vs Parquet

| Metric                      | CSV                  | Parquet                        |
|-----------------------------|----------------------|--------------------------------|
| File size (example dataset) | ~1 GB                | 100-300 MB                     |
| Read speed                  | Slower               | Faster for columnar operations |
| Write speed                 | Faster               | Slower, due to compression     |
| Schema support              | None                 | Strong, with metadata          |
| Data types                  | Basic                | Wide range                     |
| Query performance           | Slower               | Faster                         |
| Compatibility               | Universal            | Requires specific tools        |
| Use cases                   | Simple data exchange | Large-scale data processing    |

These metrics highlight the advantages of using Parquet for efficiency and performance, especially in big data scenarios, while CSV remains useful for simplicity and compatibility.

Apache Arrow (https://arrow.apache.org/)

Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead.

While Parquet is a storage format and Arrow is an in-memory format, they are often used together. Data stored in Parquet files can be read into Arrow’s in-memory format for processing, and vice versa.

Both formats are maintained by the Apache Software Foundation, and they are designed to complement each other. Arrow provides a standard in-memory format, while Parquet provides a standard on-disk format. Together, they enable efficient data processing workflows that involve both storage and in-memory analytics.
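Here is a minimal sketch of that Parquet-to-Arrow path using the `parquet` crate's Arrow reader. The file name (reusing `accounts.parquet` from the earlier writer sketch) and the projected column index are assumptions; the projection shows column pruning, where only the selected column's pages are decoded.

```rust
use std::error::Error;
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn Error>> {
    // Open the Parquet file and inspect its Arrow schema.
    let file = File::open("accounts.parquet")?; // hypothetical file
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
    println!("schema: {:?}", builder.schema());

    // Read only the third column (index 2). With columnar storage, the
    // other columns are never decoded from disk.
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [2]);
    let reader = builder.with_projection(mask).build()?;

    // Each item is an Arrow RecordBatch -- the in-memory columnar format.
    for batch in reader {
        let batch = batch?;
        println!("{} rows x {} columns", batch.num_rows(), batch.num_columns());
    }
    Ok(())
}
```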

Polars

Polars is a high-performance DataFrame library designed for Rust and Python, aiming to provide fast data manipulation capabilities similar to those found in libraries like Pandas for Python.

  • Performance: Polars is built for speed, leveraging Rust’s performance capabilities.

  • Lazy Execution: Polars supports lazy execution, allowing you to build complex query plans that are executed only when needed. This can optimize performance by minimizing unnecessary computations (see the sketch after this list).

  • Expressive API: Polars offers an expressive and flexible API for data manipulation, including support for operations like filtering, aggregation, joining, and more.

  • Interoperability: While Polars is native to Rust, it also has a Python API, making it accessible to a broader range of developers.
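Here is a minimal sketch of lazy execution with Polars in Rust, assuming the `polars` crate with the `lazy` and `csv` features enabled; exact method names vary slightly across Polars versions, and the `sales.csv` file with `last_name` and `purchase` columns is hypothetical.

```rust
use polars::prelude::*;

fn main() -> PolarsResult<()> {
    // Build a lazy query plan: nothing is read or computed yet. Polars can
    // optimize the whole plan, e.g., push the filter and column selection
    // down into the CSV scan.
    let plan = LazyCsvReader::new("sales.csv")
        .finish()?
        .filter(col("purchase").gt(lit(15.0)))
        .group_by([col("last_name")])
        .agg([col("purchase").sum().alias("total_purchase")]);

    // Execution happens only here, when the result is actually needed.
    let df = plan.collect()?;
    println!("{df}");
    Ok(())
}
```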

Here is a tabular comparison of Polars and Pandas:

| Feature | Polars | Pandas |
|---------|--------|--------|
| Language | Rust, with Python bindings | Python |
| Performance | High, due to parallel execution and memory efficiency | Generally slower for large datasets; single-threaded execution |
| Memory usage | More memory efficient | Higher memory usage |
| Lazy execution | Yes; supports lazy evaluation and query optimization | No; operations are executed immediately |
| API | Expressive and composable, with consistent behavior | Mature and versatile, with some inconsistencies |
| Type safety | Strong type safety, due to Rust | Dynamically typed |
| Memory safety | Ensured by Rust's ownership model | Relies on Python's garbage collector |
| Scalability | Better for large datasets and complex operations | Can struggle with very large datasets |
| Interoperability | Rust and Python; integrates with Apache Arrow | Primarily Python; integrates with many Python libraries |
| GroupBy operations | Fast and efficient, especially on large datasets | Slower; can be memory intensive |
| Handling large data | Handles larger-than-memory datasets more efficiently | Limited by memory size |
| Multithreading | Yes; utilizes multiple CPU cores | No; single-threaded execution |
| Data representation | Apache Arrow format | Native Pandas data structures |
| Robustness | Very robust, due to Rust's type and memory safety | Robust, but can have runtime errors |
| Ease of use | Requires familiarity with Rust concepts for advanced use | Easy to use, especially for those familiar with Python |
| Community and ecosystem | Growing community; less extensive ecosystem than Pandas | Large community; extensive ecosystem and support |

Conclusion

  • Polars is ideal for high-performance requirements, handling large datasets, and applications where memory efficiency and parallel execution are critical. It benefits from Rust's safety features and offers a powerful, composable API.
  • Pandas is great for general data manipulation and analysis tasks, with a mature and versatile API, extensive ecosystem, and ease of use, particularly for Python developers.

Choosing between Polars and Pandas depends on your specific needs, including performance requirements, dataset size, and preferred development language.

Demo

https://github.com/gchandra10/rust-polars-csv-dataframe-demo

Convert CSV to Parquet

https://crates.io/crates/csv2parquet

```sh
cargo install csv2parquet
csv2parquet sales_100.csv sales_100.parquet
```

Reading the resulting Parquet file back with the `parquet` crate:

```rust
use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;
use std::path::Path;

fn main() {
    // Open the Parquet file and build a row-oriented reader over it.
    let file = File::open(Path::new("sales_100.parquet")).unwrap();
    let reader = SerializedFileReader::new(file).unwrap();

    // Retrieve the schema and column names from the file metadata.
    let schema = reader.metadata().file_metadata().schema_descr();
    let columns: Vec<String> = schema
        .columns()
        .iter()
        .map(|c| c.name().to_string())
        .collect();

    // Print the header.
    println!("{}", columns.join(","));

    // Iterate over every row and print it.
    for record in reader.get_row_iter(None).unwrap() {
        println!("{:?}", record);
    }
}
```