
Python Engineering at Microsoft: Introducing Apache Arrow Support in mssql-python

May 4, 2026


Reviewed by Sumit Sarabhai

Fetching a million rows from SQL Server into a Polars DataFrame used to mean a million Python objects, a million GC allocations, and then throwing it all away to build a DataFrame. Not anymore. mssql-python now supports fetching SQL Server data directly as Apache Arrow structures – a faster and more memory-efficient path for anyone working with SQL Server data in Polars, Pandas, DuckDB, or any other Arrow-native library.

This feature was contributed by community developer Felix Graßl (@ffelixg), and we are thrilled to ship it.

Key Terms

API (Application Programming Interface): a source-code contract that defines how to call a function or library.

ABI (Application Binary Interface): a binary-level contract that specifies how compiled code is laid out in memory. Two programs built in different languages can share an ABI and exchange data directly – no serialization is needed.

Arrow C Data Interface: Apache Arrow's ABI specification – the standard that makes zero-copy data exchange between languages possible.

What Is Apache Arrow?

The key insight behind Apache Arrow is zero-copy language interoperability. Arrow defines a stable shared-memory layout – the Arrow C Data Interface, a cross-language ABI – that any language can produce or consume by exchanging a pointer, with no serialization, no copies, and no re-parsing. A C++ database driver and a Python DataFrame library can work on the exact same memory without either one knowing about the other.

Built on top of that, Arrow uses a columnar in-memory format: instead of representing a table as a list of rows, each row a collection of Python objects, Arrow stores all values for a column contiguously in a typed buffer. Nulls are tracked in a compact bitmap rather than per-cell None objects.

For a database driver, this means the entire fetch loop can run in C++ and write values directly into Arrow buffers – no Python object creation per row, no garbage-collector pressure. The DataFrame library receives a pointer to that memory and can begin operating on it immediately. Crucially, subsequent operations – filters, joins, aggregations – also work in place on those same buffers. A Polars pipeline reading from mssql-python never needs to materialize intermediate Python objects at any stage, making Arrow the right foundation for high-throughput data processing pipelines.
For users of mssql-python, this translates into four concrete benefits:

Speed: The columnar fetch path avoids Python object creation per row, which should make fetching noticeably faster for many SQL Server types – especially temporal types like DATETIME and DATETIMEOFFSET, where Python-side per-value conversions are eliminated entirely.

Lower memory usage: A column of one million integers is a single contiguous C array, not a million individual Python objects.

Seamless interoperability: Polars, Pandas (via ArrowDtype), DuckDB, Hugging Face datasets, and many other libraries all speak Arrow natively, enabling zero-copy hand-off between mssql-python and those tools.

Purely additive: Your existing fetchone, fetchmany, and fetchall code is completely unaffected. You opt in only where you need it.

Try it here:

```
pip install mssql-python
```

Calling all Python + SQL developers! We invite the community to try out mssql-python and help us shape the future of high-performance SQL Server connectivity in Python!

The Arrow Fetch APIs

Three APIs have been added to the Cursor object.

1. cursor.arrow_batch(batch_size=8192) -> pyarrow.RecordBatch

Fetches the next batch of up to batch_size rows as an Arrow RecordBatch and advances the cursor. RecordBatches are the building block for higher-level Arrow data types such as tables and the batch reader interface.

```python
import mssql_python

conn = mssql_python.connect(conn_str)
cursor = conn.cursor()
cursor.execute("SELECT * FROM SalesData")

partial_data = cursor.arrow_batch(batch_size=50000)  # pyarrow.RecordBatch
process(partial_data)
```

2. cursor.arrow(batch_size=8192) -> pyarrow.Table

Eagerly fetches the entire result set into a single Arrow Table. This is the simplest path and works well for analytics queries where the result fits comfortably in memory. However, because it materializes the full result set at once, it can cause high peak RAM usage or out-of-memory errors on very large or unbounded queries.
For large exports or ETL workloads, prefer cursor.arrow_reader() (streaming, fetches lazily) or cursor.arrow_batch() (fetch one batch at a time). In both cases, batch_size is a tuning knob: larger batches improve throughput but increase peak memory; smaller batches reduce memory at the cost of slightly more per-batch overhead.

```python
cursor.execute("SELECT customer_id, order_date, amount FROM Orders")
table = cursor.arrow()

# Zero-copy conversion to Polars
import polars as pl
df = pl.DataFrame(table)

# Or to Pandas with Arrow-backed dtypes
import pandas as pd
df = table.to_pandas(types_mapper=pd.ArrowDtype)
```

3. cursor.arrow_reader(batch_size=8192) -> pyarrow.RecordBatchReader

Returns a lazy RecordBatchReader. Batches are fetched only when the reader is iterated, enabling streaming over very large result sets. RecordBatchReader is also accepted directly by DuckDB, Lance, and other Arrow-native libraries.

```python
cursor.execute("SELECT * FROM LargeEventLog")
reader = cursor.arrow_reader(batch_size=100000)
for batch in reader:
    sink.write(batch)
```

Testing

We validated the Arrow fetch path against the standard Python row fetch path across a range of SQL Server types – numeric, temporal, string, and UUID – for both single-column and wide (20-column) tables. The full test script and results are available in the Resources section; we encourage you to run them on your own hardware to see the difference for your workload.

In our testing, the Arrow path was consistently faster for most SQL Server types. Temporal types showed the largest gains: DATETIME and DATETIMEOFFSET benefit significantly because the Arrow path handles timezone normalization and value encoding entirely in C++, eliminating per-value Python-side conversions. DATETIMEOFFSET in particular showed some of the most pronounced speedups we observed.

JSON Serialization Bonus

The Arrow path can also benefit API workloads that serialize results to JSON.
Instead of fetchall() + json.dumps(), fetch via cursor.arrow(), wrap in a Polars DataFrame, and call df.write_json() – the entire pipeline bypasses Python objects and can be noticeably faster, especially for types like DATETIMEOFFSET.

NVARCHAR on Linux

Our Linux tests show longer fetch times for NVARCHAR due to the current UTF-16 to UTF-8 conversion path. On Windows, NVARCHAR fetches consistently faster with Arrow. A fix is targeted for a follow-up release.

Getting Started

Install or upgrade mssql-python, then add pyarrow:

```
pip install mssql-python pyarrow
```

For IDE type hints and static type checking:

```
pip install pyarrow-stubs
```

Then swap in cursor.arrow() wherever you would have called fetchall() and converted to a DataFrame. Your existing code is completely unaffected – Arrow support is purely additive.

```python
import mssql_python
import polars as pl

conn = mssql_python.connect(conn_str)
cursor = conn.cursor()
cursor.execute("SELECT * FROM dbo.LargeSalesTable")

df = pl.DataFrame(cursor.arrow())
print(df.describe())
```

What's Next

One known area we are actively working to improve is NVARCHAR performance on Linux. SQL Server returns Unicode string data in UTF-16 encoding, which the driver must convert to UTF-8 before handing it to Arrow. On Windows this conversion uses a native system API that is very fast, but the current Linux code path goes through a slower chain of intermediate steps. As a result, NVARCHAR columns on Linux show longer fetch times than the Python fetch path – the opposite of every other type. A fix using a more efficient codec is in progress for a follow-up release. On Windows, our tests show NVARCHAR fetching noticeably faster with Arrow, and Linux will follow.

A Note of Thanks

This feature was contributed by Felix Graßl (@ffelixg), the author of zodbc, his own Zig-based ODBC driver. His deep familiarity with ODBC and Arrow made this a thorough, well-tested contribution covering both Linux and Windows, and all three fetch patterns.
We are very grateful for his work and the care he brought to this feature.

Resources

mssql-python on GitHub
Arrow Support PR
Full test script and results
Apache Arrow C Data Interface
zodbc – Felix's Zig-based ODBC driver

Try It and Share Your Feedback!

We invite you to:

Check out the mssql-python driver and integrate it into your projects.
Share your thoughts: Open issues, suggest features, and contribute to the project.
Join the conversation: GitHub Discussions | SQL Server Tech Community.

Use the Python Driver with a Free Azure SQL Database

You can use the Python driver with the free version of Azure SQL Database! Perfect for testing, development, or learning scenarios without incurring costs.

Deploy Azure SQL Database for free
Deploy Azure SQL Managed Instance for free

Have questions or feedback? Open an issue or discussion on GitHub, or reach out to the team at mssql-python@microsoft.com.
