Your Delta tables are drowning. Thousands of tiny Parquet files pile up after every streaming microbatch. Power BI dashboards stall on cold-cache queries. SQL analytics endpoints grind through fragmented row groups. And somewhere in the middle of the medallion architecture, a Spark job is rewriting perfectly good files because nobody told it they were already compacted.
This is the small-file problem at scale — and in Microsoft Fabric, where a single Delta table can serve Spark, SQL analytics endpoint, Power BI Direct Lake, and Warehouse simultaneously, it becomes a cross-workload survival situation. Microsoft recently published a comprehensive cross-workload table maintenance guide that provides a clear map out. Here’s how to use it.
Every Engine Wants Something Different
The core challenge is that each consumption engine has a different idea of what an “optimally sized” file looks like. Get this wrong and you optimize for one consumer while punishing another.
Here’s the terrain:
- Spark reads efficiently across a wide range — 128 MB to 1 GB depending on table size. V-Order isn’t required and adds 15–33% write overhead. Spark cares about parallelism, not VertiPaq encoding.
- SQL analytics endpoint and Warehouse want files around 400 MB with roughly 2 million rows per row group, plus V-Order enabled for an approximate 10% read improvement.
- Power BI Direct Lake is the most demanding consumer. It needs V-Order (delivering 40–60% cold-cache improvement), row groups of 8 million+ rows, and minimal file count to reduce transcoding overhead.
If you serve all three from the same Gold table, you need to make deliberate tradeoffs — or maintain multiple copies optimized for different patterns. Storage is cheap relative to compute. Compute wasted on bad file layouts is not.
The Three Commands That Keep You Alive
Table maintenance in Fabric boils down to three operations: OPTIMIZE, VACUUM, and the configuration pair of auto-compaction and optimize write. Each one addresses a different failure mode.
OPTIMIZE: Bin Compaction
OPTIMIZE consolidates small files into larger ones. It is your primary weapon against file fragmentation:
-- Basic compaction
OPTIMIZE schema_name.table_name
-- With V-Order for Power BI consumers
OPTIMIZE schema_name.table_name VORDER
-- With Z-Order for selective filter queries
OPTIMIZE schema_name.table_name ZORDER BY (region, event_date)
A critical detail: OPTIMIZE is a Spark SQL command. It runs in notebooks, Spark job definitions, and the Lakehouse Maintenance UI. You cannot run it from the SQL analytics endpoint or Warehouse SQL editor.
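If you drive maintenance from a notebook, the same statements go through spark.sql. A minimal sketch, with schema_name.gold_table standing in for your own table:
# Run compaction from a PySpark notebook cell; the same call works in a
# Spark job definition. The table name is a placeholder.
spark.sql('OPTIMIZE schema_name.gold_table VORDER')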
Before you optimize blindly, use the dry-run option to assess scope:
OPTIMIZE schema_name.table_name DRY RUN
This returns the files eligible for rewriting without touching the table — essential for estimating cost before committing compute.
VACUUM: Dead File Cleanup
After OPTIMIZE rewrites files, the old versions remain on disk for time travel. VACUUM removes files the Delta log no longer references:
-- Default 7-day retention
VACUUM schema_name.table_name
-- Explicit retention
VACUUM schema_name.table_name RETAIN 168 HOURS
The default seven-day retention exists for good reason: concurrent readers and writers may still reference those files. Drop below seven days and you risk reader failures or table corruption. If you must shorten retention, set spark.databricks.delta.retentionDurationCheck.enabled to false — but think carefully before you do.
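If you do decide to shorten retention, the safety check has to be relaxed first. A minimal session-level sketch, assuming you have already confirmed that no concurrent readers or writers depend on the older file versions:
# Relax the retention check only after confirming nothing still reads the
# older versions; the 48-hour value below is purely illustrative.
spark.conf.set('spark.databricks.delta.retentionDurationCheck.enabled', 'false')
spark.sql('VACUUM schema_name.table_name RETAIN 48 HOURS')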
Auto-Compaction + Optimize Write: Prevention Over Cure
Rather than waiting for file fragmentation to become a problem, these two features prevent it during ingestion:
Optimize write performs pre-write compaction, generating fewer, larger files at write time:
spark.conf.set('spark.databricks.delta.optimizeWrite.enabled', 'true')
Auto-compaction evaluates partition health after each write and triggers synchronous compaction when fragmentation is detected:
spark.conf.set('spark.databricks.delta.autoCompact.enabled', 'true')
Auto-compaction is broadly beneficial for most ingestion pipelines. Microsoft’s documentation recommends it over manually scheduled OPTIMIZE jobs for most workloads, noting it “generally outperforms scheduled compaction jobs at maximizing read/write performance.”
Optimize write, however, is workload-dependent. It adds overhead at write time to coalesce small output files into larger ones. This is valuable for write patterns that naturally produce many small files — streaming microbatch jobs, high-frequency small appends, and similar patterns. For workloads that already produce reasonably sized files (e.g., large batch ETL writing well-partitioned data), optimize write adds overhead without meaningful benefit. Do not enable it by default — evaluate your write pattern first.
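One way to evaluate a write pattern is to look at how many files each recent write added, which DESCRIBE HISTORY exposes through operationMetrics. A rough sketch; the exact metric keys vary by operation type, so read them off your own output:
from pyspark.sql import functions as F
# Inspect how many files recent writes added; the relevant key inside
# operationMetrics (e.g., 'numFiles' or 'numAddedFiles') varies by operation.
history = spark.sql('DESCRIBE HISTORY schema_name.table_name')
(history
    .where(F.col('operation').isin('WRITE', 'STREAMING UPDATE', 'MERGE'))
    .select('version', 'timestamp', 'operation', 'operationMetrics')
    .show(20, truncate=False))
Many writes that each add only a handful of small files is the pattern optimize write is designed for; large, infrequent batch writes generally are not.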
The Medallion Layer Checklist
The right maintenance strategy depends on where the table sits in your medallion architecture. Here is a concrete, layer-by-layer breakdown:
Bronze (Landing Zone)
- Priority: Ingestion speed
- Auto-compaction: Enable (optional; skip it only if raw ingestion speed is the sole priority)
- Optimize write: Workload-dependent — enable only for write patterns that produce many small files (e.g., streaming microbatch, high-frequency small appends). Do not enable by default.
- V-Order: No (unnecessary write overhead)
- Liquid Clustering: No
- Target file size: Use Adaptive Target File Size (ATFS), which dynamically calculates the ideal target. No manual tuning needed for most workloads.
- Scheduled OPTIMIZE: Optional
- Rule: Never serve Bronze tables directly to SQL analytics endpoint or Power BI Direct Lake.
Silver (Curated Zone)
- Priority: Balance ingestion and query performance
- Auto-compaction: Enable
- Optimize write: Workload-dependent — enable for streaming or small-write ingestion patterns; skip for batch ETL that already produces well-sized files.
- V-Order: Optional (enable if SQL or Power BI consumers query this layer)
- Liquid Clustering or Z-Order: Recommended
- Target file size: Use Adaptive Target File Size (ATFS) as the default. ATFS dynamically calculates the ideal file size, so there is no need to specify a target manually; only consider a user-defined target (e.g., 128–256 MB) in advanced hyper-tuning scenarios.
- Scheduled OPTIMIZE: Generally unnecessary when both auto-compaction and ATFS are enabled. With ATFS, auto-compaction and OPTIMIZE converge on the same dynamic target, so auto-compaction already does the work a scheduled OPTIMIZE would. Without ATFS, a scheduled run still matters: auto-compaction defaults to a 128 MB target while OPTIMIZE defaults to 1 GB, leaving a compaction gap. Reserve scheduled OPTIMIZE for that case, or for tables where auto-compaction is disabled.
Gold (Serving Zone)
- Priority: Read performance for analytics
- Auto-compaction: Enable
- Optimize write: Workload-dependent — enable for streaming or small-write ingestion into Gold tables; not required for batch loads that already produce appropriately sized files.
- V-Order: Required for Power BI Direct Lake; beneficial for SQL
- Liquid Clustering: Evaluate the tradeoff — provides flexibility but has high compaction cost in Runtime 1.3 (see LC section). Partitioning is often the better choice until Runtime 2.0. Use LC only when you need to evolve clustering keys or query patterns are unpredictable.
- Target file size: Use Adaptive Target File Size (ATFS) as the default; it selects the right file size based on your table and workload characteristics. Only deviate to a user-defined target in advanced hyper-tuning scenarios.
- Scheduled OPTIMIZE: As with Silver, generally unnecessary when both auto-compaction and ATFS are enabled, since both operations converge on the same dynamic target. Without ATFS, a scheduled OPTIMIZE may still be needed to close the gap between auto-compaction's 128 MB default and OPTIMIZE's 1 GB default.
For Gold tables serving multiple consumers, these are the target characteristics to keep in mind when hyper-tuning beyond ATFS:
| Consumer | V-Order | Target File Size | Row Group Size |
| --- | --- | --- | --- |
| SQL analytics endpoint | Yes | 400 MB | 2M rows |
| Power BI Direct Lake | Yes | 400 MB–1 GB | 8M+ rows |
| Spark | Optional | 128 MB–1 GB | 1–2M rows |
Note: For most workloads, Adaptive Target File Size (ATFS) will dynamically select an appropriate target across these consumers. The table above is a reference for advanced tuning only.
V-Order: Know When to Pay the Tax
V-Order applies VertiPaq-compatible sorting, encoding, and compression at write time. The performance gains for Power BI Direct Lake — 40–60% on cold-cache queries — make it indispensable for Gold-layer tables feeding dashboards. But V-Order adds 15–33% to write time and provides no inherent benefit for Spark-to-Spark pipelines.
The decision framework:
- Gold tables → Power BI or SQL consumers: V-Order on.
- Bronze/Silver tables → Spark pipelines only: V-Order off.
- Mixed consumers: Maintain separate copies — a Spark-optimized Silver table and a V-Ordered Gold table.
Set V-Order at the table level for consistency across sessions and jobs:
ALTER TABLE schema_name.gold_table
SET TBLPROPERTIES ('delta.parquet.vorder.enabled' = 'true')
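To confirm the setting actually landed on the table, read the properties back:
# Verify that V-Order (and any other maintenance properties) are set.
spark.sql('SHOW TBLPROPERTIES schema_name.gold_table').show(truncate=False)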
Liquid Clustering vs. Partitioning vs. Z-Order
Liquid Clustering (LC) provides flexibility where partitioning is rigid. With LC, you can change clustering keys without rewriting the entire table, and it can deliver better file skipping for queries that don’t align neatly with partition boundaries. Define it at table creation:
CREATE TABLE schema_name.events (
id INT,
category STRING,
event_date DATE
) CLUSTER BY (category)
But that flexibility comes at a significant cost in Fabric Runtime 1.3. The underlying Delta 3.2 LC implementation reclusters all data on every OPTIMIZE run until groups of clustered files exceed 100 GB. Most tables never cross that threshold, so each pass rewrites the same data again: compaction time grows linearly with data volume across hundreds of OPTIMIZE iterations, and there is no way around it in the current runtime.
For most scenarios in Runtime 1.3, partitioning remains the better choice. If your query patterns are well-understood and stable — which covers the majority of production analytics workloads — static partitioning gives you equivalent or better file skipping at a fraction of the maintenance cost. LC makes sense when you genuinely need the flexibility to evolve clustering keys over time, or when your query patterns are unpredictable — but understand that you are paying for that flexibility with linearly growing compaction overhead on every OPTIMIZE run.
Use Z-Order when your table is already partitioned (Liquid Clustering does not work with partitioned tables) or when queries filter on two or more columns together.
One critical gotcha regardless of approach: data is only clustered when OPTIMIZE runs. Regular write operations do not apply clustering. Without a compaction strategy, you get zero benefit from Liquid Clustering — the layout never materializes.
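In practice that means every clustered table needs a recurring compaction job. A minimal scheduled notebook cell for the events table defined above:
# Clustering is applied only during compaction, so schedule this for any
# CLUSTER BY table (for example, from a pipeline or scheduled notebook).
spark.sql('OPTIMIZE schema_name.events')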
Diagnosing Table Health
Before optimizing anything, assess where you stand:
# Pull table-level metrics from the Delta log
details = spark.sql("DESCRIBE DETAIL schema_name.table_name").collect()[0]
size_gb = details['sizeInBytes'] / (1024**3)
avg_file_mb = (details['sizeInBytes'] / details['numFiles']) / (1024**2)
print(f"Table size: {size_gb:.2f} GB")
print(f"Number of files: {details['numFiles']}")
print(f"Average file size: {avg_file_mb:.2f} MB")
Healthy tables have evenly distributed file sizes within 2× of each other. Files under 25 MB signal fragmentation. Files over 2 GB reduce parallelism. Use DESCRIBE HISTORY to review write patterns and check whether auto-compaction has been running.
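To see whether auto-compaction has been firing, scan the history for OPTIMIZE entries and inspect their parameters; automatic runs are typically flagged in operationParameters, though the exact flag name is something to verify on your runtime:
# Recent OPTIMIZE operations; auto-compaction runs usually carry an 'auto'
# marker in operationParameters (exact key may vary by runtime version).
history = spark.sql('DESCRIBE HISTORY schema_name.table_name')
(history
    .where("operation = 'OPTIMIZE'")
    .select('version', 'timestamp', 'operationParameters')
    .show(20, truncate=False))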
Set It at the Table Level
A final, critical best practice: prefer table properties over session configurations. Session settings only apply to the current Spark session and disappear when the session ends. Table properties persist across sessions and ensure consistent behavior regardless of which job or notebook writes to the table:
CREATE TABLE schema_name.optimized_table (
id INT,
data STRING
) TBLPROPERTIES (
'delta.autoOptimize.autoCompact' = 'true',
'delta.parquet.vorder.enabled' = 'true'
)
For tables with write patterns that produce many small files (streaming, high-frequency appends), also add optimize write:
ALTER TABLE schema_name.streaming_table
SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true'
)
This separation ensures optimize write is only applied where it provides value, rather than adding unnecessary write overhead across all tables.
The Bottom Line
Table maintenance in Fabric is not a set-it-and-forget-it operation. It is a deliberate strategy tied to your data’s lifecycle: fast ingestion at Bronze, balanced reads at Silver, and tuned-to-the-consumer performance at Gold. The tools — OPTIMIZE, VACUUM, auto-compaction, V-Order, Liquid Clustering — are all available. The question is whether you deploy them with intention.
Start by auditing your Gold tables. Check file sizes and distributions. Enable auto-compaction at the table level and use Adaptive Target File Size (ATFS) to let the engine dynamically determine the right file target — this eliminates most manual tuning and makes separate scheduled OPTIMIZE runs unnecessary for tables with auto-compaction enabled. Enable optimize write selectively — only on tables with write patterns that produce small files (streaming, frequent small appends). Apply V-Order where Power BI or SQL consumes the data. And run VACUUM weekly to reclaim storage.
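As a starting point, a weekly maintenance pass can be a short notebook loop over the Gold tables you care about. A rough sketch with placeholder table names:
# Illustrative weekly maintenance over a hand-picked list of Gold tables.
gold_tables = ['gold.sales_daily', 'gold.customer_360']  # placeholders
for table in gold_tables:
    # Compact with V-Order for Power BI and SQL consumers
    spark.sql(f'OPTIMIZE {table} VORDER')
    # Reclaim files outside the default 7-day retention window
    spark.sql(f'VACUUM {table}')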
Your tables will thank you. Your dashboards will thank you faster.
This post was written with help from Claude Opus 4.6