Excellent distinction between classification and structure. Too many orgs treat layer patterns as a substitute for true integration architecture. I've seen this play out where teams adopt medallion labeling but then struggle when business keys change or source systems merge, because no underlying immutable record exists. The category error point is crucial: naming conventions can't replace historical preservation logic.
"WHY CALLING MEDALLION AN ARCHITECTURE IS A CATEGORY ERROR
Architecture governs structure. Classification labels state. These are not interchangeable."
Amen!!!!!
How can I like and applaud this harder and louder? Beautifully written!
"Databricks describes Medallion as a data organization and refinement approach using Bronze, Silver, and Gold to indicate raw, refined, and curated data states."
Hi Bill (and other readers),
Medallion is a marketing gimmick, no more, no less. They are perfectly entitled to do that. It's fine. I wish I was "amazed" that anyone confuses Medallion with "architecture" but given the state of our industry? I am not.
Just FYI.
I tried to get my ETL software to work on Databricks. I didn't have any luck. If you know anyone over at Databricks who would like to give it a try, they are welcome to. It's free, and the free version can map source to target at around 6-8K fields per 220-hour work month.
We have run my ETL software on Snowflake, and you / Dan are welcome to let Kent Graziano know about that. It seemed to work, but we didn't spend much time testing it. Again, anyone at Snowflake who wants to give it a try is welcome to. It's all free source code now.
Why would they want to give it a try?
Because my ETL software vastly reduces the development and support costs of getting data into their databases. That means there is more money available to spend on their databases. Pretty simple reason really.
Also just FYI.
I have a version I use for myself where we now achieve mapping rates of around 12,000 fields per 220-hour work month. I know that sounds like I am smoking meth, but we have actually had (very long, 16-hour) days where we have cracked mapping 1,000 fields in a day. Given that 1,000 fields per work month was my standard from 1995 to 2017, being able to do that in one day is really something.
Basically, I have cracked the nut, which means we can now map all fields in what I call "large operational systems" across to a target dimensional data warehouse in an economically viable fashion.
I can't talk about this in public. But one of my clients now has a data warehouse model in excess of 100,000 fields. That model is expected to get to 250K fields later this year.
As you and I both know, the idea of having a data warehouse data model with 250K+ fields in it sounds ridiculous. But my prediction is that it's going to happen this year.
Our data warehousing industry is in poor shape with so many failed projects and so much negativity about it. It's a bit sad given that when done properly a data warehouse will deliver very significant business value every time.
I have never had a customer fail to make a lot of money out of their data warehouse when they took my advice. When they don't take my advice? Most of them not only pay a heavy price for the failed project but then pay a heavy price for not knowing how to run their business properly.
My most famous such case is Electronic Arts. I was there for nearly all of 2006 and it was VERY political. They just would not do as I advised them. And in 2008 they lost half their market cap because they refused to implement "market sensing" as I had recommended, and they were left way behind in the transition to internet-based gaming.
Another excellent article, Bill. Nice to see you here on Substack given I am still banned from LinkedIn.
Bill, Dan,
I fully agree and I generally label Medallion as "M-Architecture" or Marketing masquerading as Architecture.
Medallion does not prescribe anything, it just labels hoped-for states of data.
All projects plan to have "data as we got it", "data that has been somehow improved", "data that we consider ready for consumption" so it applies everywhere, saying nothing really useful. It does not even help in suggesting that original data should be stored and kept immutable.
Developers are eager to "adopt it" as it lets them go on doing the same stuff they always did (good or bad), but with a nice name that their manager can like and feel good about understanding.
As you said, I am sure the consequences of taking Medallion as an architecture will be felt hard. I hope more teams are enlightened about the difference from real Architecture.
Ciao, Roberto
Although I largely agree with much of what this article says, I can't help thinking that some metaphors have been mixed here. Specifically, I'm not entirely sure what the data model has to do with the data warehouse's architecture.
The best analogy for a data warehouse is a physical warehouse. The architecture of the warehouse includes the building and its layout. It should include places where goods are received and dispatched; internal lanes within the space for moving goods; the storage itself; and any ancillary services like electricity, water, restrooms, kitchen, etc. What isn't part of the architecture is the stuff stored in the warehouse, the goods placed in bins positioned on shelves.
The "data model" equivalent in a warehouse is the organising scheme for those goods. If I have a widget, then it goes in this location; if I have a gadget, it goes over there. How is that architecture? If the warehouse is emptied of its contents, does it still have recognisable architecture? Of course it does. If we come up with a different organising scheme and put the goods back into the warehouse in different locations, has the architecture changed? Of course not!
However, the modelling paradigm is part of the architecture. How we build the warehouse is influenced by how we intend to structure the data. A Kimball warehouse has different architectural needs than Data Vault or Hook, but the fundamentals of the warehouse of all three will not be dissimilar.
So, can we PLEASE stop including data modelling in the conversations about architecture? They really aren't the same thing.
The vendors are selling shortcuts at every corner of the data warehouse landscape. Snowflake was sold as the data warehouse database and arrives as an empty box. Data Lakes (then Data Lakehouses) encouraged us to think of the landing zone as the only stage. Medallion promoted further refinement without lineage. Data meshes arrived to assure us that distributed marts (that did not reconcile) were solid. Has the age of the Data Vault arrived to restore order?