[ SysAdmin  Philosophy  DevOps  DataScience  BigData  ]

Data Has Gravity

The Obsession

Let’s face it: we love data. Even before the internet we had ever-expanding digital storage systems, and we marveled at what percentage of known books could be stored digitally, and where. That only points to our earlier obsession with hard copy. We still mourn the loss of the Library of Alexandria, thousands of years later. While we can debate exactly what the root of this is, most can agree that we tie our legacy, our generational success, to our ability to write down and pass on wisdom. Similarly, we see data as a way to find new wisdom: if we just gather enough evidence, we can form new thoughts and ideas to transform the world around us (which, again, ties back to legacy and the striving for immortality).

The Data Struggle

There is a price for this obsession, though. Storing data has never been easy or free. Furthermore, the more there is, the harder it becomes to protect. The Library of Alexandria would have been impossible to keep safe from every invader, and the only protection that could have hedged against its loss was replication. But even replication has its flaws, especially when you consider the history of book burnings in Germany during Hitler’s reign. If a large enough force wants data destroyed, destruction is almost impossible to completely hedge against. In the modern era this is made even more challenging by scale. While it might have been impossible for mere mortals to ever read every page in the great libraries throughout history, the notion of reading “the whole internet” would, by comparison, be akin to expecting a single ant to eradicate an entire species of elephant.

Thousands of years ago we relied on sages and priests to guide us through the “relevant” data in our libraries, and later educators would feed us recommendations and distillations of meaningful works so that our journey of knowing at least approached something palatable. In the digital world, we attribute much of the adoption and growth of our communications to our ability to find meaningful data sources via search engines. They are our digital priests, our teachers and educators. They tell us what is worth reading and distill information into useful compounds. This mechanism has come with a price, though. With the advent of scalable internet search came a wave of data growth almost impossible to comprehend. It is like comparing plants and animals to stars and planets: the orders of magnitude are so different that it is hard to imagine the two scales even comprehending each other.

Byte by Byte

The expansion of data over the last half-century has been relentlessly exponential. This was made possible thanks to assumptions like Moore’s Law: the observation that digital systems roughly double in capability every two years. One could argue that for many years this law held as a self-fulfilling prophecy, in which the industry moved under its expectations, and as challenges once thought insurmountable arose, a disproportionate amount of effort was poured into overcoming them. That gamble has led to where we now keep Moore’s Law around like a mummy in a casket, doing our best to preserve something everyone knows is dead. As this tidal wave of need rises from the digital sea and crashes against the shores of capacity, we are learning that our actions are not without consequences. Did you ever put photos on Photobucket? Go try to load a YouTube video you made more than 8 years ago that only had a handful of views from family and friends. Digital storage warehouses bet big on data scaling, and in the short term they won. The big lesson they have learned, though, is that not all data is worth keeping, and it is almost impossible for them to manage the debt of worthless data at scale.
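To make the doubling concrete, here is a minimal back-of-the-envelope sketch in Python. The doubling period and time horizons are illustrative assumptions, not measured industry data:

```python
# Back-of-the-envelope: compound growth under a "double every two years" assumption.
# All figures are illustrative, not measured.

def moores_law_factor(years: float, doubling_period: float = 2.0) -> float:
    """Return the cumulative growth factor after `years`."""
    return 2 ** (years / doubling_period)

if __name__ == "__main__":
    for years in (10, 20, 50):
        print(f"{years:>2} years -> ~{moores_law_factor(years):,.0f}x")
    # 10 years -> ~32x, 20 years -> ~1,024x, 50 years -> ~33,554,432x
```

Fifty years of doubling compounds to a factor of over 33 million, which is why even a trickle of worthless data at the start becomes an unmanageable debt at the end.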

This Is Fine

The result of our appetite for data, and the subsequent encouragement to hoard every byte ever generated by digital product platforms, is that we have gone from living in a data desert to a data jungle. Furthermore, services like YouTube have shown evidence that data truly has gravity. A single piece of data, like a single rock, might seem relatively benign by itself, and piling up enough of it to make a mountain can look majestic. The reality unexpected by many, though, is that once you have more than a mountain range, it starts pulling back, hard. Once data reaches a critical mass, it starts giving birth to its own objects in orbit. We are quickly reaching a reality in which you can’t throw a rock off the data planet anymore; it is stuck there forever. Organizations and corporations toss around trendy words like Data Lake as though they are speaking magical incantations, as though connecting the right tools will solve the underlying data management challenges. In reality, the architects, engineers, and administrators suffer the same fate as anyone burdened with the weight of the world on their shoulders. All the data lake believers have succeeded in doing is creating satellites around the data planets, which allow reflections across the “lake” to bounce between the important parties willing to also pay for satellites. This is why we end up with the well-known data swamp: if the “lake” isn’t kept pure (which is almost impossible at scale due to the gravity problem), then building the features that connect it becomes useless. Garbage in, garbage out.

Root of All Evil

If we can start to see the problems with assuming infinite expansion of storage, and thus the impossibility of keeping all data forever, then we can start attempting to discover solutions. The solution for the teenager who filled up their hard drive was a simple one: remove the lowest-priority data. This could work in theory today, but beyond being a daunting task nobody wants to do unless forced to, we have other issues preventing it at scale. What happens when the owner of data at a company moves on to another company? Even transferring data ownership as someone changes positions within a company can be difficult. Ownership hierarchy is also challenging. It is not uncommon to see scenarios where those closest to the data feel unable to make decisions because project ownership, and thus data ownership, is a tree spreading upwards and outwards, generating dependency chains that make it almost impossible for any one data owner to fully understand the impact of retention versus removal. This is where the ideas and tooling of data lakes can help, though they are not an end-goal solution. If organizations can be given tools to easily weigh the value of data, then as long as nearly everyone contributes to the scoring, irrelevant data will stand out like weeds and be plucked from the system, as the sketch below illustrates. This keeps the waters of the lake clear, but we have made a large assumption of participation.
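Here is a minimal sketch of what that crowd-sourced scoring could look like: everyone who touches a dataset votes on its value, and datasets nobody vouches for surface as removal candidates. The class names, vote scheme, and threshold are my own hypothetical illustrations, not any real tool’s API:

```python
# Hypothetical sketch of crowd-sourced data scoring: reviewers vote on each
# dataset, and anything nobody vouches for surfaces as a removal candidate.

from dataclasses import dataclass, field

@dataclass
class Dataset:
    name: str
    owner: str
    votes: list[int] = field(default_factory=list)  # e.g. -1, 0, +1 per reviewer

    @property
    def score(self) -> float:
        # Unscored data defaults to 0.0, so it is flagged rather than ignored.
        return sum(self.votes) / len(self.votes) if self.votes else 0.0

def prune_candidates(datasets: list[Dataset], threshold: float = 0.0) -> list[Dataset]:
    """Datasets scoring at or below the threshold stand out like weeds."""
    return [d for d in datasets if d.score <= threshold]

if __name__ == "__main__":
    lake = [
        Dataset("clickstream-2019", "alice", votes=[-1, -1, 0]),
        Dataset("billing-ledger", "bob", votes=[1, 1, 1]),
        Dataset("tmp-export-final-v2", "carol"),  # no votes: nobody vouches for it
    ]
    for d in prune_candidates(lake):
        print(f"flag for review: {d.name} (score={d.score:+.2f})")
```

Note the design choice that unscored data is flagged, not spared; that is exactly where the participation assumption bites, because a lake full of never-scored datasets flags everything.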

Disentangle, Disassociate

I believe Zhamak Dehghani was truly onto something with the release of Data Mesh. The principles provided by the data mesh are a proper iteration forward from the data lake model, and they start to force people to act like the teenager with a full hard drive: if you cannot provide your own data as a service to others, then it is not data worth having. Pushing management onto the owners goes a long way toward solving many data issues. The counterpoint, though, is that this still does not scale once you start to think societally and globally. While moving toward data mesh principles is what all companies should be doing, they also need to find ways to commoditize their data within the mesh so that the erosion of teams over time doesn’t create stale nodes in the fabric. Much like the issues facing data lakes, there is still a very large need to manage data mesh solutions culturally, which of course is almost impossible to sustain at scale indefinitely. Every organization has its ups and downs, and technical debt will remain after the lows. Companies are bought and sold, with teams going through reorgs and layoffs. The data mesh will suffer damage just the same as a data lake.
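As a rough illustration of “data as a product,” a mesh node might publish a contract like the following so that consumers can detect stale nodes mechanically. This is my own minimal sketch, not Dehghani’s specification; the field names and staleness window are assumptions:

```python
# Minimal, hypothetical sketch of a data-product contract in a mesh.
# Data mesh defines principles, not this exact schema.

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class DataProduct:
    name: str
    owner_team: str              # an accountable, living team, not a single person
    schema_version: str
    last_published: datetime     # freshness signal consumers can check

    def is_stale(self, max_age: timedelta = timedelta(days=30)) -> bool:
        """A product nobody has published to recently is a stale node in the fabric."""
        return datetime.now(timezone.utc) - self.last_published > max_age

if __name__ == "__main__":
    orders = DataProduct(
        name="orders.daily_summary",
        owner_team="commerce-analytics",
        schema_version="2.1.0",
        last_published=datetime(2023, 1, 1, tzinfo=timezone.utc),
    )
    print(f"{orders.name} stale: {orders.is_stale()}")
```

Tying ownership to a team rather than a person is the point: when reorgs and layoffs erode teams, the contract goes stale visibly instead of silently.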

A New Thought

There is hope on the horizon, though. The very nature of AI research has borne new ideas about how to distill data. Companies like Tesla are likely generating terabytes per minute, and using 100% of this data for their self-driving model training would be a waste of resources, not to mention nearly impossible to perform computationally in anyone’s lifetime. The solution is to feed training only the data that is meaningful, and to remove or compress the data that is less meaningful. The best part, and the radical promise sweeping the world in 2023, is that the model itself can tell you what is meaningful based upon activity during inference. If there is data in the training set that is never exercised when a vehicle is self-driving and running inference against the model, then it can likely be removed, or at least flagged for review, before the next training iteration. This approach can, and likely will, be applied to data at scale once the right AI model/utility is designed and widely available. In theory, we can let AI act like our own human brains: like a memory, if data is never accessed or otherwise proven valuable, it can be safely discarded, or at minimum distilled into a core essence of its original intent in case it ever needs to be recalled.
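As a toy illustration of that feedback loop (my own sketch; Tesla’s actual pipeline is not public), imagine counting how often production inference traffic lands near each training sample, then flagging the samples that are never hit before the next training run:

```python
# Toy sketch of inference-driven curation: count how often production inputs
# land near each training sample, then flag never-hit samples for removal review.
# Purely illustrative; real pipelines are far more sophisticated.

import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 8))       # stand-in training samples (feature vectors)
hits = np.zeros(len(train), dtype=int)   # how often inference lands near each sample

def record_inference(x: np.ndarray) -> None:
    """Credit the nearest training sample for this production input."""
    nearest = np.argmin(np.linalg.norm(train - x, axis=1))
    hits[nearest] += 1

# Simulate production traffic drawn from a narrower distribution than training.
for _ in range(5000):
    record_inference(rng.normal(scale=0.5, size=8))

flagged = np.flatnonzero(hits == 0)  # samples never exercised by inference
print(f"{len(flagged)} of {len(train)} samples flagged for removal review")
```

The key property is that the model’s real-world usage, not a human curator, generates the retention signal; the “flag for review” step is the safety valve before anything is actually deleted.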

Promises of Hope

The struggle with the AI hope is that it could just be another iteration of solutions that never resolve the core issues. While I believe we should all retain a healthy skepticism of the AI buzz, there is also a very real hope worth engaging with. Many of us would be much better stewards of our data if we just had more time in the day. How many times do we joke about wishing we could clone ourselves? That reality might be just around the corner for pockets of our mental capacities. Imagine telling an AI robot to clean up your data for you, and it asking for less and less guidance as it figures out what you really care to keep. These are no longer empty promises but a matter of when, not if. It is to this end that I even used ChatGPT to proofread this article, and I have made zero changes, though I’m providing the diff for you here.

Conclusion

To love data is no better than to love money: while it might drive you toward a form of success in life, it can only lead to ruin in the long run. We need to develop healthy habits around our data, and it seems there are more tools than ever to help us get there. AI will likely end up being a key tool and resource in this development, and potentially also the number one user of all the tools and philosophies that predate it. I encourage everyone to investigate their own problems, so as to better frame the conversation around the impending reality of their search tools being replaced by AI interfaces.

Written on June 5, 2023