Code == Data
On the inseparability of code and data
In my current role, I’m in a unique position: I’m steering the company’s AI strategy. As part of that role, I find myself teaching a large organization of engineers about AI: how it works, how to use it, how to build it.
A recurring theme that comes up over and over is how important the data is to AI. It’s something every engineer “knows,” but not something they tend to have a practical sense for. It comes up in small ways, like deciding how much data to collect from a potential client when building a POC: “We don’t want to ask for too much, so a couple rows should be good, right?” Wrong. That’s not enough to train on, let alone set anything aside for an evaluation set. It’s one thing to “know” AI isn’t deterministic; it’s another to systematically build a solution that uses that to your advantage.
But I’ve also seen the other side of the fence, where data practitioners would rather live in Excel sheets, crunching stats and creating data visualizations. Code seems to be a necessary evil, for writing a few SQL queries or EDA notebooks, but not something to take seriously. They’d rather be reading arXiv than a teammate’s PR, so LGTM and move on.
There seems to be this great divide between programmers and data peeps. They run in different circles, go to different conferences, and work on different teams. One writes software that just happens to handle data, and the other just happens to write software so they can handle data. They talk about the same problems, same solutions, different jargon. Two groups doing the same job, but not.
I’ve had to straddle this great divide my entire career; here are some of my musings.
The Great Divide
One of my most viral posts of my tenure was just four words and a symbol: “Clean Code > Clean Data”. At the time the data echo chamber was filled with post after post, and meme after meme, of the same didactic platitude: “Garbage in, garbage out”. Everyone (and by that I mean Data Scientists and Analysts—executives of course couldn’t care less) was obsessed with clean data, but couldn’t be convinced to spend two more minutes making their code clean.
Ironically, this was the same code they used to clean the data. That often led to stupid mistakes in their data pipelines, like forgetting they hard-coded a variable during troubleshooting, or using column indexes instead of column names. Down the road these mistakes add up, and the very processes meant to clean the data left it worse off.
To be honest, I didn’t think about the post too hard. My reasoning was simple: code makes data, ergo dirty code makes dirty data. This thought ended up being overwhelmingly controversial. So many engineers who had gotten stuck trying to productionize a data scientist’s notebook felt validated, and so many data scientists felt like I had insulted their life’s work.
The Chicken or the Egg?
When I was a research assistant back in college, we were studying ash build-up on turbine blades. But that’s not important; what is important is that we had a very nice color pixel camera that we would point at the turbine blade. As metal heats up, it turns red. It becomes “red hot,” if you’ve heard the phrase. Turns out, if you measure the intensity of the red pixels, you can figure out the surface temperature of a metal that is too hot to measure with traditional methods.
This was before the term “data science” was even coined, but linear regression still existed. And so like a true MacGyver, equipped only with some equations on light waves from an old physics book, some data I manually collected with the camera, and some janky Matlab code, I was able to create a model. The model converted pixels to temperature and improved our surface temperature accuracy by 75%, which of course had a huge impact on our research.
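The core of that model was just a linear fit. The original was janky Matlab; here is a Python sketch of the same idea. Every calibration number below is invented for illustration, not the real measurements:

```python
# Sketch: fit a linear model mapping red-pixel intensity to surface
# temperature. The calibration points are made up; the real work also
# leaned on physics equations for light emission.

def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b, computed in closed form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Hypothetical calibration points: (red intensity 0-255, temperature in C)
intensity = [120, 150, 180, 210, 240]
temp_c = [600, 700, 800, 900, 1000]

a, b = fit_linear(intensity, temp_c)

def pixel_to_temp(red):
    """Convert a red-channel pixel value to an estimated temperature."""
    return a * red + b
```

Collect more calibration points and the fit improves; improve the fit and every future measurement improves. That’s the loop described above.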
I was making data from code, and if I improved that code I improved my data. And the best way to improve my code was to collect more data. This is a somewhat complicated example, but even if I were collecting temperatures from a thermocouple[1], the thermocouple has to be calibrated at some point. Code has to determine how to convert the voltage difference into a temperature. It might happen at a low level, possibly directly programmed on the hardware chip itself, but it’s still code that does it. Of course, calibration depends on accurately collecting data on the different electrical properties of the different metals in the thermocouple, so it’s turtles all the way down.
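To make the thermocouple point concrete, here is a deliberately simplified sketch of that conversion code. Real firmware uses polynomial coefficient tables (e.g. the NIST reference tables) and proper cold-junction compensation; the ~41 µV/°C figure is the commonly cited approximate sensitivity of a type K thermocouple, used here purely for illustration:

```python
# Illustrative only: a first-order voltage-to-temperature conversion for
# a type K thermocouple. Real calibration code uses polynomial fits over
# coefficient tables, not a single linear factor.

SEEBECK_UV_PER_C = 41.0  # approximate type K sensitivity, microvolts per C

def voltage_to_temp_c(voltage_uv: float, cold_junction_c: float = 25.0) -> float:
    """Convert a measured thermocouple voltage (in microvolts) to Celsius."""
    # The thermocouple measures the difference from the cold junction,
    # so add the cold-junction temperature back in.
    return cold_junction_c + voltage_uv / SEEBECK_UV_PER_C
```

Even this toy version shows the dependency: the constant in the code came from measured data, and every temperature the code emits is data shaped by that constant.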
I’m thankful for this unique experience that happened before my career even really got started. It taught me that even at the source, data and code are already intermixed. Data makes your code better, and your code makes your data better.
Data comes from Code
This is a major lesson I find most people in data never learn. When talking to many in the industry I have often gotten the feeling that they don’t understand where the data came from, like data just appears out of thin air, magically, into their databases for them to query. In the modern world, data comes from code.
Sure, there are unique examples where an analyst could write a survey, interview people individually, pencil the results down onto paper, then pull out graph paper to manually draw some charts and even use their trusty TI-83 calculator to run some t-tests and ANOVA analyses. It wasn’t that long ago that I remember running each of these calculations by hand in college. But I’ve never seen this in business. Only in academia and government, where fax machines still exist, would you find such antiquated methods still in use.
In every other example of data collection, you’ll find code in the creation process. Website interactions like user clicks and shopping cart orders obviously come from code. Financial transactions, interest payments, and anything else dealing with money: that data came from code[2]. Satellite images, security cameras, selfies: code. Every song on Spotify, recordings in your voicemail, audiobooks: code. The entire internet: you guessed it, code. In the modern age, data is entered into a computer at some point, and that means code dictates the data entry process.
Code comes from Data
What’s less obvious is that code comes from data.
Think about it. Why do people write code in the first place? Nearly every piece of software exists to use, move, display, or transform data. A website is just a system for presenting data—text, images, and videos—to users. Productivity tools like spreadsheets, word processors, and dashboards exist to organize, visualize, and manipulate data. Even video games, which might seem like pure entertainment, are elaborate engines that take input data from players via a controller and translate it into visual and auditory feedback.
The pattern holds across nearly every domain. Financial systems exist to process transactions—streams of numerical data. Medical software manages patient records and diagnostic data. Navigation apps combine GPS data, map data, and traffic data to produce the optimal route. AI models themselves are code that literally learns from data and without training data, there’s nothing to learn, no intelligence to generate. Every modern system, from social media algorithms to self-driving cars, begins with data and builds the code around it.
Most software engineers don’t consciously think in these terms, but the truth is: they are builders of data systems. Every program they write depends entirely on the data it handles. Data gives code its structure, its source, its meaning. Without data, code has no purpose. Without code, data has no movement. The two are inseparable, but data is the reason code exists at all.
Code is Data is Code
But it’s more than that: code is literally data now. I mean, it always was, but it was easy for companies other than GitHub and GitLab to ignore before.
Anyone who’s used tools like Cursor and Claude Code will find this impossible to deny. How do you get better results from these agents? Well, you write better code, better documentation, better Agents.md files. The cleaner you keep your code, the further you can push these tools. They are less likely to waste context, forget key information, or spin their wheels in death loops.
Code is literally the data we are training these models on, and yes, garbage in, garbage out. So the better we make our code—the data—the better trained our agents will become. The better trained the agents, the better the code they help us write. Better results mean better data, which is to say better code. And this feedback loop is accelerating.
Data is code is data.
Why does this all matter?
Generally, data can be separated into two camps: quantitative and qualitative. The temperature of the turbine blades from earlier is a quantitative number; it’s something we can apply mathematical equations to and derive new numbers and insights from, like how I turned RGB pixels into a temperature in °C. This transformative power has always made quantitative data inherently more usable.

Qualitative data, like text, pictures, or experiences, has always been just as valuable, but it’s harder to get insights out of it. You can read 100 movie reviews and come away with a good vibe for whether people like or dislike a movie, but two different people reading those reviews will come away with different vibes. Some might be hypersensitive to negative reviews, focusing only on them, while others may naturally gravitate to the positive reviews, since haters are gonna hate.
Early in my career I spent a lot of time thinking about this problem. Working in a semiconductor defect department, we had thousands of SEM images of tiny specks a few nanometers big. The shape of those particles told us a lot about where a defect might have originated. Was it flaky, stringy, or round? Was it rough or smooth? We could measure defect count and size, even run EDX or XPS to determine what elements it was made of. But these other characteristics, like smoothness? How do we quantify that?
This isn’t a small problem. In fact, it wasn’t just hard, it was impossible. But with modern AI it’s not only possible, it’s easy. I often don’t even need to train a custom model; I can start getting useful insights from qualitative data with a simple prompt: “Please rate this paragraph for toxicity on a scale of 1-10.” This is really the reason for the AI revolution we are seeing right now. We might be in a bubble, and the hype is sky high, but the capability to turn qualitative data into quantitative data is miraculous.
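In practice that looks something like the sketch below. The model call itself is left out (any chat-style LLM API would do); the point is that a prompt goes in and a plain number comes out, ready to average, chart, or threshold. The reply string used here is a made-up example, not real model output:

```python
import re

def build_prompt(text: str) -> str:
    """Wrap a piece of qualitative data in a scoring prompt."""
    return (
        "Please rate this paragraph for toxicity on a scale of 1-10. "
        "Reply with only the number.\n\n" + text
    )

def parse_score(reply: str) -> int:
    """Pull the first integer out of the model's reply, clamped to 1-10."""
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"no score found in reply: {reply!r}")
    return min(10, max(1, int(match.group())))

# Hypothetical reply; in reality this string comes back from the model.
score = parse_score("7")
```

The clamping and parsing matter more than they look: models don’t always follow instructions, so the code around the prompt is what turns a fuzzy reply into usable quantitative data.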
Code is qualitative data. How do you measure things like the cleanliness of code? The verbosity of code? The spaghettiness of code? The best we could do was determine whether it worked or not. And by that I mean: did it compile, did the unit tests pass, did the integration tests (the happy path) work? Being able to turn code into quantitative data that we can manipulate and transform is the miracle.
So yeah, Data == Code.
[1] Thermocouples are temperature sensors consisting of two dissimilar metal wires joined at one end, which generate a small voltage proportional to the temperature difference between the joined end (hot junction) and the other ends. They are widely used in industrial applications because they are rugged, low-cost, and can operate over a wide range of temperatures.
[2] I mean, the whole reason some people use cash is because it’s dataless.