ChatGPT and You#

(aka Congratulations/I’m so sorry)

This reading is primarily about the use of Large Language Models (LLMs) to write code. There are many similar arguments related to the use of LLMs to write essays and reports, but that is not the focus of this reading.

You are beginning your Data Science education at a remarkable time. Data Science has always been a rapidly evolving field, but even by its own standards, the changes that have taken place over the last year with the emergence of highly capable Large Language Models (LLMs) like ChatGPT, GitHub Copilot, Llama-2, Bard, etc. are staggering. We are only just beginning to figure out how best to use these tools, and you are perfectly positioned to help shape that future (the “Congratulations!” in the subtitle above).

But because we are still learning how best to integrate these tools into our lives, I have to warn you that as students, you are also in a uniquely vulnerable position (the “I’m so sorry” subtitle). That’s because while it is clear that LLMs are here to stay, and will inevitably play a critical role in the daily life of the practicing data scientist, there is a significant danger that overreliance on LLMs early in your education may stunt your professional development.

The Catch-22#

The problem with LLMs is that when it comes to writing code for real data science work (the kind of stuff you’ll be doing in your advanced courses and in your career), LLMs are error-prone and require substantial supervision. Moreover, there is good reason to think that this is a problem that our current approach to developing LLMs will never be able to overcome. As a result, most practicing data scientists use LLMs like student research assistants—they let the LLMs write the first draft of code, then review and refine the code based on their own expertise. This is extremely useful, but it does mean that to use LLMs effectively, a data scientist still has to learn to program on their own.

And that’s where the danger lies with LLMs: while LLMs are error-prone when it comes to real-world data science projects, they’re shockingly good at the types of basic programming exercises that are the staple of introductory data science and programming courses. That creates a temptation to lean on LLMs heavily in introductory classes, which precludes the development of a strong understanding of the principles of programming and sets you up for problems down the road.

The Importance of Active Learning#

If there is anything that researchers have discovered about learning, it is that to learn something effectively, one must actively engage with the material. Passive lectures—in which a professor stands at the front of a room and talks to students—may be the norm in schools, but empirically they are actually one of the worst ways to teach. It is only by doing activities in which students get to test their understanding of a topic that real learning happens.

(Ironically, while lectures are not very effective at actually getting students to learn, they are effective at providing students with the illusion of understanding—the false sense that learning has occurred! Here’s one nice illustrative study of this phenomenon.)

And that’s why LLMs are potentially so problematic for students in your position—it’s not enough to do the readings about programming and then turn to an LLM as soon as you feel stuck; real learning requires you to spend time in that frustrated, uncomfortable stuck place trying to figure out what about your understanding of the material is inadequate to allow you to move forward.

OK, but isn’t this just the future? Calculators obviated long division after all#

Well… no. No, they didn’t. We still teach kids how to do addition, multiplication, and division despite the fact that computers and calculators can do it better, after all. Why? Because it helps them to develop number sense, which is a critical foundation for more sophisticated types of quantitative reasoning, advanced mathematics, statistics, etc.

The same goes for programming—yes, an LLM can easily write a for-loop or a function to find prime numbers, but we aren’t asking you to practice those skills just so you can write a for-loop; we’re asking you to practice those skills to help you develop a comfort with solving problems through algorithmic thinking.
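To make that concrete, here is a minimal sketch (in Python; the function name and limit are chosen purely for illustration) of exactly the kind of exercise an LLM can produce instantly—the value of writing it yourself lies in working through the loop logic, not in the output:

```python
# A typical introductory exercise: find prime numbers with a plain loop.
# The point of writing this yourself is practicing how to break a problem
# into steps, not the answer it prints.

def is_prime(n):
    """Return True if n is a prime number, False otherwise."""
    if n < 2:
        return False
    # Only need to check divisors up to the square root of n.
    for divisor in range(2, int(n**0.5) + 1):
        if n % divisor == 0:
            return False
    return True

# Collect the primes below 30 by checking each candidate in turn.
primes_below_30 = [n for n in range(2, 30) if is_prime(n)]
print(primes_below_30)  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```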

Reliance on LLMs Is Particularly Dangerous in Data Science#

As I mentioned before, part of the reason not to become overly dependent on LLMs early in your education is that doing so will prevent you from learning the skills you will later need to supervise them when doing more complex work. This is true in all domains of programming, but it’s especially true in Data Science.

In many contexts—like web design or app development—you can start to evaluate whether an LLM has successfully completed a task by looking at what it created. If an LLM creates a website for you, and the website looks correct and does everything you want it to do, then, well… the LLM did a pretty good job! You wouldn’t want to use this approach to launch a website for a big company—there are lots of corner cases that would be hard to verify by guess-and-check, and it’s hard to confirm your site is secure—but it’s at least a start.

But the same cannot be said for data science. As data scientists, we are in the business of generating new knowledge, which means that we don’t know what the output of our models should look like in advance. Sure, we have some sense of what reasonable outputs look like—an analysis that suggests smoking prevents cancer is going to raise a lot of red flags—but the reason the FDA forces drug companies to run clinical trials before approving drugs, and the reason Google runs A/B tests whenever it changes how search results are shown, is that we don’t know in advance which drugs will work or how users will respond.

As a result—unlike in web development or writing app widgets—we can’t evaluate whether an LLM has correctly written the analysis code we asked for by looking at the results. Rather, in data science, our confidence in our conclusions depends almost entirely on our confidence in how our results were generated, and that can only come from reading, testing, and understanding our own code.
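To give a hypothetical sense of what that kind of testing can look like in practice—the function, column names, and toy data below are invented for illustration—one simple habit is to run your analysis code on a tiny input whose answer you can verify by hand before trusting it on real data:

```python
import pandas as pd

# Hypothetical analysis step: compute the difference in mean outcomes
# between treated and control groups.
def average_treatment_effect(df):
    means = df.groupby("treated")["outcome"].mean()
    return means.loc[True] - means.loc[False]

# Sanity check on a tiny, hand-checkable dataset: the treated mean is 3,
# the control mean is 1, so the function should return 2.
toy = pd.DataFrame({
    "treated": [True, True, False, False],
    "outcome": [2, 4, 1, 1],
})
assert average_treatment_effect(toy) == 2
```

If a check like this fails, you know the problem is in your code rather than in your data—exactly the kind of confidence you cannot get by only inspecting an LLM’s output.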

Even In Software Engineering, “It’s Complicated”#

Even in software engineering, the picture is mixed. Most of the professional software engineers I know don’t use LLMs to write code. They find LLMs helpful for tasks like categorizing, organizing, or compiling notes from meetings—tasks for which it is easy to check that the LLM has performed well—but when it comes to real programming tasks, they find it takes longer to explain the task to an LLM than to just write the code themselves, given the need for significant design work and an understanding of how changes fit into large existing code bases. Moreover, many are also prohibited from using LLMs due to IP leakage risks. So even if you don’t see yourself as the kind of data scientist who plans to generate new knowledge, you should still be cautious about overreliance on LLMs.

So Where Does That Leave Us?#

As I said at the top of this reading, we’re still learning how best to fit LLMs into our lives. They have clearly emerged as a critical tool in the data scientist’s toolbox, and one I am very confident you will be using regularly by the time you graduate from the MIDS program.

However…

For all the reasons laid out here, I hope you can see why overreliance on LLMs early in your education may be extremely detrimental. Indeed, I’ve had several recent MIDS graduates say to me how glad they are that they learned to program just before these tools became available, not because it means they don’t need LLMs, but because they feel it has set them up to use them effectively.

In your first year, you may find that different professors have different rules about whether the use of LLMs is allowed. For some classes—like Data Management, where a lot of the “programming” you are doing just entails calling APIs for cloud services—LLMs may prove really helpful. But even where the use of LLMs is allowed by the professor and does not constitute an honor code violation, my strong suggestion is to err on the side of under-using these tools, at least for your first year at Duke.