Getting into data science


TLDR: People get into data science in a variety of ways, but tend to have an idealized image of a diverse discipline. To get into the field, first focus on the basics and figure out how you can create value with your new skills.


A Story

Executives sit around a table in a meeting room. Profits are down, and reliable customers have not been purchasing like they used to. Frantic, the executives call in their secret weapon: the data scientist. She walks into the room and calmly pulls up some visualizations of customer behavior. She first shows a scatterplot of customers, their volumes, and their profit margins. There isn’t anything to see. But then, by animating and transforming the data to display a third dimension, she exposes a business opportunity outside of the normal customer profile and estimates there are millions of dollars in revenue to be had if they simply advertise in this new market. The executives are floored: someone without their vast experience in the business was able to find something to save them purely by looking at numbers. She saves the company with her mathematical genius.


...


I think a lot of people want to move into data science with the thought of being the hero in the story above. While this could be a good goal, I think it’s also useful to temper it with a healthy dose of reality: in practice, situations like this are few and far between. Data science is still an interesting and rewarding field, but it helps to go in with realistic expectations. As one data point of what this could look like, here’s my (nontraditional) story of how I got into the field, along with the advice I would give to anyone looking at getting into data science.

My story

I have been interested in data science and adjacent fields since high school. An elective class in mathematical problem solving and a few books really solidified my interest in applying math to everything. Cutting-edge research on topics like genetic algorithms and simulating whole economies with computers seemed fundamental and relevant to all aspects of human life.

While my interest was certainly piqued, all of the people I read about in the field were PhD researchers, and committing to another decade of schooling as a high schooler seemed rather intense. Instead of going down the PhD path, I chose to pursue chemical engineering after some solid advice from relatives. The chemical engineering coursework was interesting, but I still found myself most drawn to the computational simulation and mathematical aspects of the problems.

For full-time work, I joined BASF for a rotational program, fully expecting to start off as an engineer and eventually pivot into business with an MBA. I hoped I would be able to do some coding side projects to learn more about the data science field while still working full time. Through internal networking, I found BASF’s North American data science group and proposed an 8 month rotation with them as a part of the rotational program. They agreed, and in the lead-up to the rotation I did a flurry of advance studying on the basics of machine learning in R to be ready for the job. The job was interesting and enjoyable, and allowed me to combine chemical engineering knowledge with the new machine learning skills I was developing. The 8 months of the rotation were intense and I learned tons on the job, and the group was impressed enough to offer me a permanent position afterwards.

Since then, I’ve continued to function as a data science practitioner particularly aimed at bringing machine learning to the chemical manufacturing space. I completed a Master of Information and Data Science from UC Berkeley while continuing to work, and these days I still scope and implement side projects to continue learning new things.


Some advice to get started

Along the way, I’ve learned a lot of things, many of them the hard way. Here’s some general advice I’d give someone beginning their journey into data science:

Don’t get hung up on being “the ideal data scientist.”

I think a lot of people (myself included) think of data science as a sexy thing and idealize situations like the story at the top of this post. The expert who comes in and can instantly analyze any data makes for a good story, but expecting to be an expert in every field, every dataset, and every domain is unrealistic. A better perspective than the idealized data scientist is that you are a problem solver who has a lot of data skills. To reinforce this perspective, I tend to refer to myself as a “data science practitioner” rather than a data scientist.

Learn how to learn and show off that you’ve learned something.

The data science field evolves at an incredibly rapid pace, and has too many facets for anyone to be up-to-date on all of them. Rather than trying to be an expert at everything, figure out how you can quickly learn new topics. If you can identify a topic, think of a project that someone who knew this topic would be able to do, and then undertake that project, you are showing that you understand all aspects of this quick learning process. For instance, if you were learning time series forecasting, you might decide that someone who knew time series ought to be able to take census data and forecast populations in major cities across the globe while taking into account a few different covariates. Actually carrying out this project would demonstrate that you know time series, but also that you know how to pick things up and apply them. All good skills!

Focus on the basics.

There’s no shortage of articles about the latest and greatest new neural network architecture or generative modeling technique. In many places in industry, you likely won’t touch these until you’re much more senior. A solid understanding of Python, SQL, R, and statistics is enough for you to provide immense value to an organization. It may not be as sexy to say “I made a logistic regression model in Python” as “I used PyTorch to create an encoder network with short term memory from data in a graph database”, but if both give the same amount of benefit... you probably wasted a lot of time on the extra complexity just to sound cool.
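To give a sense of how small “the basics” can be, here is a minimal sketch of a logistic regression in plain Python, fit with batch gradient descent. The data is made up purely for illustration (a hypothetical “did this customer churn?” example), the function names are my own, and in practice you would more likely reach for an existing statistics or machine learning library than write the loop yourself.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, b, x):
    """Predicted probability of the positive class for one row x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # gradient of log loss w.r.t. the score
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Made-up toy data (an assumption for illustration only):
# [orders last quarter, support tickets] -> churned (1) or stayed (0)
X = [[9, 0], [8, 1], [7, 0], [6, 1], [2, 3], [1, 4], [3, 5], [0, 4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(X, y)
p_loyal = predict_proba(w, b, [8, 0])    # frequent buyer, no tickets
p_at_risk = predict_proba(w, b, [1, 5])  # rare buyer, many tickets
print(f"weights={w}, bias={b}")
print(f"P(churn | loyal)={p_loyal:.3f}, P(churn | at risk)={p_at_risk:.3f}")
```

The whole model is two weights and a bias, and the signs of the weights can be read off and explained to a stakeholder directly, which is a big part of why simple models like this go so far.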

Expect to do a lot that isn’t machine learning.

There’s a lot more to data science than modeling, and you should expect to spend a good share of your time on everything around the model, such as gathering and cleaning data. Once you have the basics under your belt, I really recommend finding or creating your own data sources in lieu of using Kaggle, as you’ll much better simulate what the real world is like. More than likely, you’ll find projects that are more interesting to you in the process!

Never lose sight of the real-world value of what you’re doing.

As a data science practitioner, you’ll find that most business stakeholders neither care about the details of what you do nor understand them. What they care about is KPI improvements: are you creating new revenue? Are you saving costs? Never lose sight of this in your communications and in how you structure your work. Make sure what you do is clear, actionable, and tied to real business objectives.

Domain expertise is crucial to success.

Outside of research, data science rarely occurs in a vacuum. Understanding the domain where it’s being applied, whether that’s advertising, manufacturing, or something else, is essential to figuring out how to solve the problem at hand. If you have domain knowledge - awesome! Use it in combination with your data skills. If you don’t, make sure you can talk with domain experts, as they will be able to help you understand a lot more about the data and its context, which will inform your analytical decisions.