Data Science lies at the heart of what is sweeping the tech world lately. You can barely turn on your television, your favorite podcast, or stand at the water cooler (are those still a thing?) without hearing about the latest machine learning algorithms that will soon know your morning routine better than you do.
Ads learn to target your online shopping sessions more seamlessly than ever, in an attempt to keep you from forgetting that thing you were thinking of buying. Global powerhouses are leveraging database marketing to try to reach more users than ever before, capitalize on profits, corner market trends, and even influence elections.
The ubiquitous presence of connected devices in our day-to-day lives means that we now have more data than ever before. Each step we take produces information, from our phone’s GPS tracking where we are and where we go, to our internet history, to what we bought at the grocery store. The world has always been filled with data, but factors like storage constraints and computational bottlenecks have limited how we can wield it. As the tech world evolves and storage and processing technology become more advanced, we can begin to take advantage of enormous datasets by analyzing, understanding, and ultimately predicting events and making decisions for the future.
All in all, data means power. Data is the power of leveraging experience and activity of billions of people and making decisions based on informed analysis. It’s the power of identifying patterns and predicting the future and understanding the world and our universe. Data Science harnesses this power.
Python is one of the most popular programming languages for Data Science, and arguably, the most popular one. It's beautiful, simple, and fun to write. It’s the perfect programming language to get started with and is considered to be the most popular introductory teaching language at top U.S. universities.
One of the greatest advantages of Python, specifically, is that it has a broad number of applications aside from Data Science. For example —
Popular frameworks like Django, Flask, and Tornado are polished, mature, and evolve with web standards. Pinterest, Lyft, and Instagram are all built on Python foundations.
Scripting and automation
It has all but replaced Bash and Perl, giving us the chance to use a more robust and expressive language for scripting and system automation. Python is the backbone of Ansible, one of the industry’s most popular system provisioning and maintenance tools.
Data cleansing and normalization
Those spreadsheets with endless holes, complex formulas, and frustrating inconsistencies are made easier with Pandas and NumPy. Python makes it easy to clean up a dirty dataset so you can spend your time on analysis rather than headaches.
Widely known libraries like PyQt are available to develop desktop applications with GUIs that you create, with others like Kivy and PyMob on the iOS and Android end.
Machine Learning and AI
Tools like scikit-learn, made specifically for data mining and analysis, and PyTorch, made for neural networks, make Python a great language for ML and AI. You can even use it with Google's popular TensorFlow library, which can be used, for example, to recognize objects within images.
Choosing a language that has a large number of applications is a definite advantage. If you jump from one space to the other, your experience with the language transfers with you. For example, we always recommend that our Data Science students learn about Web Development. Once you're satisfied with your data analysis and ready to publish your results, you'll probably want a snazzy web tool for that, be it a webpage, interactive chart, or API interface. Or you can transfer your knowledge to something even as simple as concatenating the first and last names of your company’s 10,000 clients (because Karen likes it that way and there’s nothing you can do about it). Throw a script together to read in the file, edit it, and save it, automating it all with Python.
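To make the "read in the file, edit it, and save it" idea concrete, here is a minimal sketch using only Python's standard `csv` module. The column names `first_name` and `last_name`, the new `full_name` column, and the sample rows are assumptions made up for illustration; a real client file will likely look different.

```python
import csv
import io


def add_full_name(rows):
    """Yield each row with a new 'full_name' column built from first and last names."""
    for row in rows:
        first = row["first_name"].strip()
        last = row["last_name"].strip()
        yield {**row, "full_name": f"{first} {last}".strip()}


def convert(in_file, out_file):
    """Read client rows from in_file, add 'full_name', and write them to out_file."""
    reader = csv.DictReader(in_file)
    fieldnames = list(reader.fieldnames) + ["full_name"]
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(add_full_name(reader))


# Tiny in-memory stand-in for the real clients file (hypothetical data).
sample = io.StringIO("first_name,last_name\nAda,Lovelace\n Grace ,Hopper\n")
result = io.StringIO()
convert(sample, result)
print(result.getvalue())
```

With real files, you would pass objects returned by `open(path, newline="")` instead of the `io.StringIO` stand-ins. The same loop handles 10 rows or 10,000.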
How To Get Started
As the Data Science field grows to accommodate everything we throw into its definition, it's difficult to find a clear and proven path to start down. That's why we're here, and will highlight 5 key steps that we learned from our experience to help you get started.
We have learned that Data Science is a broad term. It encompasses everything from data collection, cleaning, analysis, and visualization to actually making decisions and the sometimes grueling process of deciding whether that time and effort was worth it. It involves programming and math, as well as some level of domain expertise. If you're working on a new, transformative way to sequence human genomes, you’re going to need a foundation in biology.
It's important, then, to accept and appreciate how general a discipline Data Science is and to try not to feel overwhelmed by it. Start by distinguishing the different building blocks of Data Science (more on that below) and, at the beginning, pick up the ones that make the most sense to you. Gravitate toward the ones you suspect you might enjoy more, because that’s where you’ll likely find your niche.
Computer programming is a key activity for a data scientist. It's the discipline that glues the Data to the Science. Programming is merely a means, not an end. We use coding to tell computers what to do:
“Load this 200,000-row spreadsheet and drop any rows with blank data”
“Show me only the people within a 50 km radius of city center”
“Convert every invoice in these folders from rupees to US dollars”
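Each of those three instructions maps to a few lines of code. The sketch below uses only the standard library and toy in-memory data. The field names, the coordinates, and the exchange rate are all hypothetical. For a real 200,000-row spreadsheet, a library like Pandas would be the more usual tool.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def is_blank(value):
    """True for None or for strings that are empty or whitespace only."""
    return value is None or str(value).strip() == ""


def drop_blank_rows(rows):
    """Keep only rows in which every field has a value."""
    return [row for row in rows if not any(is_blank(v) for v in row.values())]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(people, center, radius_km=50):
    """Keep people whose coordinates lie within radius_km of center."""
    lat0, lon0 = center
    return [p for p in people if haversine_km(lat0, lon0, p["lat"], p["lon"]) <= radius_km]


def rupees_to_usd(amount, rate):
    """Convert an amount in rupees to US dollars at the given rate, rounded to cents."""
    return round(amount * rate, 2)


# Hypothetical rows; the third one has a blank name and should be dropped.
people = [
    {"name": "Ana", "lat": 0.1, "lon": 0.1},
    {"name": "Ben", "lat": 1.0, "lon": 1.0},
    {"name": "", "lat": 0.2, "lon": 0.2},
]
clean = drop_blank_rows(people)
nearby = within_radius(clean, center=(0.0, 0.0), radius_km=50)
print([p["name"] for p in nearby])
print(rupees_to_usd(1000, rate=0.012))  # rate is a placeholder, not a real quote
```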
Don't underestimate the importance of being good at coding. You might be a great scientist with the math background and domain knowledge, but if you're not good at coding, you'll end up with inefficient solutions for otherwise simple processing. We've had students show us scripts with thousands of lines, full of inefficiencies and nightmarish nested loops, resulting in hours of running time. Programming fundamentals made easy work of these monolithic scripts, and made it an easy decision for us to spread the word as far and wide as possible (who wants nightmares?).
Python makes it super simple to get started. It has a syntax that is both concise and readable (which is not always easy to come by) and it can be installed on virtually any platform. There are plenty of free books and resources to keep you on your toes, as well as our fully staffed and immersive Data Science and Web Development courses if you want to jump in head first.
We wouldn’t be revealing any secrets by saying that Data Science involves math. You've probably seen it from other posts and articles listing 100 books about linear algebra, statistics or calculus. We know first hand that it’s not hard to grow disconcerted, especially in the face of prerequisites of an unfamiliar space. But, just take a breath, unfurrow your brow, and don’t worry. Let’s look at two scenarios:
If you already know math, you're in a comfortable position. You should now focus on understanding the subset of math models required for Data Science and dig out your statistics and algebra books. It'll likely take a few hours to catch up and remember most of it.
In this position, what you should really pay attention to is the coding part. Most math algorithms and models are already implemented in libraries, most of them mature and with excellent documentation (see the succinct Augmented Dickey-Fuller unit root test). Being already adept at mathematics gives you an edge: you'll understand what you're working on much more quickly. But at the end of the day, you'll end up using the models already written.
If you don't already know math, good news! We’re pleased to report that you can learn everything on your own. The key is to plan and organize your career path so you don't get bored and demoralized. Our advice here is to start by doing, and by using the tools.
Go find a dataset that piques your interest, something that would be fun to work on during weekends or long night hours, and start playing around with it. It can be about Football, Basketball or The Marvel Universe. Once you get a dataset that intrigues you, start working on it and analyzing it. You are the one asking the questions:
“What's the average number of shots taken by defenders in the Premier League?”
“Who was the midfielder with the most goals this last season?”
“Could Loki and Thor have been friends if Odin didn’t take him in?”
You'll inevitably need to learn about statistics, like means and medians, as well as other mathematical concepts, as you become inevitably fluent in Stack Overflow. Once you understand how to use these tools, which you’ll learn are what even the most complex algorithms and models are built upon, you'll have a better feel for how they are implemented and you'll be closer to understanding their internals.
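As a rough illustration of asking those questions in code, the sketch below uses Python's built-in `statistics` module on a handful of invented rows. The players and numbers are made up for the example and are not real league data.

```python
from statistics import mean, median

# Invented rows for illustration only; not real league statistics.
players = [
    {"name": "Player A", "position": "defender", "shots": 10, "goals": 1},
    {"name": "Player B", "position": "defender", "shots": 20, "goals": 2},
    {"name": "Player C", "position": "midfielder", "shots": 40, "goals": 9},
    {"name": "Player D", "position": "midfielder", "shots": 55, "goals": 12},
    {"name": "Player E", "position": "defender", "shots": 6, "goals": 0},
]

# "What's the average number of shots taken by defenders?"
defender_shots = [p["shots"] for p in players if p["position"] == "defender"]
avg_defender_shots = mean(defender_shots)
median_defender_shots = median(defender_shots)

# "Who was the midfielder with the most goals?"
midfielders = [p for p in players if p["position"] == "midfielder"]
top_midfielder = max(midfielders, key=lambda p: p["goals"])

print(avg_defender_shots, median_defender_shots, top_midfielder["name"])
```

Notice that the mean and the median differ even on three values. That is exactly the kind of thing you start to notice, and then look up, once you are working with data you care about.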
If you want to learn how to drive, will you start with Physics 101 and the internals of internal combustion engines? Probably not. Generally, you'll start by using the car, by driving it. To learn more about how it works so that you can work toward its peak performance, you'll naturally explore more about the internals and how it works under the hood, but not right away, and that’s okay.
We linked a few datasets in our previous section, but where do they come from? Learning how to get data and clean it is a fundamental task for every data scientist, and it's arguably (and sometimes, unfortunately) the activity that takes up most of our time.
When looking for the right stuff, you'll usually find data that is either structured or unstructured––
Structured data is formatted and minimally sanitized, usually in text formats like CSV, JSON, and XML. A good example is the dataset used by Allen Downey in his great (and free) book Think Stats. It comes from the National Survey of Family Growth, by the U.S. Centers for Disease Control and Prevention, and as you can imagine, it is properly tabulated. There are still inconsistencies and errors, as with any large dataset, but it's at least formatted in a way that is simple to import and manipulate.
Unstructured data is usually scraped by data scientists themselves, and it can get messy. It might come from web scraping (with tools like Python’s Scrapy library), web services (like API endpoints), or anything else (importing Excel spreadsheets, performing OCR on photographs, or parsing PDFs, oh my). This is the type of data that we want to avoid, as it’s significantly more complicated to retrieve, process, sanitize, and manipulate, especially at scale. But sometimes there's no way around it and we make do.
When it comes to retrieving and cleaning data, the process involves a lot of programming and less domain knowledge. It's all about building parsers, efficient pipelines, and writing tests to ensure they work. Learning to code and to use the proper tools for the job at hand like Scrapy, Pandas, and Requests, is instrumental.
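Here is a small sketch of what "a parser, a pipeline, and a test" can look like in practice. It uses only the standard library rather than Scrapy, Pandas, or Requests. The scraped records and the `price` field are hypothetical, chosen only to show the shape of the work.

```python
def parse_price(raw):
    """Turn a scraped price string such as ' $1,299.50 ' into a float, or None if it can't be parsed."""
    cleaned = raw.strip().replace("$", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def clean_records(records):
    """Pipeline step: parse each price and drop records whose price can't be parsed."""
    for record in records:
        price = parse_price(record.get("price") or "")
        if price is not None:
            yield {**record, "price": price}


def test_parse_price():
    """A tiny test that pins down the parser's behavior on good and bad input."""
    assert parse_price(" $1,299.50 ") == 1299.5
    assert parse_price("N/A") is None
    assert parse_price("") is None


# Hypothetical scraped rows, as messy as scraped data tends to be.
scraped = [
    {"item": "laptop", "price": " $1,299.50 "},
    {"item": "mouse", "price": "N/A"},
    {"item": "cable", "price": None},
    {"item": "monitor", "price": "249"},
]
cleaned = list(clean_records(scraped))
test_parse_price()
print(cleaned)
```

Keeping the parser as its own small function is a deliberate choice. It makes the messy part easy to test in isolation, and the pipeline step stays readable.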
The way you present your reports and analyses can make all the difference. Creating compelling and attractive visualizations is sometimes the difference between a successful presentation and a disastrous one. Many think that just having found good numbers and solid conclusions is enough. But if it’s not presentable, it’s likely useless. There are three key points to consider when building your presentations and reports —
Your reports will be based on a few key discoveries or conclusions, such as “we've identified that senior customers are more likely to visit the store on Tuesdays”. Focus on building your reports around that conclusion; everything else is noise. I'm personally keen on concise reports that focus only on the one finding that they're trying to communicate. If you want or need to build long reports, try to structure them so the most important concepts come first. If your reader has only 5 minutes and just glances over the first page, you've at least communicated what’s most important.
It might be a box plot, a scatter plot, or maybe both apply. It's important to pick the right tools to visualize your data. Selecting the incorrect chart can make the entire analysis useless. A recommendation here is to vet your visualizations with multiple people, without giving them much context. Then, see which ones they can understand with the least explanation; those are the most intuitive. Whenever you read articles or reports from others, stop at any visualization, analyze what other ways could have been employed to communicate the same information, and think about how you would have conveyed it differently.
Creating beautiful content is important for building a better bond with your audience. Above, we emphasized the type of visual used to convey your data, while appeal refers to aesthetics and feel. It’s like walking onto the stage at a tech conference wearing a tutu: they shouldn’t judge, but they will, and it can detract from your bottom line. Using the right colors, fonts, and text sizes is just as important for communicating your goal. Basic design rules apply. Fortunately, Python tools like matplotlib, seaborn, and bokeh allow you to customize all your visualizations to create elegant presentations.
We can graph a correlation between wasting time and regret. We know that the whole Data Science universe can look intimidating, with all the math, the programming, the new buzzwords. But if you just get started with one thing, you'll find your way through it. We recommend starting with programming, as it's the most interactive and there’s lots of yummy instant satisfaction. You can start playing right now, creating visualizations, running analyses, playing with the Star Wars API. In the process, you'll dig deeper into other topics like more advanced programming and functional mathematics, while ultimately enriching your field with new discoveries. Next, you’ll be automating your boss’s job away, and maybe all of our bosses’.