How to Develop the Most Effective Data Science Workflow
By School of Professional Advancement | Date Monday, Sep 17th, 2018
When the Harvard Business Review article “Data Scientist: The Sexiest Job of the 21st Century”* was published in 2012, the practical applications of big data were just starting to take shape. The potential for data as a legitimate science seemed endless. It was, after all, only a few years earlier that the job title data scientist had first been used seriously, and it spread like wildfire from startups to conglomerates.
Fast forward to the present day and big data can be found in seemingly every industry you can think of. Harvard Business Review was right: it is a sexy job, and organizations want the right person to develop the most reliable data science workflow possible. A reliable workflow streamlines technological development and identifies the bumps and kinks along the way, which is why the skill is crucial and in demand. This is what the article was pointing to.
There is no single right data science workflow, but there is one (or more than one) that can work best for you, your team, and your goals. Your workflow might even vary slightly from project to project. You may have to rearrange some steps and processes depending on the situation. But it should always adhere to the same principles so that you keep collecting data and create high-quality products efficiently and consistently.
Here is how to develop the most effective data science workflow:
Step 1: Identify an Area of Improvement
It might be obvious that your organization has an issue that needs fixing. Or, you might happen upon it accidentally. No matter how you do it, identifying the problem is the foundation your data science workflow is built on. In 2008, Jonathan Goldman, one of LinkedIn’s earliest data scientists, identified an area for improvement by formulating and testing theories. Goldman ultimately developed the feature that allows users to find people they do not know directly but who are in the same network as people they do know. He found that by displaying potential connections as ads, the site was able to generate sizable web traffic. It didn’t take long for him to bring it to his CEO and implement it as a standard feature.
Step 2: Import and Analyze the Data
Now that you have a task and a goal, you need to set in motion the mechanisms to achieve it. A good place to start is to import and analyze your data. Data can be pulled from local files, aggregated by complex algorithms, or gathered through whatever method you prefer. Once it is collected, dive right in and explore. See how you can combine incomplete data sources, organize them into datasets and data frames, and clean the data to remove any glitches or inaccuracies. Make sure all data types are organized and formatted correctly. This will help you understand the data and identify unique trends, patterns, or values.
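As a rough illustration of the import-and-clean step, here is a minimal Python sketch that uses only the standard library. The CSV content, the column names (user_id, signup_date, sessions), and the helper names load_rows and clean_rows are all hypothetical and exist only for this example; a real project would more likely read from files or a database and might use a data frame library instead.

```python
import csv
import io
from datetime import date
from statistics import mean

# Hypothetical raw export, inlined so the example is self-contained.
RAW_CSV = """user_id,signup_date,sessions
1,2018-01-03,12
2,2018-01-04,
3,2018-01-04,7
3,2018-01-04,7
4,not-a-date,5
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def clean_rows(rows):
    """Convert each column to its proper type, dropping bad and duplicate rows."""
    cleaned = []
    seen = set()
    for row in rows:
        try:
            record = (
                int(row["user_id"]),
                date.fromisoformat(row["signup_date"]),
                int(row["sessions"]),
            )
        except ValueError:
            # Missing or malformed value: skip the row rather than guess.
            continue
        if record in seen:
            # Exact duplicate of a row we already kept.
            continue
        seen.add(record)
        cleaned.append(
            {"user_id": record[0], "signup_date": record[1], "sessions": record[2]}
        )
    return cleaned


rows = load_rows(RAW_CSV)
cleaned = clean_rows(rows)
average_sessions = mean(r["sessions"] for r in cleaned)
print(f"kept {len(cleaned)} of {len(rows)} rows; average sessions: {average_sessions}")
```

Dropping incomplete rows is only one possible policy. Depending on the project, filling in missing values or flagging them for review may be the better choice.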
Step 3: Build and Test Models
How are you going to put your data to work? Time to start modeling. Build a baseline model, then add features as the project progresses and becomes more complex. Tweak and optimize the model structure as necessary. Test the models and consider how to visualize the results; you’ll be better able to analyze the statistics and the performance, make improvements, and see ways to productionize your work.
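To make the idea of a baseline concrete, here is a small Python sketch under some stated assumptions: the numbers are invented, the baseline simply predicts the average of the training targets, and the "improved" model is a one-feature least-squares line. The function names (fit_line, mean_absolute_error) are illustrative and not taken from any particular library.

```python
from statistics import mean

# Invented data: one input feature and a roughly linear target.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8]

# Hold out the last two points so the models are tested on unseen data.
train_x, test_x = xs[:6], xs[6:]
train_y, test_y = ys[:6], ys[6:]


def fit_line(x_values, y_values):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    x_bar, y_bar = mean(x_values), mean(y_values)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_values, y_values))
    variance = sum((x - x_bar) ** 2 for x in x_values)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Baseline: always predict the training average, ignoring the feature.
baseline_prediction = mean(train_y)
baseline_mae = mean_absolute_error(test_y, [baseline_prediction] * len(test_y))

# Candidate model: a straight line fitted on the training data.
slope, intercept = fit_line(train_x, train_y)
model_mae = mean_absolute_error(test_y, [slope * x + intercept for x in test_x])

print(f"baseline MAE: {baseline_mae:.2f}, linear model MAE: {model_mae:.2f}")
```

The point of the comparison is that any added complexity should earn its place by beating the baseline on held-out data.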
Step 4: Productionize
You’ve tested your data and developed prototypes. You’ve logged them and stored them so that you can make more accurate predictions on any future modeling efforts. As you near the end of your workflow, don’t get complacent. Things can still go wrong. Be sure to consider any bugs and build in alerts to head off any pitfalls. Now you’re ready for production. Remember Jonathan Goldman? His observation about users’ networks became a standard feature and achieved a 30 percent higher click rate than other ads, generating millions of pageviews.
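One way to build in alerts is to compare each new batch of model output against a reference value recorded during testing. The Python sketch below is an assumption-laden example, not a prescribed method: the function name check_batch, the logger name model_monitor, and the 25 percent tolerance are all hypothetical choices for illustration.

```python
import logging
from statistics import mean

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("model_monitor")  # hypothetical logger name


def check_batch(batch, reference_mean, tolerance=0.25):
    """Return alert messages for a batch of predictions and log each one.

    An alert is raised when the batch is empty or when its mean differs
    from reference_mean by more than the given relative tolerance.
    """
    if reference_mean == 0:
        raise ValueError("reference_mean must be non-zero for a relative check")

    alerts = []
    if not batch:
        alerts.append("empty batch: no predictions were produced")
    else:
        batch_mean = mean(batch)
        relative_shift = abs(batch_mean - reference_mean) / abs(reference_mean)
        if relative_shift > tolerance:
            alerts.append(
                f"batch mean {batch_mean:.2f} differs from reference "
                f"{reference_mean:.2f} by {relative_shift:.0%}"
            )

    for message in alerts:
        logger.warning(message)
    return alerts


# Usage with invented numbers: the first batch is healthy, the second is not.
healthy_alerts = check_batch([10, 11, 9], reference_mean=10)
shifted_alerts = check_batch([20, 22, 21], reference_mean=10)
```

In a real deployment the warning would typically be routed to whatever notification channel the team already uses, and the tolerance would be tuned to the product.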
It’s important to remember that a data science workflow is not a one-size-fits-all solution. Think of these steps as large brush strokes. Between them, you need to paint finer lines specific to the project or product at hand.
More than half a decade removed from when Harvard Business Review hailed data science as the hottest job of the future, it’s still en fuego. Big data isn’t going anywhere. In fact, it’s getting bigger and more intricate. Dealing with the most complex data requires the right knowledge, practice, and IT training to implement the most effective workflows.
If this in-demand field is something you want to learn more about, request information today about Tulane University’s online Master of Professional Studies in IT Management. We can help you become a better data scientist or even provide you with the skills you’ll need to lead a team of IT professionals.