Today, I am going to cover why I consider data science a team sport.
From data science use-case identification to the deployment of the models in production, so much goes into data science projects.
It’s really rare for one person to have all the skills needed to deliver a data science project end-to-end.
And it depends on the ecosystem you are working in: whether it’s a start-up or an enterprise, the size of the team, data maturity et cetera…
So what is it like to work on a data science project? What is the high-level process? What roles are involved, and who does what?
Let’s find out…
Let’s have a look at the high-level steps in a data science project.
Most of the time, you start by defining the problem statement, but often you may not have a problem at hand to solve.
In that case, you may need to first identify the use-cases for data science and may also need to qualify those use-cases.
If you have identified and qualified many use-cases, then you may also need to prioritize them based on their return-on-investment (ROI).
After identifying the use case, you define the problem statement, gather the business and domain context, and start building your understanding of the available data.
You design a high-level approach for the solution and discuss and define the key performance indicators (KPIs) with the business sponsors.
Most of the time, it is worthwhile to start with a prototype or proof of concept (POC) rather than diving into a full-fledged project.
Building a prototype is a way to assess the feasibility of the data science project before investing heavily; here you do all the steps required in a data science project, but on a smaller scale.
Once you have built a prototype and stakeholders give a go-ahead, you start the project formally.
You collect and explore the data, validate and clean it, and apply transformations to make the data ready to be consumed by the core data science tasks.
Then you build the necessary features, split the data into training, validation and test sets, and train, validate and tune the model.
The above steps are iterative, which means you will be continuously munging the data, building and modifying features, and training, validating and tuning the models until you get the required results.
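The iterative core of this loop can be sketched with scikit-learn. This is a minimal sketch on a synthetic dataset; the dataset, the 60/20/20 split, the model choice and the hyper-parameter values are illustrative assumptions, not a prescription for any particular project:

```python
# A minimal sketch of the core loop: prepare data, split it, then
# iteratively train and tune on the validation set. Assumes scikit-learn
# is installed; the synthetic dataset and parameter grid are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Stand-in for cleaned, transformed, feature-engineered data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split into train (60%), validation (20%) and test (20%) sets.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

# Iterate: train candidate models, compare them on the validation set.
best_model, best_val_acc = None, 0.0
for n_trees in (50, 100, 200):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    model.fit(X_train, y_train)
    val_acc = accuracy_score(y_val, model.predict(X_val))
    if val_acc > best_val_acc:
        best_model, best_val_acc = model, val_acc

# Touch the held-out test set only once, after tuning is done.
test_acc = accuracy_score(y_test, best_model.predict(X_test))
print(f"validation accuracy: {best_val_acc:.3f}, test accuracy: {test_acc:.3f}")
```

The key design point this sketch illustrates is that the test set stays untouched during the tuning loop, so the final score is an honest estimate of performance on unseen data.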
Once your model provides the required accuracy, you deploy it in an environment to get feedback from business stakeholders.
After getting positive feedback, you build the required dashboards for the business KPIs and make your data science solution live.
Once your model is in production, you need to monitor the data and the model’s performance over time for any degradation.
If model performance goes down, you do a root-cause analysis, replicate the issue in a different environment, and repeat the above steps to identify and resolve the issue.
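The monitoring step above can be sketched as a simple check that compares live accuracy over a recent window of predictions against the accuracy recorded at deployment time. This is a minimal sketch; the 5% tolerance and the window size are illustrative assumptions, and real monitoring setups usually also track input-data drift:

```python
# A minimal sketch of performance monitoring: keep a rolling window of
# scored predictions, compare windowed accuracy against the baseline
# recorded at deployment, and flag degradation beyond a tolerance.
from collections import deque

class ModelMonitor:
    def __init__(self, baseline_accuracy, tolerance=0.05, window_size=500):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.window = deque(maxlen=window_size)  # 1 = correct, 0 = wrong

    def record(self, prediction, actual):
        """Record one prediction once its ground-truth label arrives."""
        self.window.append(1 if prediction == actual else 0)

    def current_accuracy(self):
        return sum(self.window) / len(self.window) if self.window else None

    def degraded(self):
        """True when windowed accuracy drops more than `tolerance` below baseline."""
        acc = self.current_accuracy()
        return acc is not None and acc < self.baseline - self.tolerance

# Example: a model deployed at 90% accuracy now gets only 80% right.
monitor = ModelMonitor(baseline_accuracy=0.90)
for i in range(100):
    monitor.record(prediction=1, actual=1 if i % 5 else 0)  # 80% correct
print("degraded:", monitor.degraded())  # prints "degraded: True"
```

A check like this is typically the trigger for the root-cause analysis mentioned above; it tells you *that* performance dropped, not *why*.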
So this is the end-to-end process of a data science project.
Now, let’s have a look at the different roles in data science teams.
Please note that these roles may vary based on many factors.
For example, in a start-up, one or two people might be doing everything.
While in an enterprise, you may have even more specific roles than the ones I have mentioned here.
The business sponsor is the stakeholder who is funding the project. They are involved at the start and at the end of the project.
The data science leader manages the project and the team to deliver the project as per the business sponsor’s expectations.
The data engineer collects, processes and refines data as per the data scientist’s requirements, while the data scientist works on the core data science tasks of the project, like feature engineering, training, evaluating model performance et cetera.
The DevOps engineer looks after the deployment aspects of the data science project, like automating the preparation of the build and its deployment in an environment iteratively.
You may need a cloud engineer for your project if you are using cloud services, which are available in the form of IaaS, PaaS and SaaS.
Once you have deployed the model in production, you need a business intelligence (BI) engineer to build a dashboard where the business can look at the results and measure performance against the KPIs.
If you have many data science projects sharing and reusing data and infrastructure components, you also need a data architect to design that reuse in the most efficient way.
As mentioned at the start, these roles may vary based on many factors in different organizations.
So here we looked, at a high level, at who does what in a data science project.
I will cover more details of this aspect in an upcoming episode, where I plan to
provide a ‘process vs role mapping’ for a data science project.
You may notice here that data science projects require a variety of skills, which a single person rarely acquires unless they have been doing this for years.
As a beginner, I would suggest you build a T-shaped skill-set, which means building depth in one particular area, maybe core data science or core data engineering tasks,
and having breadth in all the other related areas we discussed earlier.
Why do I say so? Because I have seen enough data scientists sitting and waiting for data to be available in the required format before starting their work.
Some data scientists find it really difficult to work on the cloud; some struggle with writing an efficient pipeline or with version control.
Having just enough understanding of these areas can take you a long way, and if required, you can perform these tasks yourself rather than waiting for an expert.
In my view, it will help you get a job faster and will also make you quite effective in the team.
So, this is it for now.
I hope you found this article useful.
Let me know your views in the comments section.
If you liked this video, please subscribe to my channel to get an update whenever I upload new content.
Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.