Understanding the Data Science Pipeline and Its General Terms

Praffulla Dubey
Jul 3, 2019

What is Data Science?

Data science is the extraction of relevant insights from data. It draws on techniques from many fields, such as mathematics, machine learning, computer programming, statistical modeling, data engineering, visualization, pattern recognition and learning, and uncertainty modeling.

What is Data Science Pipeline?

A data science pipeline is a sequence of processing and analysis steps applied to data for a specific purpose. Pipelines are useful in production projects, and also when you expect to repeat the same type of business analysis.

Steps:-

Import Data: - The first step is to obtain the data: gather all of your available datasets (from the internet, internal or external databases, or third parties) and extract them into a usable format (.csv, JSON, XML, etc.), as in the sketch below.
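
For example, a minimal sketch with pandas (the file name sales.csv is just a placeholder for whatever dataset you have gathered):

```python
import pandas as pd

# Load a CSV file into a DataFrame (the file name is hypothetical)
df = pd.read_csv("sales.csv")

# The same idea works for other formats:
# df = pd.read_json("sales.json")
# df = pd.read_excel("sales.xlsx")

print(df.shape)   # number of rows and columns
print(df.head())  # first few records
```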

Cleaning of Data: - This phase of the pipeline is usually the most time consuming. Most real-world data comes with missing values or duplicate records, so it is important to clean the data and keep only the data that is actually required.
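
For example, with pandas (the file and column names below are hypothetical):

```python
import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical file from the import step

# Remove exact duplicate rows
df = df.drop_duplicates()

# Fill missing numeric values with the column mean, then drop rows that are still incomplete
df["price"] = df["price"].fillna(df["price"].mean())  # "price" is a hypothetical column
df = df.dropna()

# Keep only the columns that are actually required
df = df[["price", "quantity", "region"]]  # hypothetical columns
```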

Data Munging/Wrangling: - Data wrangling, or data munging, is the process of transforming and mapping data from one "raw" form into another to make it more appropriate and valuable for a variety of purposes, such as analytics.
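
For example, parsing dates, deriving a new column, and reshaping with pandas (the column names are again hypothetical):

```python
import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical file from the import step

# Parse dates and derive a new column
df["order_date"] = pd.to_datetime(df["order_date"])  # "order_date" is a hypothetical column
df["revenue"] = df["price"] * df["quantity"]         # derived from hypothetical columns

# Reshape: total revenue per region per month
monthly = (
    df.groupby([df["order_date"].dt.to_period("M"), "region"])["revenue"]
      .sum()
      .reset_index()
)
print(monthly.head())
```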

Data Visualization: - Data visualization is the discipline of trying to understand data by placing it in a visual context so that patterns, trends and correlations that might not otherwise be detected can be exposed.
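
For example, a quick histogram and scatter plot with pandas and matplotlib (using the same hypothetical columns as above):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical file

# Histogram: distribution of a single variable
df["price"].hist(bins=30)
plt.xlabel("price")
plt.title("Price distribution")
plt.show()

# Scatter plot: relationship between two variables
df.plot.scatter(x="quantity", y="price")
plt.show()
```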

Modeling the Data: - Data modeling is the process of producing descriptive diagrams showing the relationships between the various types of information stored in the database. These diagrams can be drawn as graphs such as scatter plots, bar plots, histograms, etc.
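
One simple way to sketch such relationships is a scatter-plot matrix plus a correlation table (still assuming the hypothetical columns from the earlier steps):

```python
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix

df = pd.read_csv("sales.csv")               # hypothetical file
df["revenue"] = df["price"] * df["quantity"]  # derived hypothetical column

# Scatter-plot matrix: pairwise relationships between numeric columns
scatter_matrix(df[["price", "quantity", "revenue"]], figsize=(8, 8))
plt.show()

# A correlation table shows the same relationships in numeric form
print(df[["price", "quantity", "revenue"]].corr())
```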

Cross Validation: - Cross validation is the process of training a model on one subset of the data and testing it on a different, held-out subset; repeating this over several splits gives a more reliable estimate of how the model will perform on unseen data.
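
For example, with scikit-learn, a hold-out split and a 5-fold cross validation around a simple linear regression (the feature and target columns are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

df = pd.read_csv("sales.csv")      # hypothetical file
X = df[["price", "quantity"]]      # hypothetical feature columns
y = df["demand"]                   # hypothetical target column

# Simple hold-out split: train on one part, test on another
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("hold-out R^2:", model.score(X_test, y_test))

# k-fold cross validation: repeat the train/test split over 5 folds
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("5-fold R^2 scores:", scores)
```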
