Infographic: 5 Pillars of Data Science

Implicit in the above statements is the second fundamental concept of data science: know your data! To know which data will be best for a project, and which features to select, we clearly must know our data. But I mean something more than that, something better called “data profiling.” In data profiling, we examine many aspects of the data: min/max values, aggregate values (such as mean, median, and sum), the list of distinct data values (if we are working with discrete data attributes), data histograms and distribution parameters (quartiles, deciles, and so on), physical units, scale factors, interdependencies (e.g., derived parameters such as C=B/A, where A, B, and C might all be included in the data set), missing values, NULL values, indices (used to identify the data object, but not a property of the object), and more.

If you are working with labeled data (for a classification, predictive analytics, or supervised learning project), then it is imperative to identify which data attribute is the class label or predicted variable.

Another aspect of the “know your data” concept is to focus on actionable data. Parsimonious data models are preferred, a principle sometimes referred to as Occam’s Razor, or Einstein’s rule: “Models should be made as simple as possible, but no simpler” (avoid the spherical horse!). By focusing on the data elements and output variables that inform, guide, and provide insights regarding your end goal, you stay on a path where distractions and the noise-to-signal ratio are reduced.
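To make the profiling checklist above a little more concrete, here is a minimal sketch in Python using only the standard library. It is not a complete profiler: the function names `profile_column` and `is_derived_ratio`, the toy data set, and the convention that `None` marks a missing value are all assumptions made for illustration. It covers a handful of the items listed above (min/max, aggregates, distinct values, quartiles, missing values, and the C=B/A interdependency check).

```python
import math
import statistics


def profile_column(values):
    """Summarize one numeric column; None marks a missing value (an assumption)."""
    present = [v for v in values if v is not None]
    profile = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    if present:
        profile.update(
            min=min(present),
            max=max(present),
            mean=statistics.fmean(present),
            median=statistics.median(present),
            sum=sum(present),
        )
    if len(present) >= 2:
        # With n=4, statistics.quantiles returns the three quartile cut points.
        profile["quartiles"] = statistics.quantiles(present, n=4)
    return profile


def is_derived_ratio(a, b, c, rel_tol=1e-9):
    """Check whether c equals b / a on every complete row (the C = B/A case)."""
    rows = [
        (x, y, z)
        for x, y, z in zip(a, b, c)
        if None not in (x, y, z) and x != 0
    ]
    # Require at least one usable row so an empty comparison is not reported as a match.
    return bool(rows) and all(
        math.isclose(z, y / x, rel_tol=rel_tol) for x, y, z in rows
    )


# Hypothetical toy data: C is derived from A and B, and B has one missing value.
data = {
    "A": [1.0, 2.0, 4.0, 5.0],
    "B": [2.0, 6.0, None, 10.0],
    "C": [2.0, 3.0, 1.5, 2.0],
}

for name, column in data.items():
    print(name, profile_column(column))

print("C == B/A:", is_derived_ratio(data["A"], data["B"], data["C"]))
```

Spotting a derived column like C early matters for the parsimony point as well: if C carries no information beyond A and B, including all three in a model adds redundancy rather than insight.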