+ 1

Steps of cleaning & exploring a dataset

During the cleaning and exploration phase of a dataset, are there ONLY three steps (missing data, outliers, unnecessary data)? When detecting outliers, should we check every column to see whether it contains any, or do we just pick a random column and stop there? Also, for the last step (unnecessary data), how can we know whether a column is unnecessary? Is this step only about identifying unnecessary columns, or is there something else to do?

26th Dec 2021, 11:26 AM
Kaoutar Boulguid
3 Answers
+ 4
How you treat missing data and how you define outliers is your own choice, and it depends on context: How were the data obtained? What is the research question? However, you should always be able to justify your decisions and be transparent if you remove data. All in all, there is no recipe or one-size-fits-all method.
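As a rough illustration of inspecting every column rather than a random one, here is a minimal sketch in plain Python. The toy values, the `summarize` helper, and the 1.5 × IQR fences are all assumptions made for this example; the IQR rule is only one common convention, and whether a flagged value should actually be removed still depends on the data and the research question.

```python
from statistics import quantiles

# Hypothetical toy data: each column is a list, None marks a missing value.
columns = {
    "age": [22, 38, 26, None, 35, 54, 2, 27, 14, 80],
    "fare": [7.25, 71.28, 7.92, 53.1, 8.05, 51.86, 21.07, 11.13, 30.07, 512.33],
}

def summarize(values, k=1.5):
    """Count missing entries and flag values outside the k * IQR fences."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in present if v < low or v > high]
    return {"missing": len(values) - len(present), "outliers": outliers}

# Check every column and keep the report, so any later removal is documented.
report = {name: summarize(vals) for name, vals in columns.items()}
for name, info in report.items():
    print(name, info)
```

Keeping the report around instead of silently dropping rows is one way to stay transparent about what was flagged and why.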
26th Dec 2021, 11:45 AM
Lisa
+ 3
Usually there should be documentation telling us what the data in the dataset mean. We have to study this documentation to understand what we can use the data for. The research question is the reason we do a data analysis at all. We ask ourselves: What can we learn from these data? An example: Here on Sololearn, they use the Titanic dataset for demonstrations. Research questions could be:
* Which variables influenced the probability of surviving?
* Does a passenger's age influence whether they survived the accident?
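To show how a research question like these turns into a concrete first step, here is a small sketch in plain Python. The rows are made up for illustration and are NOT the real Titanic data; the `survival_rate_by` helper and the age cutoff of 18 are likewise assumptions.

```python
from collections import defaultdict

# Hypothetical rows (NOT the real Titanic data): (sex, age, survived)
rows = [
    ("female", 29, 1), ("male", 40, 0), ("female", 8, 1),
    ("male", 6, 1), ("male", 55, 0), ("female", 62, 0),
]

def survival_rate_by(rows, key):
    """Group rows with `key` and return the share of survivors per group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [survivors, total]
    for row in rows:
        group = key(row)
        counts[group][0] += row[2]
        counts[group][1] += 1
    return {group: s / n for group, (s, n) in counts.items()}

# Question 1: does the variable "sex" relate to survival?
by_sex = survival_rate_by(rows, key=lambda r: r[0])
# Question 2: does age relate to survival? (cutoff of 18 is an assumption)
by_age = survival_rate_by(rows, key=lambda r: "child" if r[1] < 18 else "adult")

print(by_sex)
print(by_age)
```

A difference between groups in such a table is only a starting point for exploration, not proof that the variable caused the outcome.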
26th Dec 2021, 2:57 PM
Lisa
+ 1
Thank you for your reply. So, from what I understand, at the outlier stage I am free to examine each column and decide whether or not it contains outliers. Could you please tell me how to develop a research question? On the internet we can find a dataset, but I don't know how to approach it (what do I have to process, and what questions should I ask?) - I'm a beginner in this field :)
26th Dec 2021, 2:47 PM
Kaoutar Boulguid