When working with data, we often run into data quality problems. Noise and outliers, missing values, and duplicate values are all aspects of data quality. Noise is an unwanted modification of the original values; it distorts them and, if severe enough, can make the dataset unreliable. Missing values, in turn, can make a row of the dataset unusable. Finally, duplicate values, such as duplicate rows, make a dataset larger and slow down processing without adding any information. They only take up storage.
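To make these three problems concrete, here is a minimal sketch of how they might be counted with pandas. The toy data and the column names `age` and `city` are made up for illustration, and the 1.5 * IQR rule is just one common heuristic for flagging outliers, not the only one.

```python
import pandas as pd

# Hypothetical toy data: the column names and values are invented for illustration.
df = pd.DataFrame({
    "age": [28, 31, None, 31, 29, 400],
    "city": ["Oslo", "Lima", "Pune", "Lima", "Oslo", "Rome"],
})

# Missing values: number of nulls in each column.
missing_per_column = df.isna().sum()

# Duplicate values: number of rows that repeat an earlier row exactly.
duplicate_rows = int(df.duplicated().sum())

# Possible outliers in "age", flagged with the 1.5 * IQR rule (an assumed heuristic).
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)
n_outliers = int(outlier_mask.sum())

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print("possible outliers:", n_outliers)
```

Counting the problems first gives a rough idea of how much of the dataset is affected before anything is changed.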
Dealing with this kind of dataset can be challenging. Noise can be reduced by finding and removing the outliers. Missing values come in two types: null values, and non-numeric objects (for example, text) in a place where a number is expected. We can either remove them or replace them with the mean of the column. The mean is a reasonable stand-in for the original value because it is the average of that column. Duplicate values can simply be removed by dropping the duplicate rows. With pandas in Python, it is easy to drop all the duplicate rows in a DataFrame.
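The cleaning steps described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the toy data and the column names `name` and `score` are hypothetical, the 1.5 * IQR rule is an assumed way of finding outliers, and outliers are removed before the mean is computed so that they do not skew the replacement value.

```python
import pandas as pd

# Hypothetical toy data: "n/a" is text where a number is expected, None is a null.
raw = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Ben", "Dee", "Eli"],
    "score": [70, 80, "n/a", 80, None, 900],
})

clean = raw.copy()

# 1. Turn non-numeric objects into NaN so both kinds of missing value look the same.
clean["score"] = pd.to_numeric(clean["score"], errors="coerce")

# 2. Remove outliers with the 1.5 * IQR rule (an assumed heuristic).
#    Rows with a missing score are kept for now and filled in the next step.
q1, q3 = clean["score"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = clean["score"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = clean[in_range | clean["score"].isna()].copy()

# 3. Replace the remaining missing values with the column mean.
clean["score"] = clean["score"].fillna(clean["score"].mean())

# 4. Drop rows that exactly repeat an earlier row.
clean = clean.drop_duplicates().reset_index(drop=True)

print(clean)
```

If a missing value should be removed rather than replaced, `clean.dropna(subset=["score"])` can be used in place of step 3. By default `drop_duplicates()` compares whole rows; its `subset` parameter restricts the comparison to chosen columns.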