This brick lets you inspect and remove duplicated rows.
Bricks → Data Manipulation → Filter → Duplicates Treatment
Bricks → Analytics → Features Engineering → Duplicates Treatment
- Outcome
Specifies the type of dataset to be returned. The following options are available:
- Select Duplicates: only the duplicated rows
- Denote Duplicates: the whole dataset with an additional boolean column duplicates_denoting that flags duplicates
- Remove Duplicates: the dataset without duplicated rows
- Remove Duplicates Strategy
Specifies the strategy applied when deleting duplicated rows. Only available for the outcome "Remove Duplicates". The following options are available:
- Keep First: the first row of each group of duplicates is kept
- Keep Last: the last row of each group of duplicates is kept
- Remove All: all duplicated rows are removed
- Duplicates Identifier
The column used as a unique key. Several columns can be selected by clicking the '+' button in the brick settings. At least one column must be specified.
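To make the interplay of the three parameters concrete, here is a minimal pandas sketch. It is not the brick's actual implementation: the function name treat_duplicates, its argument names, and the option strings are hypothetical. It also assumes that every row of a duplicated group counts as a duplicate for Select Duplicates and Denote Duplicates, which the description above does not state explicitly.

```python
import pandas as pd


def treat_duplicates(df, identifier, outcome="remove", strategy="first"):
    """Hypothetical helper that mimics the brick's parameters with pandas."""
    if not identifier:
        raise ValueError("At least one identifier column must be specified.")

    # Assumption: every row that shares its key with another row is a duplicate.
    is_duplicate = df.duplicated(subset=identifier, keep=False)

    if outcome == "select":
        # Select Duplicates: only the duplicated rows.
        return df[is_duplicate]
    if outcome == "denote":
        # Denote Duplicates: whole dataset plus a boolean flag column.
        return df.assign(duplicates_denoting=is_duplicate)
    if outcome == "remove":
        # Remove Duplicates: map the strategy onto the pandas `keep` argument.
        keep = {"first": "first", "last": "last", "all": False}[strategy]
        return df.drop_duplicates(subset=identifier, keep=keep)
    raise ValueError(f"Unknown outcome: {outcome}")


# Tiny made-up dataset: the key "a" appears twice.
df = pd.DataFrame({"key": ["a", "b", "a", "c"], "value": [1, 2, 3, 4]})

selected = treat_duplicates(df, ["key"], outcome="select")
denoted = treat_duplicates(df, ["key"], outcome="denote")
kept_first = treat_duplicates(df, ["key"], outcome="remove", strategy="first")
kept_last = treat_duplicates(df, ["key"], outcome="remove", strategy="last")
removed_all = treat_duplicates(df, ["key"], outcome="remove", strategy="all")
```

In this sketch the strategy only matters for the "remove" outcome, mirroring the note that Remove Duplicates Strategy is available only for Remove Duplicates.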
The brick takes a dataset as input.
The brick returns the result as a new dataset. When the outcome is Denote Duplicates, the result contains an additional column duplicates_denoting.
Let's consider the dataset from the binary classification problem. General information about the dataset is given below:
- passengerid (category) - ID of passenger
- name (category) - Passenger's name
- pclass (category) - Ticket class
- sex (category) - Gender
- age (numeric) - Age in years
- sibsp (numeric) - Number of siblings / spouses aboard the Titanic
- parch (category) - Number of parents / children aboard the Titanic
- ticket (category) - Ticket number
- fare (numeric) - Passenger fare
- cabin (category) - Cabin number
- embarked (category) - Port of Embarkation
- survived (boolean) - True/False
Let's say we are interested only in unique combinations of pclass, sex, and survived. To get such a dataset, select Outcome: Remove Duplicates with the Keep First or Keep Last strategy and Duplicates Identifier: pclass, sex, and survived. The resulting dataset is shown below:
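For comparison, the same selection can be approximated outside the platform with pandas. The rows in this sketch are made up for illustration and only follow the shape of the dataset described above. They are not real passenger records, and the variable names are hypothetical.

```python
import pandas as pd

# Made-up rows in the shape of the dataset above (not real passenger records).
passengers = pd.DataFrame(
    {
        "passengerid": ["1", "2", "3", "4", "5"],
        "pclass": ["3", "1", "3", "1", "3"],
        "sex": ["male", "female", "male", "female", "female"],
        "survived": [False, True, False, True, True],
    }
)

# Columns playing the role of the Duplicates Identifier.
identifier = ["pclass", "sex", "survived"]

# Keep First: the first row of each (pclass, sex, survived) combination stays.
unique_first = passengers.drop_duplicates(subset=identifier, keep="first")

# Keep Last: the last row of each combination stays.
unique_last = passengers.drop_duplicates(subset=identifier, keep="last")
```

Both strategies yield the same set of pclass, sex, and survived combinations. They differ only in which row supplies the remaining columns, such as passengerid.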