General information
The brick lets you inspect, mark, or remove duplicated rows in a dataset.
Description
Brick Locations
Bricks → Analytics → Duplicates Treatment
Brick Parameters
- Outcome
  Specifies the type of dataset to be returned. The following options are available:
  - Select Duplicates - selects the rows that are duplicates.
  - Denote Duplicates - adds a boolean column duplicates_denoting that marks the duplicated rows.
  - Remove Duplicates - removes duplicated rows from the dataset using the chosen strategy.
- Remove Duplicates Strategy
  Specifies the strategy applied when deleting duplicated rows. Only available for the outcome "Remove Duplicates". The following options are available:
  - Keep First - the first of the duplicated rows is kept
  - Keep Last - the last of the duplicated rows is kept
  - Remove All - all duplicated rows are removed
- Duplicates Identifier
  The column used as a unique key: rows that share the same value in it are treated as duplicates. To use several columns, click the '+' button in the brick settings; to remove a column, click the '-' button. At least one column must be specified.
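For readers who think in code, the three removal strategies can be approximated with pandas. This is a minimal sketch, not the brick's actual implementation: the function name remove_duplicates, the strategy strings, and the sample data are illustrative assumptions, and it assumes that with several identifier columns two rows are duplicates only when they match on all of them.

```python
import pandas as pd

# Hypothetical mapping from the brick's strategy names to the pandas `keep` argument.
_KEEP = {"Keep First": "first", "Keep Last": "last", "Remove All": False}


def remove_duplicates(df, identifier, strategy="Keep First"):
    """Drop rows that repeat on the identifier columns (assumed semantics)."""
    if not identifier:
        raise ValueError("At least one identifier column must be specified")
    return df.drop_duplicates(subset=identifier, keep=_KEEP[strategy])


# Illustrative data, not taken from the documentation's dataset.
houses = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "Neighbourhood": ["CollgCr", "Veenker", "CollgCr", "Veenker"],
        "YrSold": [2008, 2007, 2008, 2006],
    }
)

first = remove_duplicates(houses, ["Neighbourhood"], "Keep First")      # keeps ids 1, 2
last = remove_duplicates(houses, ["Neighbourhood"], "Keep Last")        # keeps ids 3, 4
none_left = remove_duplicates(houses, ["Neighbourhood"], "Remove All")  # empty result
# With two identifier columns only rows 1 and 3 collide, so ids 1, 2, 4 remain.
two_keys = remove_duplicates(houses, ["Neighbourhood", "YrSold"], "Keep First")
```

Note that Remove All leaves nothing here because every Neighbourhood value occurs more than once. Adding YrSold as a second identifier narrows what counts as a duplicate.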
Brick Inputs/Outputs
- Inputs
The brick takes a dataset.
- Outputs
The brick produces the result as a new dataset. When the outcome is Denote Duplicates, the dataset contains an additional column duplicates_denoting.
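The shape of the output for the Denote Duplicates and Select Duplicates outcomes can be sketched in pandas as follows. The function names and sample data are hypothetical, and the sketch assumes that every row sharing its identifier values with another row counts as a duplicate, including the first occurrence. That matches the Select Duplicates example in this page, but it is an assumption for Denote Duplicates.

```python
import pandas as pd


def denote_duplicates(df, identifier):
    """Return a copy of df with a boolean duplicates_denoting column."""
    out = df.copy()  # the input dataset is left unchanged
    # keep=False flags every member of a duplicate group (assumed behaviour).
    out["duplicates_denoting"] = df.duplicated(subset=identifier, keep=False)
    return out


def select_duplicates(df, identifier):
    """Return only the rows that have at least one duplicate."""
    return df[df.duplicated(subset=identifier, keep=False)]


# Illustrative data, not taken from the documentation's dataset.
sales = pd.DataFrame(
    {"id": [1, 2, 3], "RoofMatl": ["CompShg", "WdShngl", "CompShg"]}
)

denoted = denote_duplicates(sales, ["RoofMatl"])   # same rows plus one extra column
selected = select_duplicates(sales, ["RoofMatl"])  # rows with id 1 and 3
```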
Example of usage
Let’s assume we have a house prices dataset. It consists of the following columns:
- id (category) - unique id of the house
- Neighbourhood (string) - name of the region where the house is located
- YearBuilt (int) - year when the house was built
- RoofMatl (string)
- GrLivArea (float) - living area of the house
- YrSold (int) - year when the house was sold
- SalePrice (float) - price of the house
data:image/s3,"s3://crabby-images/70411/70411567410a77d47ecdb0edf5757915447777a9" alt="notion image"
First, we use the Remove Duplicates option of the brick. We choose the column in which duplicates should be looked for, in our case Neighbourhood, and select the Keep First strategy.
data:image/s3,"s3://crabby-images/de3f7/de3f723df5c25be95841e643201c05e33147180b" alt="notion image"
Let’s look at the result:
data:image/s3,"s3://crabby-images/eb33f/eb33fb1f1b36a99a1273c5150fdfb9555d13072f" alt="notion image"
Second, let’s try the Select Duplicates option on the column Neighbourhood:
data:image/s3,"s3://crabby-images/e73ec/e73ec0b222ccf8e8e17acca3a372d45a24877033" alt="notion image"
The result is the same as the input dataset, because every row is a duplicate of some other row in the column Neighbourhood:
data:image/s3,"s3://crabby-images/2c5b4/2c5b432148dd2009106b5764c492553c71e1e172" alt="notion image"
Third, let’s move to the Denote Duplicates option. Here we take the id column to show the difference:
data:image/s3,"s3://crabby-images/86bb6/86bb61bda8b2b10850aaba12ee2f6f27c3f3d004" alt="notion image"
The result is straightforward: all values of the column id are unique, so no row is marked as a duplicate:
data:image/s3,"s3://crabby-images/c02e0/c02e0a8a8250a641b78c4597d6a0502c1f7b5ea5" alt="notion image"
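The last two observations can be reproduced on a toy table with pandas. This is only a sketch: the values below are made up and stand in for the house prices dataset, and pandas is used here as an analogy for the brick, not as a description of how it is implemented.

```python
import pandas as pd

# Toy stand-in for the house prices dataset; the values are made up.
houses = pd.DataFrame(
    {
        "id": [101, 102, 103, 104],
        "Neighbourhood": ["NAmes", "OldTown", "NAmes", "OldTown"],
        "SalePrice": [208500.0, 181500.0, 223500.0, 140000.0],
    }
)

# Every Neighbourhood value occurs at least twice, so selecting duplicates
# on that column returns all rows of the input.
selected = houses[houses.duplicated(subset=["Neighbourhood"], keep=False)]
same_as_input = selected.equals(houses)

# All id values are unique, so no row is flagged as a duplicate.
flags = houses.duplicated(subset=["id"], keep=False)
any_flagged = bool(flags.any())
```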