Duplicates Treatment

General information

This brick lets you inspect and remove duplicated rows.

Description

Brick Location

Bricks → Data Manipulation → Filter → Duplicates Treatment
Bricks → Analytics → Features Engineering → Duplicates Treatment

Brick Parameters

  • Outcome
    • Specifies the type of dataset to be returned. The following options are available:
    • Select Duplicates: only the duplicated rows
    • Denote Duplicates: the whole dataset with an additional boolean column duplicates_denoting that flags duplicated rows
    • Remove Duplicates: the dataset without duplicated rows
  • Remove Duplicates Strategy
    • Only available for the "Remove Duplicates" outcome.
      Specifies the strategy applied to delete duplicated rows. The following options are available:
    • Keep First: the first of the duplicated rows is kept
    • Keep Last: the last of the duplicated rows is kept
    • Remove All: all duplicated rows are removed
  • Duplicates Identifier
    • The column used as a unique key. Several columns can be chosen by clicking the '+' button in the brick settings. At least one column must be specified.
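The parameters above can be approximated in pandas. The sketch below is only an analogue, not the brick's actual implementation: the helper `treat_duplicates`, its argument names, and the short option strings are hypothetical, and it assumes that every row whose key occurs more than once counts as a duplicate for the Select and Denote outcomes.

```python
import pandas as pd


def treat_duplicates(df, identifier, outcome="remove", strategy="first"):
    """Pandas analogue of the brick (hypothetical helper, not the brick's code).

    identifier: list of key columns (Duplicates Identifier)
    outcome:    "select", "denote" or "remove"
    strategy:   "first", "last" or "all" (used only when outcome == "remove")
    """
    if not identifier:
        raise ValueError("At least one identifier column is required")

    # Assumption: every row whose key occurs more than once is a duplicate.
    is_dup = df.duplicated(subset=identifier, keep=False)

    if outcome == "select":
        # Select Duplicates: only the duplicated rows
        return df[is_dup]
    if outcome == "denote":
        # Denote Duplicates: whole dataset plus a boolean flag column
        return df.assign(duplicates_denoting=is_dup)
    if outcome == "remove":
        # Remove Duplicates: map the strategy onto pandas' `keep` argument
        keep = {"first": "first", "last": "last", "all": False}[strategy]
        return df.drop_duplicates(subset=identifier, keep=keep)
    raise ValueError(f"Unknown outcome: {outcome}")


# Tiny illustrative dataset: key "a" appears twice, key "b" once.
rows = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})
kept_first = treat_duplicates(rows, ["key"], "remove", "first")
kept_last = treat_duplicates(rows, ["key"], "remove", "last")
removed_all = treat_duplicates(rows, ["key"], "remove", "all")
denoted = treat_duplicates(rows, ["key"], "denote")
selected = treat_duplicates(rows, ["key"], "select")
```

In this sketch, Keep First and Keep Last leave one row per key, while Remove All drops every row whose key is repeated.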

Brick Inputs/Outputs

  • Inputs
    • The brick takes a dataset.
  • Outputs
    • The brick produces a new dataset. With the Denote Duplicates outcome, it contains an additional column duplicates_denoting.

Example of usage

Let's consider the Titanic dataset from a binary classification problem. General information about the dataset is given below:
  • passengerid (category) - ID of passenger
  • name (category) - Passenger's name
  • pclass (category) - Ticket class
  • sex (category) - Gender
  • age (numeric) - Age in years
  • sibsp (numeric) - Number of siblings / spouses aboard the Titanic
  • parch (category) - Number of parents / children aboard the Titanic
  • ticket (category) - Ticket number
  • fare (numeric) - Passenger fare
  • cabin (category) - Cabin number
  • embarked (category) - Port of Embarkation
  • survived (boolean) - True/False
Let's say we are interested only in unique combinations of pclass, sex, and survived. To get such a dataset, select Outcome: Remove Duplicates with the Keep First or Keep Last strategy, and Duplicates Identifier: pclass, sex, and survived. The resulting dataset keeps one row for each unique combination of these columns.
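The same selection can be sketched in pandas for comparison. The rows below are hypothetical sample values, not taken from the actual dataset, and `drop_duplicates` is used here as a stand-in for the brick rather than as a description of how it works internally.

```python
import pandas as pd

# Hypothetical sample rows (illustrative only, not the real dataset).
passengers = pd.DataFrame({
    "passengerid": ["1", "2", "3", "4", "5"],
    "pclass": ["3", "1", "3", "1", "3"],
    "sex": ["male", "female", "male", "female", "female"],
    "survived": [False, True, False, True, True],
})

# Duplicates Identifier: pclass, sex, and survived
key = ["pclass", "sex", "survived"]

# Outcome: Remove Duplicates, with either strategy
keep_first = passengers.drop_duplicates(subset=key, keep="first")
keep_last = passengers.drop_duplicates(subset=key, keep="last")

print(keep_first)
print(keep_last)
```

Both strategies return the same set of pclass, sex, and survived combinations; they differ only in which row represents each combination, which matters if the remaining columns (such as passengerid) are used later.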