The #rstats datawizard package (from the @easystats4u ecosystem) has two very useful functions to deal with duplicates.
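* data_duplicated(): extracts all duplicated rows, including the first occurrence, unlike duplicated() (which flags only the later occurrences) or dplyr::distinct() (which returns the unique rows instead)
* data_unique(): keeps one row per set of duplicates, by default the "best" one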
The output of data_duplicated() also includes an extra column reporting the number of missing values in each row, to help decide which duplicates to keep.
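A minimal sketch of what this looks like, assuming datawizard is installed. The data frame `df` and its columns are made up for illustration; only data_duplicated() and its `select` argument come from the package.

```r
library(datawizard)

# Hypothetical data: ids 1 and 3 each appear twice
df <- data.frame(
  id    = c(1, 2, 3, 1, 3),
  item1 = c(NA, 1, 1, 2, 3),
  item2 = c(NA, 1, 1, 2, 3)
)

# Every row whose id is duplicated, first occurrences included,
# together with the per-row count of missing values
dups <- data_duplicated(df, select = "id")
dups
```

Here the row for id 2 should be left out, since it is the only id that is not duplicated.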
data_unique() can keep the first, the last, or the "best" duplicate. The "best" duplicate (the default) is the row with the fewest missing values. In case of ties, it picks the first one, since the first attempt is the most likely to be valid and authentic, given practice effects on later attempts.
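The three options can be compared side by side in a short sketch, again assuming datawizard is installed. The data frame `df` is hypothetical; the `keep` values are the ones described above.

```r
library(datawizard)

# Hypothetical data: for id 1 the first row is entirely missing,
# for id 3 both rows are complete (a tie on missing values)
df <- data.frame(
  id    = c(1, 2, 3, 1, 3),
  item1 = c(NA, 1, 1, 2, 3),
  item2 = c(NA, 1, 1, 2, 3)
)

best  <- data_unique(df, select = "id")                  # keep = "best" is the default
first <- data_unique(df, select = "id", keep = "first")  # first occurrence per id
last  <- data_unique(df, select = "id", keep = "last")   # last occurrence per id
```

With this data, "best" should keep the complete row for id 1 rather than the all-missing first one, and fall back to the first row for id 3, where the two rows tie.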