Some days ago at work, young $PADAWAN-COLLEAGUE had the task of extracting all ids from a big CSV file (~1.5 million lines) that were used more than once. These ids are supposed to be unique, like PKs in a database; duplicates are an error, hence the task of finding them.
Surprise - despite my hint - highly valued $PADAWAN-COLLEAGUE started with Excel. These young guys... As expected, MS Office 365 Excel also has a limit on the number of rows it can handle. IIRC it was once(tm) about 64k rows; today it has grown to about 1 million rows.
$PADAWAN-COLLEAGUE then thought of splitting the CSV into chunks of less than 1 million lines - a bad idea, as you would miss duplicates spread across the chunks...
How to proceed? Import it into a database and run some subselect?
Time for the $JEDI (yep, that's me) to jump in.
Instead of awk or something like that, I chose the deadly weapon [J].
First, import the CSV file and isolate the column with the PKs:
require 'tables/dsv'
input=: (';';'"') readdsv_pdsv_ 'file.csv'   NB. our CSV file uses ';' as delimiter
input=: >{:"1 input                          NB. get last column, and get rid of the boxes
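In case the boxing business is unclear: readdsv returns a boxed table, and the last line above just grabs its last column and opens it. A small handmade stand-in (hypothetical values, not the real file) shows the idea:
tbl=: 3 2 $ 'x1';'id7';'x2';'id3';'x3';'id7'   NB. hypothetical 3x2 boxed table, ids in the last column
>{:"1 tbl                                      NB. last box of each row, opened: a character table with one id per row (id7, id3, id7)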
And now the real J magic: mark the first occurrence of each PK, negate the result, use it to copy out the duplicates, and show each duplicate just once:
~. (-. ~: input) # input NB. nub sieve input, negate it, copy only the duplicates, nub result
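For the non-enlightened, a quick trace on a hypothetical toy list (plain numbers instead of the real ids, but the verbs behave exactly the same):
ids=: 5 3 7 3 9 5 3     NB. toy stand-in for the PK column
~: ids                  NB. nub sieve: 1 1 1 0 1 0 0  (1 marks the first occurrence of each value)
-. ~: ids               NB. negated:   0 0 0 1 0 1 1  (1 marks every repeat)
(-. ~: ids) # ids       NB. copy the repeats: 3 5 3
~. (-. ~: ids) # ids    NB. nub of that: 3 5  - each duplicate reported once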
Voila - that's it :).
J - as always, an elegant solution with minimal typing effort, but maximal confusion for non-enlightened (less nerdy) colleagues.