
Finding duplicate PKs in a big CSV file

The challenge

A few days ago at work, young $PADAWAN-COLLEAGUE had the task of extracting all ids that occur multiple times in a big CSV file (~1.5 million lines). These ids should be unique, like PKs in a database, so duplicates are an error - hence the task of finding them.

Surprise - despite my hint - highly valued $PADAWAN-COLLEAGUE started with Excel. These young guys... As expected, MS Office 365 Excel also has a limit on the number of lines it can handle. IIRC it was once(tm) about 64k lines; today this has increased to about 1 million lines.

$PADAWAN-COLLEAGUE thought of splitting the CSV into chunks of less than 1 million lines - a bad idea, as you would miss duplicates spread across different chunks...

How to proceed? Import it into a database and run some subselect?

Time for the $JEDI (yep, that's me) to jump in.

The salvation

Instead of awk or something like that, I chose the deadly weapon [J].

First, import the CSV file and isolate the column with the PKs:

require 'tables/dsv'
input=: (';';'"') readdsv_pdsv_ 'file.csv' NB. Our CSV file uses ';' as delimiters
input=: >{:"1 input NB. get last column, and get rid of the boxes        

And now, real J magic: mark the first occurrence of each PK, negate the result, use that to select the duplicates, and show each duplicate just once:

~. (-. ~: input) # input NB. nub sieve input, negate it, copy only the duplicates, nub result
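
To see what each step does, here is a small worked example with a hypothetical list of ids (the values are made up; the same expression works on the character table built above):

   ids =: ;: 'A7 B2 A7 C9 B2 A7'   NB. toy ids, boxed
   ~: ids                          NB. nub sieve: 1 marks the first occurrence of each id
1 1 0 1 0 0
   -. ~: ids                       NB. negated: 1 marks every repeated occurrence
0 0 1 0 1 1
   (-. ~: ids) # ids               NB. copy keeps only the repeats
┌──┬──┬──┐
│A7│B2│A7│
└──┴──┴──┘
   ~. (-. ~: ids) # ids            NB. nub: each offending id listed once
┌──┬──┐
│A7│B2│
└──┴──┘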

Voila - that's it :).

J - as always, an elegant solution, with minimal typing effort, but maximal confusion for non-enlightened (less nerdy) colleagues.

J

Related Links

Jsoftware - home base of J

Wikipedia about APL