💾 Archived View for station.martinrue.com › kevinsan › 64a6d5d2ad364e4d8f307514c87601da captured on 2022-07-16 at 18:51:56. Gemini links have been rewritten to link to archived content

-=-=-=-=-=-=-

👽 kevinsan

I need a way to efficiently filter out common dictionary words from a list of strings, ideally using existing CLI tools or libraries.

10 months ago

5 Replies

👽 nfc

Might want to filter out words which are not nouns, after removing stop words. nltk can give you the part of speech. · 10 months ago

👽 akkartik

The keyword for this is "stop words". You can find lists of stop words online, and then filter them out using Python or `grep -vf`. · 10 months ago
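
A minimal sketch of the Python route, assuming a stop-word list is already at hand. The `STOP_WORDS` set and the `interesting_words` helper are hypothetical names used for illustration; a real list would be loaded from one of the files found online.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
# In practice this would be read from a downloaded stop-word file.
STOP_WORDS = {"a", "an", "and", "for", "how", "in", "of", "the", "to", "with"}


def interesting_words(title, stop_words=STOP_WORDS):
    """Return the words of `title` that are not stop words, lowercased, in order."""
    # Split on anything that is not a letter, digit, or apostrophe.
    words = re.findall(r"[a-z0-9']+", title.lower())
    return [w for w in words if w not in stop_words]


if __name__ == "__main__":
    print(interesting_words("How to Build a Gemini Server in Rust"))
```

Lowercasing before the lookup keeps the match case-insensitive, which is roughly what `grep -vif` would do on a one-word-per-line input.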

👽 marginalia

Grab a word frequency list from somewhere (I think I've found them on Wikipedia at times), plop it into a file, read it into a set in Python, and match against that? · 10 months ago
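
The frequency-list idea could look something like the sketch below. It assumes the list is sorted most frequent first, with one word per line optionally followed by a count; that format, and the helper names `load_common_words` and `filter_common`, are assumptions for illustration and not something specified in the thread.

```python
def load_common_words(lines, top_n=1000):
    """Build a set from the first `top_n` words of a frequency list.

    `lines` is any iterable of strings (for example an open file), assumed to be
    sorted most frequent first, with the word as the first whitespace-separated field.
    """
    common = set()
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        common.add(parts[0].lower())
        if len(common) >= top_n:
            break
    return common


def filter_common(words, common):
    """Keep only the words that are not in the `common` set (case-insensitive)."""
    return [w for w in words if w.lower() not in common]


if __name__ == "__main__":
    sample = ["the 100", "of 90", "and 80", "gemini 1"]
    common = load_common_words(sample, top_n=3)
    print(filter_common(["The", "Gemini", "of", "protocol"], common))
```

The `top_n` cutoff is the knob for how aggressive the filter is: a small value throws out only the most common words, which is the behaviour a full spellchecking dictionary cannot give.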

👽 kevinsan

@ethereal I have used aspell in the past, but its dictionaries are too comprehensive. I just want to throw out the most common English words. I'm looking at nltk and TextBlob in Python at the moment. I'm trying to extract 'interesting words' from bookmark titles. · 10 months ago

👽 ethereal

Don't most Linux distros ship with a list of dictionary words, primarily for spellchecking? Maybe you could start from there. · 10 months ago