👽 kevinsan

I need a way to efficiently filter out common dictionary words from a list of strings, ideally using existing CLI tools or libraries.

3 years ago

5 Replies

👽 nfc

You might also want to filter out words that are not nouns, after removing stop words. nltk can give you part-of-speech tags. · 3 years ago

👽 akkartik

The keyword for this is "stop words". You can find lists of stop words online, and then filter them out using Python or `grep -vf`. · 3 years ago
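A minimal sketch of the stop-word approach in Python, assuming a lowercase stop-word set. The tiny `STOP_WORDS` set and the `interesting_words` helper are hypothetical names for illustration; a real list would be much longer and would normally be loaded from a file.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
# In practice, load a published list from a file, one word per line.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "on", "with", "is", "how"}


def interesting_words(title, stop_words=STOP_WORDS):
    """Return the lowercase words of `title` that are not stop words."""
    # Split on anything that is not a letter, digit, or apostrophe.
    words = re.findall(r"[a-z0-9']+", title.lower())
    return [w for w in words if w not in stop_words]


print(interesting_words("How to Build a Gemini Server in Rust"))
```

On the command line, with one word per line in both files, something like `grep -vixFf stopwords.txt words.txt` should do the same job: `-F` treats each stop word as a fixed string, `-x` requires a whole-line match, and `-i` ignores case. The file names here are placeholders.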

👽 marginalia

Grab a word frequency list off somewhere (I think I've found them on wikipedia at times), plop into a file, read it into a set in python, match vs that? · 3 years ago
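A rough sketch of the frequency-list idea, assuming the list has one `word count` pair per line; real lists vary in format, so the parsing would need adjusting. The names `load_common_words` and `drop_common` are made up for this example, and the `top_n` cutoff is a knob to tune until the output looks "interesting" enough.

```python
def load_common_words(lines, top_n=1000):
    """Build a set of the `top_n` most frequent words.

    Assumes each line looks like 'word count'; other lines are skipped.
    """
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2 or not parts[1].isdigit():
            continue  # skip headers, blank lines, and malformed rows
        counts[parts[0].lower()] = int(parts[1])
    # Sort words by descending count and keep only the most frequent ones.
    ranked = sorted(counts, key=counts.get, reverse=True)
    return set(ranked[:top_n])


def drop_common(words, common):
    """Keep only the words that are not in the common-word set."""
    return [w for w in words if w.lower() not in common]


# Inline sample data standing in for a real frequency file.
sample = ["the 500", "of 300", "gemini 2", "and 250"]
common = load_common_words(sample, top_n=3)
print(drop_common(["The", "Gemini", "of", "Protocol"], common))
```

With a real file, `load_common_words(open(path, encoding="utf-8"))` would work the same way, since an open file iterates over its lines. Using a set keeps each lookup constant-time, which matters once the bookmark list gets long.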

👽 kevinsan

@ethereal I have used aspell in the past, but its dictionaries are too comprehensive. I just want to throw out the most common English words. I'm looking at nltk and TextBlob in Python at the moment. I'm trying to extract 'interesting words' from bookmark titles. · 3 years ago

👽 ethereal

Don't most Linux distros ship with a list of dictionary words, primarily for spellchecking? Maybe you could start from there. · 3 years ago