I need a way to efficiently filter out common dictionary words from a list of strings, ideally using existing CLI tools or libraries.
3 years ago
You might want to filter out words that are not nouns, after removing stop words. nltk can give you the part of speech. · 3 years ago
The keyword for this is "stop words". You can find lists of stop words online, and then filter them out using Python or `grep -vf`. · 3 years ago
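A minimal sketch of the Python route, assuming the titles are plain English text. The `STOP_WORDS` set and the `interesting_words` name are made up for illustration; a real stop-word list downloaded from somewhere would be far longer.

```python
import re

# Hypothetical, deliberately tiny stop-word list; in practice you would
# load a published list from a file instead of hard-coding it.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "with", "on", "is", "how"}


def interesting_words(title, stop_words=STOP_WORDS):
    """Return the words in `title` that are not stop words, in original order."""
    # Lowercase first so "The" and "the" are treated the same.
    tokens = re.findall(r"[a-z']+", title.lower())
    return [t for t in tokens if t not in stop_words]


print(interesting_words("How to filter the common words in a list of strings"))
```

For one-word-per-line input on the command line, something like `grep -vxFf stopwords.txt words.txt` should do the same job (both file names are placeholders): `-x` matches whole lines only and `-F` treats the patterns as fixed strings, so "the" does not also knock out "theory".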
Grab a word frequency list from somewhere (I think I've found them on Wikipedia at times), plop it into a file, read it into a set in Python, and match against that? · 3 years ago
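Roughly what that could look like, assuming the frequency list is sorted most-frequent-first with the word as the first whitespace-separated field on each line. That format, the `top_n` cutoff, and both function names are assumptions for the sketch, not something a particular list guarantees.

```python
def load_common_words(lines, top_n=1000):
    """Collect the first `top_n` distinct words from a frequency list.

    `lines` is any iterable of text lines (an open file works). Each line is
    assumed to start with the word, optionally followed by a count.
    """
    common = set()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        common.add(fields[0].lower())
        if len(common) >= top_n:
            break
    return common


def drop_common(words, common):
    """Keep only the words that are not in the `common` set."""
    return [w for w in words if w.lower() not in common]
```

With a real list saved as, say, `freq.txt` (a placeholder name), you would call `load_common_words(open("freq.txt", encoding="utf-8"))` and pass the result to `drop_common`. Tuning `top_n` is how you decide how "common" a word has to be before it gets thrown out.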
@ethereal I have used aspell in the past, but its dictionaries are too comprehensive. I just want to throw out the most common English words. I'm looking at nltk and TextBlob in Python at the moment. I'm trying to extract 'interesting words' from bookmark titles. · 3 years ago
Don't most Linux distros ship with a list of dictionary words, primarily for spellchecking? Maybe you could start from there. · 3 years ago