Unicode matching bug in AutoModerator

https://www.reddit.com/r/AutoModerator/comments/bn4u8j/unicode_matching_bug_in_automoderator/

created by dequeued on 10/05/2019 at 22:37 UTC*

14 upvotes, 5 top-level comments (showing 5)

At some point on or shortly before April 11th, something changed how AutoModerator matches Unicode text, and this broke some rules. Rules that deal with non-ASCII characters now match incorrectly, and multiple subreddits are affected.

Here's a small example that reproduces the issue:

--------------------------------------------------------------------------------

title+body (includes, case-sensitive): ['â']
moderators_exempt: false
action: filter
action_reason: "Test rule [{{match}}]"

--------------------------------------------------------------------------------

This rule matches on `’` (RIGHT SINGLE QUOTATION MARK U+2019).

Now, because `â` is U+00E2 and `’` just happens to be encoded as 0xE2 0x80 0x99 in UTF-8, I suspected that some change may have screwed up how text is handled in AutoModerator (or perhaps how text is being manipulated prior to AutoModerator processing). To confirm this, I also tested `†` (DAGGER U+2020) which is encoded as 0xE2 0x80 0xA0 in UTF-8. It also triggers the same incorrect match of `â`.
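
The byte-level coincidence is easy to check outside of reddit. Here is a minimal Python sketch; it only inspects the UTF-8 encodings of the characters named above and says nothing about how AutoModerator handles text internally. The variable names are just for illustration.

```python
# Characters from the test: the rule's pattern and the two that matched by mistake.
pattern = "\u00e2"                  # â (U+00E2)
false_hits = ["\u2019", "\u2020"]   # ’ and †

for ch in false_hits:
    raw = ch.encode("utf-8")
    hex_bytes = " ".join(f"0x{b:02X}" for b in raw)
    # The first UTF-8 byte has the same numeric value as the code point of â.
    same = raw[0] == ord(pattern)
    print(f"U+{ord(ch):04X}: {hex_bytes}  first byte equals U+00E2: {same}")
```

Both lines report `True`: the lead byte of each encoding is 0xE2, which is also the code point of `â`.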

If an admin is reading this, you can see my test page at http://redd.it/bn4fld[1] and check the AutoModerator logs for matches that make no sense on that subreddit.

1: http://redd.it/bn4fld

Finally, comments and submissions that *should* trigger this rule (i.e., ones with an `â` present) no longer match.
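
The missed matches fit the same picture. The sketch below models the suspected bug as "every UTF-8 byte is read as one character" (Latin-1 decoding is used for that here). This is an assumption about the cause, not something confirmed, and `misdecode` is a made-up helper name.

```python
def misdecode(text: str) -> str:
    """Hypothetical model of the bug: read each UTF-8 byte as one character."""
    return text.encode("utf-8").decode("latin-1")

pattern = "\u00e2"             # â, the string the rule looks for

seen = misdecode("\u00e2")     # what the matcher would see for a real â
print([f"U+{ord(c):04X}" for c in seen])
print(pattern in seen)                  # a real â is missed
print(pattern in misdecode("\u2019"))   # while ’ is matched
```

Under that assumption a real `â` arrives as the two characters U+00C3 U+00A2, so a rule looking for U+00E2 never sees it.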

Edit:

I'm pretty sure it's some sort of double-encoding or UTF-8 encoding issue. I tested a different rule with `ã` (U+00E3) and lo and behold, it matches on `あ` (U+3042 HIRAGANA LETTER A) because AutoModerator is passed 0xE3 0x81 0x82 (the UTF-8 encoding of `あ`) instead of the properly decoded Unicode text.
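
If that theory is right, the false matches would not be limited to the characters tested so far. Every code point whose UTF-8 encoding starts with the same lead byte would hit the rule. Here is a rough Python sketch of that arithmetic; `lead_byte_block` is a hypothetical helper, and it only covers patterns in U+00E0..U+00EF, whose values double as 3-byte lead bytes.

```python
def lead_byte_block(ch: str) -> range:
    """Code points whose 3-byte UTF-8 encoding starts with the byte ord(ch).

    Only meaningful for U+00E0..U+00EF. The surrogate gap (U+D800..U+DFFF)
    under lead byte 0xED is not excluded here.
    """
    value = ord(ch)
    if not 0xE0 <= value <= 0xEF:
        raise ValueError("expected a character in U+00E0..U+00EF")
    # A 3-byte lead byte 1110xxxx carries the top four bits of the code point.
    start = (value & 0x0F) << 12
    # 0xE0 cannot encode anything below U+0800 (those would be overlong forms).
    return range(max(start, 0x0800), start + 0x1000)

for ch in ["\u00e2", "\u00e3"]:
    block = lead_byte_block(ch)
    print(f"U+{ord(ch):04X} -> U+{block.start:04X}..U+{block.stop - 1:04X}")
```

So a rule for `â` would be expected to fire on anything in U+2000..U+2FFF (which includes U+2019 and U+2020), and a rule for `ã` on anything in U+3000..U+3FFF (which includes U+3042).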

Comments

Comment by redtaboo at 10/05/2019 at 23:18 UTC*

7 upvotes, 3 direct replies

Heya -- sorry about this, we're aware of this and are looking into it. Unfortunately, we may not have more information until next week, but I'll keep you posted!

Thank you for this detailed information as well -- I notice you're seeing an issue specifically with matching â incorrectly; /u/shiruken mentioned something similar but different elsewhere. That might be a lead, thank you!!

Comment by roionsteroids at 10/05/2019 at 23:12 UTC

2 upvotes, 1 direct replies

It has always been kinda buggy, especially with ranges.

Comment by dequeued at 10/05/2019 at 22:38 UTC

1 upvotes, 0 direct replies

tagging /u/alienth

Comment by Bardfinn at 10/05/2019 at 23:16 UTC

1 upvotes, 1 direct replies

I've talked with some folks who have had difficulty with U+2019 not registering in the appropriate Unicode groups for regex -- as here[1]. A workaround was found for them to handle emoji without using the Unicode classes, and I never followed up to write a test rig to print out reports showing the `{{match}}` for the regex to demonstrate the scope of the problem.

1: https://www.reddit.com/r/AutoModerator/comments/bjsax4/the_emoji_rule_which_was_taken_directly_from_the/

Comment by Djentleman420 at 11/05/2019 at 19:45 UTC

1 upvotes, 1 direct replies

This explains why my attempt at a rule isn't working. I am trying to remove posts that use any non-standard Latin characters in titles. This is what I was trying:

priority: 2
title (includes, regex): ['[^\u0000-\u007f]']
moderators_exempt: false
comment: |
    Your submission has been removed. The title may only include standard Latin characters
    (those on your keyboard).

    If you wish to re-submit, please do so with only standard characters.
action: remove
action_reason: "Non-Standard Characters In Title"
---

Do you think it would work if I were to replace the Unicode range with every individual character?