We're getting a lot of spam using Chinese character sets, so I set out to devise a subject line filter that passes only if it contains only English characters. The problem is that it is tripping occasionally on 100% pure English subject lines. I'm looking for help with this.
The RegEx is as follows:
[^\cA-\cZa-z0-9A-Z `~!@#$%^&*()_=+[{}|;:'",<.>/?\\\-\]\u2013]
Which in simple English is: when the text contains anything that is *NOT*
- a Control Character (char code 1 to 26)
- a lowercase a-z
- a numeral 0-9
- an Uppercase A-Z
- any of the following punctuation: <space>`~!@#$%^&*()_=+[{}|;:'",<.>/?
- any of the following special characters that require escaping: \-]
- the special Unicode character U+2013 (an en dash used by some mail programs)
However, the following subject line gets flagged by this:
New: Manage Avira Antivirus With O&O Syspectr
I have tested this with both the great online tool RegExr at https://regexr.com/ and Regex Match Tracer 2.1, and neither of them matched.
However, Match Tracer v4 matches on the first letter of the string?
Can someone please assist? As you can see, I'm willing to put the time and effort in to get this right, and to share my results...
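For anyone who wants to cross-check the class outside the mail filter, here is a minimal Python sketch of the same expression. It assumes Python's `re` module rather than the filter's own engine, so it only shows what the character class itself accepts. Python has no `\cA`-style escapes, so `\x01-\x1a` stands in for `\cA-\cZ`, and the names `NOT_ALLOWED` and `first_disallowed` are made up for illustration.

```python
import re

# Python's re module has no \cA-style control escapes, so \x01-\x1a stands in
# for \cA-\cZ (char codes 1 to 26). The rest mirrors the original class, with
# the literal "[" escaped to avoid Python's nested-set warning.
NOT_ALLOWED = re.compile(
    r"""[^\x01-\x1aa-z0-9A-Z `~!@#$%^&*()_=+\[{}|;:'",<.>/?\\\-\]\u2013]"""
)

def first_disallowed(subject):
    """Return the first character outside the allowed set, or None."""
    match = NOT_ALLOWED.search(subject)
    return match.group(0) if match else None

subject = "New: Manage Avira Antivirus With O&O Syspectr"
print(first_disallowed(subject))           # None
print(first_disallowed("Caf\u00e9 menu"))  # é
```

Under these assumptions the subject line above is not flagged, which agrees with the RegExr result and suggests the difference lies in how a particular engine parses the class rather than in the subject text.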
Regex help request: Filter Only English characters.
- rusticdog
- Firetrust Monkey
Post
Re: Regex help request: Filter Only English characters.
The DEELX engine we use does support POSIX sets (http://www.regexlab.com/en/deelx/syntax/bas_set.htm), so you ought to be able to simplify the filter by just using
[:ascii:]
That covers the characters listed at http://www.asciitable.com/
Does that do the same?
- DanAtCCD
- Travelling Tuatara
Post
Re: Regex help request: Filter Only English characters.
[^:ascii:] seems to help a lot -- I'll continue testing with the special characters (like © and ® and special dashes) and see if I can get it to be reliable.
Other ideas also appreciated...
- rusticdog
- Firetrust Monkey
Post
Re: Regex help request: Filter Only English characters.
We have some built-in language filters for emails that declare certain character sets; however, these won't catch UTF-8 encoded messages.
Click the drop-down beside 'Add Rule' in the filter, choose the Character Set rule, then use the next drop-down to choose which languages you want to filter on.
- DanAtCCD
- Travelling Tuatara
Post
Re: Regex help request: Filter Only English characters.
We're already using _all_ of those filters (we have no clients in non-English countries), and it's still not filtering these crafty e-mails.
By limiting the Subject line to use only English characters, with only a few exceptions, we hope to avoid non-English messages as well as any message that attempts to use Unicode lookalikes to bypass filters. No reasonable person attempts to use an 'a' that isn't really a char(95) (or was it 60? Oh, never mind.)
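To illustrate the lookalike problem, here is a small Python sketch comparing a Latin 'a' with the Cyrillic letter that renders almost identically. The sample word is made up; the code points and names come from Python's standard `unicodedata` module.

```python
import unicodedata

latin_a = "a"
cyrillic_a = "\u0430"  # renders like a Latin 'a' in most fonts

for ch in (latin_a, cyrillic_a):
    print(ord(ch), unicodedata.name(ch))
# 97 LATIN SMALL LETTER A
# 1072 CYRILLIC SMALL LETTER A

# A word that looks plain but hides the Cyrillic letter is not pure ASCII.
print("p\u0430ypal".isascii())  # False
```

An ASCII-only test catches this kind of substitution without having to list every lookalike.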
- DanAtCCD
- Travelling Tuatara
Post
Re: Regex help request: Filter Only English characters.
I tried using [:ascii:], but it doesn't seem to be working. Can I use Regex Match Tracer to test this?
I would have expected that the first English character would have matched...? (I should have used [:ascii:].* if I wanted all of them, but that doesn't work either.)
- rusticdog
- Firetrust Monkey
Post
Re: Regex help request: Filter Only English characters.
Are you hitting F3 (the Match button) to trigger the search? I'm not having any issue getting a hit with my example text.
I think to match all ASCII you would use
[:ascii:]+
- DanAtCCD
- Travelling Tuatara
Post
Re: Regex help request: Filter Only English characters.
Doh! And here I was thinking the output was immediate.
Given that I only need to confirm the presence of even one non-ASCII character, I believe a rule of "Subject Containing [:^ascii:]" should work.
Fingers crossed, I'll report back once I have a few more garbage e-mails, likely after the weekend...
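The "at least one non-ASCII character" idea can be sketched in Python as below. This uses Python's `re` with an explicit range instead of the POSIX-style set, so it is only an illustration of the logic; whether `[:^ascii:]` behaves the same way inside the filter still has to be tested there. The names `NON_ASCII` and `has_non_ascii` are hypothetical.

```python
import re

# Matches any single character outside the 7-bit ASCII range (codes 0-127).
NON_ASCII = re.compile(r"[^\x00-\x7f]")

def has_non_ascii(subject):
    """True if the subject contains at least one non-ASCII character."""
    return NON_ASCII.search(subject) is not None

print(has_non_ascii("New: Manage Avira Antivirus With O&O Syspectr"))  # False
print(has_non_ascii("Registered\u00ae trademark"))                      # True
```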
- rusticdog
- Firetrust Monkey
Post
Re: Regex help request: Filter Only English characters.
Yeah, that program does do some odd highlighting that gives you that impression. Fingers crossed MW works as expected.
- DanAtCCD
- Travelling Tuatara
Post
Re: Regex help request: Filter Only English characters.
[:^ascii:] didn't catch a \u00AE (code point 174) (https://unicode-table.com/en/#00AE). Looks like I'll have to deal with that one, and with \u00A9 (code point 169) (https://unicode-table.com/en/#00A9), separately. I haven't gotten a glut of foreign subject lines this weekend like I usually do, so I haven't been able to compare against the regular corpus of foreign spam.
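Both characters sit above 127, so a strict code-point test would flag them. A rough Python sketch of handling them separately, either flagging or exempting them, could look like this. The function name `non_ascii_chars` and its `allowed_extras` parameter are invented for the example and are not part of any filter's API.

```python
def non_ascii_chars(subject, allowed_extras=frozenset()):
    """List characters above code point 127 that are not explicitly exempted."""
    return [ch for ch in subject if ord(ch) > 127 and ch not in allowed_extras]

print(ord("\u00ae"), ord("\u00a9"))  # 174 169

# Strict mode: the registered sign is reported.
print(non_ascii_chars("Avira\u00ae Antivirus"))  # ['®']

# With an exemption set, the same subject passes.
print(non_ascii_chars("Avira\u00ae Antivirus", {"\u00ae", "\u00a9"}))  # []
```

Keeping the exceptions in one explicit set makes it easy to add the en dash (\u2013) or other symbols later without touching the main test.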