+ 1
How to transform a Unicode message to lowercase
I want to convert a UTF-8 encoded message to lowercase. I use Windows 10/11, so I would prefer the Windows API if possible; if not, a third-party library is fine. The conversion must be correct for Chinese, Russian and other languages too. How can I do that?
16 Answers
+ 4
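#include <cctype>
#include <iostream>
#include <string>
using namespace std;
int main()
{
    string str = "JaScript";
    // tolower() works on one byte at a time, so only ASCII letters change
    for (auto& i : str) {
        // cast to unsigned char first: passing a negative char is undefined behavior
        i = static_cast<char>(tolower(static_cast<unsigned char>(i)));
    }
    cout << str;
}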
+ 3
Bob_Li after a lot of experimenting I found a Windows API that may work for all languages. However, it takes a locale argument which, as I understand it, describes my own system's settings, such as the date/time format and how floating-point numbers are presented. I don't know what this has to do with converting characters to lowercase, and I don't like that it needs an argument tied to system preferences. My app must run on every system, and the results must not depend on the system settings.
I don't want to display the result. I use it in an advanced system that blocks spamming and advertising bots. I need to convert all characters to lowercase (and possibly do more processing, like removing spaces and tabs) and then check whether specific words or parts of known advertising messages appear in the message a user sent.
I built something similar that processes message input before passing it to the decision part of an AI model I created, but that AI is designed to work only with English.
+ 3
Since you mentioned Greek and gave an example, I did a little searching on case conversion in Greek.
It turns out to be much more difficult than I thought. There seem to be special rules and cases when converting to lowercase.
https://leonidas.net/greek-type-design/case-conversion-for-monotonic-greek/
+ 2
john ds
The code JaScript posted runs fine with Chinese, and I tested it with Japanese too.
However, these languages don't have uppercase and lowercase, so the output is the same as the input string.
https://sololearn.com/compiler-playground/cBPg9AlFQVlO/?ref=app
+ 2
Bob_Li trust me, if I had full control of the messages sent across the whole network, I would simply detect those bots with other techniques, without even analyzing their text, and ban or block their accounts from sending messages. Unfortunately I have no control over or access to the databases. I cannot know how many times and to whom someone sent the same message, or how many messages they sent in, say, the last hour, so I cannot rate limit them.
The only option is to decide quickly, based on the message I received, whether or not to show it to the user.
+ 1
JaScript Wong Hei Ming tolower() does not work with Greek words. For example, "Καλό" must convert to "καλό". I assume it only works for ASCII characters.
+ 1
john ds
Do some experimenting on your system. Are the results going to be displayed in a GUI app, in a console terminal, saved to a file, or displayed in a browser?
Do you have the fonts needed to display the other languages?
+ 1
Bob_Li the AI model I mentioned is used for a 'virtual assistant' and isn't related to anti-advertising. I only meant that for this AI I also convert the input to lowercase, using just a bit shift and a check on the character range (it must be ASCII), so it wasn't a problem. Now it is a problem for the anti-advertising bots. This old school rule-based approach works really well: it's very easy to implement, has minimal false positives and should work for every user worldwide. I just have to keep the blacklist up to date.
Wong Hei Ming I know. Every language has its own rules. Based on Greek, which I know well, there is no need for AI to convert a message to lowercase (unlike translation), so I can avoid web services that slow things down. It's a standard process.
But we cannot always convert one character in isolation. For example, in English the letter S has two forms { S, s }. In Greek it has three { Σ, σ, ς }: 'Σ' is uppercase and "σ, ς" are lowercase. The uppercase form is always 'Σ', but the lowercase form is 'ς' when the letter is the last one in the word and 'σ' otherwise.
+ 1
Bob_Li not really, they send the exact same message all the time. I am not comparing the whole message, only specific domains, invite links and some keywords like their service name. What they can do is add spaces inside the URL or change some letters between upper and lower case. This is what I am trying to cover.
+ 1
john ds
Yes, there are more efficient ways of filtering spam than content filtering. Blacklisting IPs is probably the simplest.
0
You most likely do not need the Windows API for this. The message is just a string, correct?
0
Wilbur Jaywright it's a const char*, but I can convert it to a string if needed.
0
JaScript will this work with Chinese, Turkish, Russian, etc.? In UTF-8 not all characters are 1 byte.
0
john ds
OK, but wouldn't that be the same as ignoring the non-ASCII data? The spam might have been written in a foreign language...
Current strategies for spam filtering lean more toward training AI to do the job than toward the old school rule-based approach.
0
john ds
You must have a really good blacklist. 😁 The people updating it are doing a good job. Going multilingual is going to be a big added workload, especially since spammers are using AI to blend in better with normal messages.
0
john ds
Then you are lucky not to be attracting the smart ones...😁