+ 1

How to transform a Unicode message to lowercase

I want to convert a UTF-8 encoded message to lowercase. I use Windows 10/11, so I would prefer to use the Windows API if possible; if not, a third-party library is fine. The conversion should be correct for Chinese, Russian, and other languages too. How can I do that?

22nd May 2024, 9:03 PM
john ds
16 Answers
+ 4
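#include <cctype>
#include <iostream>
#include <string>
using namespace std;

int main() {
    string str = "JaScript";
    // tolower() works one byte at a time, so only single-byte (ASCII) letters change.
    // The cast to unsigned char avoids undefined behavior for negative char values.
    for (auto& i : str) {
        i = static_cast<char>(tolower(static_cast<unsigned char>(i)));
    }
    cout << str;
}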
22nd May 2024, 10:21 PM
JaScript
+ 3
Bob_Li After a lot of experimenting I found a Windows API that may work for all languages. However, it takes a locale argument which, as I understand it, describes my own system: date/time format, floating-point number presentation, and so on. I don't know what this has to do with converting characters to lowercase, and I don't like that it needs an argument tied to system preferences. My app must run on every system, and the results must not depend on the system settings. I don't want to display the result. I use it in an advanced system that blocks spamming and advertising bots. I need to convert all characters to lowercase (and possibly do more processing, like removing spaces and tabs) and then check whether specific words or parts of known advertising messages appear in the message a user sent. I built something similar to preprocess message input before passing it to the decision part of an AI model I created, but that AI is designed to work only with English.
23rd May 2024, 2:22 AM
john ds
+ 3
Since you mentioned Greek and gave an example, I did a little searching on case conversion in Greek, and it turns out to be much more difficult than I thought. It seems there are special rules and cases when converting to lowercase. https://leonidas.net/greek-type-design/case-conversion-for-monotonic-greek/
23rd May 2024, 8:41 AM
Wong Hei Ming
+ 2
john ds The code JaScript posted works with Chinese, and I tested it with Japanese too. However, these languages have no uppercase or lowercase, so the output is the same as the input string. https://sololearn.com/compiler-playground/cBPg9AlFQVlO/?ref=app
23rd May 2024, 1:23 AM
Wong Hei Ming
+ 2
Bob_Li Trust me, if I had full control of the messages sent across the whole network, I would simply detect those bots with other techniques, without even analyzing their text, and ban or block their accounts from sending messages. Unfortunately, I have no control over or access to the databases. I cannot know how many times, or to whom, someone sent the same message, or how many messages they sent in, say, the last hour, so I cannot rate limit them. The only way is to decide quickly, based on the message I received, whether or not to show it to the user.
23rd May 2024, 11:47 AM
john ds
+ 1
JaScript Wong Hei Ming tolower() does not work with Greek words. For example, "Καλό" must convert to "καλό". I assume it only works for ASCII characters.
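A minimal sketch of why the byte-wise loop misses "Καλό" and what a code-point-wise version might look like: decode UTF-8 into code points, lowercase each one, and encode back. The helper names (decode_utf8, append_utf8, lower_code_point, to_lower_utf8) are hypothetical, the input is assumed to be well-formed UTF-8, and the mapping covers only Basic Latin plus the unaccented Greek and Cyrillic capitals. It is not a full Unicode case mapping (accented capitals such as Ά and context rules are left out); a real application would more likely rely on a Unicode library.

```cpp
#include <string>

// Decode one UTF-8 code point starting at s[i] and advance i past it.
// Assumption: the input is well-formed UTF-8; no validation is done.
char32_t decode_utf8(const std::string& s, std::size_t& i) {
    unsigned char b = static_cast<unsigned char>(s[i++]);
    if (b < 0x80) return b;                              // 1-byte (ASCII)
    int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;   // continuation bytes
    char32_t cp = b & (0x3F >> extra);                   // payload bits of lead byte
    while (extra-- > 0 && i < s.size())
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Append one code point to out, encoded as UTF-8.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Deliberately incomplete mapping, for illustration only.
char32_t lower_code_point(char32_t cp) {
    if (cp >= U'A' && cp <= U'Z') return cp + 32;                      // Basic Latin A..Z
    if (cp >= 0x0391 && cp <= 0x03A9 && cp != 0x03A2) return cp + 32;  // Greek Α..Ω
    if (cp >= 0x0410 && cp <= 0x042F) return cp + 32;                  // Cyrillic А..Я
    if (cp >= 0x0400 && cp <= 0x040F) return cp + 80;                  // Cyrillic Ѐ..Џ
    return cp;  // everything else (Chinese, digits, punctuation) is unchanged
}

// Lowercase a UTF-8 string one code point at a time.
std::string to_lower_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size();)
        append_utf8(out, lower_code_point(decode_utf8(in, i)));
    return out;
}
```

With this sketch, to_lower_utf8 turns the UTF-8 bytes of "Καλό" into those of "καλό", and Chinese text passes through unchanged. Nothing here depends on the system locale.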
23rd May 2024, 2:05 AM
john ds
+ 1
john ds Do some experimenting on your system. Will the results be displayed in a GUI app? A console terminal? Saved to a file? Displayed in a browser? Do you have the fonts needed to display the other languages?
23rd May 2024, 2:08 AM
Bob_Li
+ 1
Bob_Li The AI model I mentioned is used for a 'virtual assistant' and isn't related to anti-advertising. I only meant that for this AI I also convert the input to lowercase, using just a bit shift and a check on the character range (it must be ASCII), so it wasn't a problem there. Now it is a problem for the anti-advertising bots. This old-school rule-based approach works really well: it's very easy to implement, has minimal false positives, and should work for every user worldwide. I just have to keep the blacklist up to date.
Wong Hei Ming I know. Every language has its own rules. Based on Greek, which I know well, no AI is needed to convert a message to lowercase (unlike translation), so I can avoid web services that delay things. It's a standard process. But we cannot always convert one character in isolation. For example, in English the letter S has two forms { S, s }. Greek has three { Σ, σ, ς }: 'Σ' is uppercase, and 'σ' and 'ς' are both lowercase. Both lowercase forms always convert to 'Σ', but when lowering 'Σ', the result must be 'ς' if the letter is the last one in the word, and 'σ' otherwise.
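The sigma rule described above could be sketched roughly as follows, working on an already decoded std::u32string. The names lower_sigma and is_word_letter are hypothetical, and the "is this a letter" test is a simplification limited to ASCII letters and the main Greek range; the Unicode Final_Sigma condition is defined more carefully than this. Only 'Σ' is touched, so other capitals are assumed to be lowered elsewhere.

```cpp
#include <string>

constexpr char32_t kCapitalSigma = 0x03A3;  // Σ
constexpr char32_t kSmallSigma   = 0x03C3;  // σ
constexpr char32_t kFinalSigma   = 0x03C2;  // ς

// Simplified word-letter test (an assumption of this sketch):
// ASCII letters, or U+0386..U+03CE except U+0387 (Greek ano teleia).
bool is_word_letter(char32_t cp) {
    if ((cp >= U'A' && cp <= U'Z') || (cp >= U'a' && cp <= U'z')) return true;
    return cp >= 0x0386 && cp <= 0x03CE && cp != 0x0387;
}

// Lowercase every capital sigma, choosing the final form 'ς' when the
// sigma ends a word (a letter before it, no letter after it).
std::u32string lower_sigma(const std::u32string& in) {
    std::u32string out = in;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != kCapitalSigma) continue;
        bool after_letter  = i > 0 && is_word_letter(in[i - 1]);
        bool before_letter = i + 1 < in.size() && is_word_letter(in[i + 1]);
        out[i] = (after_letter && !before_letter) ? kFinalSigma : kSmallSigma;
    }
    return out;
}
```

For blacklist matching it may be simpler to sidestep the rule and fold both 'σ' and 'ς' to a single form on both the message and the blacklist entries, since the result is never displayed.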
23rd May 2024, 10:26 AM
john ds
+ 1
Bob_Li Not really, they send the exact same message all the time. I'm not comparing the whole message, only specific domains, invite links, and some keywords like their service name. What they can do is add spaces inside a URL or change some uppercase letters to lowercase. This is what I'm trying to cover.
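A rough sketch of that check, assuming the blacklist entries are stored already lowercased and without spaces. The function names (normalize, contains_blacklisted) and the sample domain are made up for illustration. Only ASCII letters are lowercased here, on the assumption that domains and invite links are mostly ASCII; all other bytes are copied through unchanged.

```cpp
#include <string>
#include <vector>

// Drop spaces, tabs, and line breaks, and lowercase ASCII letters.
// Non-ASCII bytes (multi-byte UTF-8 sequences) are copied unchanged.
std::string normalize(const std::string& msg) {
    std::string out;
    out.reserve(msg.size());
    for (unsigned char c : msg) {
        if (c == ' ' || c == '\t' || c == '\r' || c == '\n') continue;
        out += (c >= 'A' && c <= 'Z') ? static_cast<char>(c + 32)
                                      : static_cast<char>(c);
    }
    return out;
}

// True if any blacklist entry occurs in the normalized message.
// Assumption: entries are already normalized (lowercase, no whitespace).
bool contains_blacklisted(const std::string& msg,
                          const std::vector<std::string>& blacklist) {
    const std::string norm = normalize(msg);
    for (const std::string& entry : blacklist) {
        if (!entry.empty() && norm.find(entry) != std::string::npos)
            return true;
    }
    return false;
}
```

With a blacklist containing the hypothetical entry "spamexample.com", a message like "Join now: Spam Example . COM /invite" is flagged. One trade-off of stripping all whitespace is that an entry can match across word boundaries in innocent text, so short keywords deserve some care.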
23rd May 2024, 11:26 AM
john ds
+ 1
john ds Yes, there are more efficient ways of filtering spam than content filtering. Blacklisting IPs is probably the simplest.
23rd May 2024, 12:17 PM
Bob_Li
0
You most likely do not need the Windows API for this. The message is just a string, correct?
22nd May 2024, 10:17 PM
Wilbur Jaywright
0
Wilbur Jaywright It's a const char*, but I can convert it to a string if needed.
23rd May 2024, 12:05 AM
john ds
0
JaScript Will this work with Chinese, Turkish, Russian, etc.? In UTF-8, not all characters are 1 byte.
23rd May 2024, 12:07 AM
john ds
0
john ds OK, but wouldn't that just be the same as ignoring the non-ASCII data? The spam might have been written in a foreign language... Current strategies for spam filtering lean more toward training AI to do the job than toward the old-school rule-based approach.
23rd May 2024, 4:14 AM
Bob_Li
0
john ds You must have a really good blacklist. 😁 The people updating it are doing a good job. Going multilingual is going to be a big added workload, especially since spammers are using AI to blend in better with normal messages.
23rd May 2024, 11:20 AM
Bob_Li
0
john ds Then you are lucky not to be attracting the smart ones...😁
23rd May 2024, 11:28 AM
Bob_Li