+ 5
Sucheta, if this is your first time sorting Unicode characters, you may need to get familiar with a collator library.
A collator compares two Unicode strings and decides which one comes first under the rules of a given locale.
Try looking into this one for Go.
https://pkg.go.dev/golang.org/x/text@v0.3.6/collate
With examples:
https://discourse.gohugo.io/t/sort-slice-with-accented-characters/29217
https://blog.golang.org/matchlang
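To see why a collator is needed at all, here is a minimal sketch that uses only the standard library. `SortByBytes` is a name I made up for illustration; it just wraps `sort.Strings`, which compares strings byte by byte and knows nothing about locales.

```go
package main

import (
	"fmt"
	"sort"
)

// SortByBytes (hypothetical helper) returns a sorted copy of words
// using Go's default byte-wise string comparison, with no locale rules.
func SortByBytes(words []string) []string {
	out := append([]string(nil), words...)
	sort.Strings(out)
	return out
}

func main() {
	// Uppercase ASCII sorts before lowercase, and "é" (bytes C3 A9)
	// sorts after every ASCII letter.
	fmt.Println(SortByBytes([]string{"zebra", "éclair", "apple", "Zulu"}))
	// prints: [Zulu apple zebra éclair]
}
```

A dictionary would normally put "éclair" between "apple" and "zebra". That locale-aware ordering is what the collate package linked above is meant to provide; I left it out of the sketch because it is not part of the standard library.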
Here are a couple of other articles on this topic for other programming languages:
https://lemire.me/blog/2018/12/17/sorting-strings-properly-is-stupidly-hard/
https://unix.stackexchange.com/questions/423345/generate-collating-order-of-a-list-of-individual-characters
Hope this helps.
+ 3
Does this work well enough for you?
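package main

import (
	"fmt"
	"unicode/utf8"
)

// Reverse returns s with its code points (runes) in reverse order.
// It does not keep combining marks attached to their base characters.
func Reverse(s string) string {
	totalLength := len(s)
	buffer := make([]byte, totalLength)
	for i := 0; i < totalLength; {
		// Decode the next rune and find out how many bytes it occupies.
		r, size := utf8.DecodeRuneInString(s[i:])
		i += size
		// Write the rune at the mirrored position from the end of the buffer.
		utf8.EncodeRune(buffer[totalLength-i:], r)
	}
	return string(buffer)
}

func main() {
	tests := [...]string{
		"Neuquén",
		"ওহে বিশ্ব", // expected "শ্ববি হেও"
		"नमस्ते दुनिया", // expected "यानिदु स्तेमन"
		"안녕하세요 세계",
		"こんにちは世界",
		"Hello World",
	}
	for _, test := range tests {
		fmt.Println(Reverse(test))
	}
}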
I copied the Reverse implementation from: https://www.socketloop.com/tutorials/golang-reverse-a-string-with-unicode
+ 3
Sucheta, I'm glad I was able to help.
For clarification, I'm not yet familiar with Go myself.
However, sorting (collating) Unicode text according to the DUCET (Default Unicode Collation Element Table), with tailoring for each language and culture, is not exclusive to Go or any particular programming language. So... this will be good knowledge to have in the long run. 😉
You can learn more about this by reviewing the Unicode technical standard known as the:
Unicode Collation Algorithm
https://unicode.org/reports/tr10/
Here are some other links that might help with a general understanding:
https://en.m.wikipedia.org/wiki/Collation
https://en.m.wikipedia.org/wiki/Unicode_collation_algorithm
Unfortunately... I'm not personally aware of any options to import packages outside the default environment here on SoloLearn.
It would be nice if some packages, like those from golang.org/x/text, could be included by default on SoloLearn.
+ 1
I'll keep troubleshooting it tomorrow.
+ 1
Sucheta, do you know the Bengali or Hindi languages? I wonder if there are any special rules around leading or trailing spaces that change how the characters are displayed.
I did the following to troubleshoot but didn't fix the problem yet:
- Validated with https://onlineutf8tools.com/validate-utf8 and https://validator.w3.org/ that the source code uses valid UTF-8 encoding.
- Ran several Go string-reverse implementations on your test cases but kept getting the same undesirable results, with the circle-like shape appearing.
- Looked at the test cases in a hex editor, especially around the UTF-8 encoded strings. Regular space characters were represented by 0x20 everywhere, as expected. Each Bengali and Hindi character was represented by multiple bytes, as expected.
- Added spaces around every character and found even more undesirable circle-like characters rendered in the reversed output.
tests := [...]string{
	"Neuquén",
	"ওহে বিশ্ব", // expected "শ্ববি হেও"
	"ও হে বি শ্ব", // expected "শ্ব বি হে ও"
	"नमस्ते दुनिया", // expected "यानिदु स्तेमन"
	"न म स्ते दु नि या", // expected "या नि दु स्ते म न"
	"안녕하세요 세계",
	"こんにちは世界",
	"Hello World",
}
+ 1
The following HEX representation of character: হে
E0 A6 B9 E0 A7 87
Is reversed as: েহ
Represented in hex by:
E0 A7 87 E0 A6 B9
One odd thing about this case is that my cursor moves over the original as if it were a single character, but it moves over the reversed version above as if there were 2 characters. The 2 characters in the reversed version are: ে and হ
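To double-check what those bytes decode to, here is a small sketch that lists the code points behind a string. `CodePoints` is a helper name I made up; it only uses `range` over the string and `fmt.Sprintf` with the `%U` verb.

```go
package main

import "fmt"

// CodePoints (hypothetical helper) returns the U+XXXX notation
// of every code point in s, in order.
func CodePoints(s string) []string {
	var out []string
	for _, r := range s {
		out = append(out, fmt.Sprintf("%U", r))
	}
	return out
}

func main() {
	// "% X" prints the raw UTF-8 bytes separated by spaces.
	for _, s := range []string{"হে", "안"} {
		fmt.Printf("% X -> %v\n", []byte(s), CodePoints(s))
	}
}
```

For হে this reports two code points, U+09B9 followed by U+09C7, while the Korean 안 is the single code point U+C548. So the 6 bytes are really two separate code points, and a rune-by-rune reverse swaps them.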
The following article mentions "combining characters" which is very likely related to the problem:
https://onlineunicodetools.com/add-combining-characters
It says: "This utility adds combining characters to your Unicode data. Combining characters are small glyphs and marks that are added above, below, or on the main symbol. These marks can't be used as independent characters and they are intended only for modifying the main (base) character. Diacritical marks change the sound meaning of the letters to which they are added. For example, the word "naïve" uses the "◌̈" diaeresis mark, the word "saké" uses the "◌́" acute accent mark, and the word "breathèd" use the "◌̀" grave accent mark. "
Here is an article on combining characters:
https://en.wikipedia.org/wiki/Combining_character
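Go's standard `unicode` package can tell you whether a rune is a combining mark: `unicode.IsMark` reports true for anything in the Unicode "M" (Mark) categories. Here is a rough sketch; `CountMarks` is a hypothetical helper of mine, not something from the articles.

```go
package main

import (
	"fmt"
	"unicode"
)

// CountMarks (hypothetical helper) counts the runes in s that belong
// to a Unicode Mark category (Mn, Mc or Me).
func CountMarks(s string) int {
	n := 0
	for _, r := range s {
		if unicode.IsMark(r) {
			n++
		}
	}
	return n
}

func main() {
	// Print each code point of the troublesome pair and whether it is a mark.
	for _, r := range "হে" {
		fmt.Printf("%U mark=%t\n", r, unicode.IsMark(r))
	}
	// "nai\u0308ve" spells naïve with a separate combining diaeresis.
	fmt.Println(CountMarks("nai\u0308ve"))
}
```

In the Bengali pair, the second code point (U+09C7, the vowel sign) is the one reported as a mark, which fits the idea that it is only meant to modify the letter before it.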
The Korean character you mentioned: 안
Is represented by this Hex:
EC 95 88
Your reverse function keeps those exact same 3 bytes and the output is as expected.
It is only 3 bytes (a single code point), but the troublesome example is 6 bytes: two code points of 3 bytes each.
+ 1
It should be possible to get what you want but I haven't found a way and I'm running low on energy to continue looking for a solution to that.
If you had a way to detect those character combinations, you should be able to keep them grouped together so they're not swapped like the rest. This way, you keep the original byte sequence for the pair and it won't render badly. You could make a special case with your test cases but I didn't find a more elegant and robust solution.
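Here is a rough sketch of that grouping idea, using only the standard library. It rests on two assumptions of mine: a combining mark (`unicode.IsMark`) always stays with the character before it, and a consonant that follows a Devanagari or Bengali virama stays with it too. `ReverseClusters` and `isVirama` are names I made up. This is a simplification, not the full grapheme cluster rules from Unicode (UAX #29), so other scripts and emoji sequences may still break.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isVirama reports whether r is one of the two viramas this sketch knows
// about. Other scripts have their own; they are not handled here.
func isVirama(r rune) bool {
	return r == 0x094D || r == 0x09CD // Devanagari, Bengali
}

// ReverseClusters (hypothetical helper) reverses s while keeping each base
// character together with its following marks and virama-joined consonants.
func ReverseClusters(s string) string {
	var clusters []string
	var prev rune
	start := 0
	for i, r := range s {
		// Start a new cluster unless r attaches to what came before.
		if i > 0 && !unicode.IsMark(r) && !isVirama(prev) {
			clusters = append(clusters, s[start:i])
			start = i
		}
		prev = r
	}
	if start < len(s) {
		clusters = append(clusters, s[start:])
	}
	// Emit the clusters back to front; bytes inside a cluster are untouched.
	var b strings.Builder
	for i := len(clusters) - 1; i >= 0; i-- {
		b.WriteString(clusters[i])
	}
	return b.String()
}

func main() {
	for _, test := range []string{"Neuquén", "ওহে বিশ্ব", "नमस्ते दुनिया", "Hello World"} {
		fmt.Println(ReverseClusters(test))
	}
}
```

Since each cluster keeps its original byte sequence, the vowel signs should no longer be separated from their consonants, and the two Bengali and Hindi test cases should come out as the "expected" strings in the comments of the test list.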