Let's discuss unicode

shermanp · Post by **shermanp** » Mon 20 Aug, 2018 1:29 am

Hi folks,

I've been thinking about Unicode support lately (differencing support may be coming to MiniVHD...), and wanted to discuss the best way to make PCem Unicode-aware. My thoughts on the matter are as follows:

- Use null-terminated UTF-8 internally. UTF-8 has the advantage that it is compatible with standard C strings (char*), is backwards compatible with ASCII, and the byte-oriented standard string functions (strlen(), strcpy(), strcmp() and friends) still work. The only "downside" is that one can no longer assume 1 byte = 1 character, but I don't think PCem is doing much, if any, character counting anyway, so that seems irrelevant. Using UTF-16/UTF-32 or similar would require wide character support, which would be a PITA to implement!
- When getting text from outside PCem (e.g. file paths), the char buffer probably needs to be sized at 4 bytes per character (the UTF-8 maximum), plus one byte for the null terminator, to ensure that we don't overflow our buffers!
- Explicitly convert C strings to/from wxStrings with UTF-8 encoding, as wxStrings may use different encodings on different platforms.
- Create a wrapper function for fopen(). On Linux/OSX, call fopen() directly using the UTF-8 string. On Windows, convert the UTF-8 string to a compatible wchar_t* string, and open it with _wfopen().
- Make sure that the log and config files can be read and written as UTF-8.
- Use the platform-defined maximum path constants when dealing with path lengths.
- Continue using ASCII only for string literals in the source code.
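
To make the "1 byte = 1 character" point and the buffer sizing rule concrete, here is a minimal sketch in C. The names UTF8_BUF_SIZE and utf8_codepoints are made up for illustration and are not existing PCem code, and the counting function assumes its input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: worst-case buffer size in bytes for max_chars
 * code points (4 bytes each in UTF-8) plus the null terminator. */
#define UTF8_BUF_SIZE(max_chars) ((size_t)(max_chars) * 4 + 1)

/* Hypothetical helper: count code points in a null-terminated UTF-8
 * string by skipping continuation bytes (bit pattern 10xxxxxx).
 * Assumes the input is valid UTF-8; it does no validation. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For a name such as "café.img", strlen() reports 9 bytes while utf8_codepoints() reports 8 code points, which is why buffers should be sized in bytes rather than characters.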
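
The fopen() wrapper idea could look roughly like the sketch below. The names pcem_fopen_utf8 and utf8_to_wide are hypothetical, not existing PCem functions. The Windows branch assumes MultiByteToWideChar() with CP_UTF8 is an acceptable way to do the conversion, and it converts the mode string as well because _wfopen() takes a wide mode.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>

/* Convert a null-terminated UTF-8 string to a newly allocated wide
 * (UTF-16) string. Returns NULL on invalid input or allocation failure.
 * The caller frees the result. */
static wchar_t *utf8_to_wide(const char *s)
{
    /* With a source length of -1, the returned count includes the NUL. */
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, NULL, 0);
    wchar_t *w;

    if (len <= 0)
        return NULL;
    w = malloc((size_t)len * sizeof(wchar_t));
    if (w != NULL &&
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, w, len) <= 0) {
        free(w);
        w = NULL;
    }
    return w;
}
#endif

/* Hypothetical wrapper: open a file whose path and mode are UTF-8. */
FILE *pcem_fopen_utf8(const char *path, const char *mode)
{
    assert(path != NULL && mode != NULL);
#ifdef _WIN32
    {
        FILE *f = NULL;
        wchar_t *wpath = utf8_to_wide(path);
        wchar_t *wmode = utf8_to_wide(mode);

        if (wpath != NULL && wmode != NULL)
            f = _wfopen(wpath, wmode);
        free(wpath); /* free(NULL) is a no-op */
        free(wmode);
        return f;
    }
#else
    /* Linux/OSX: the UTF-8 bytes are passed to fopen() unchanged. */
    return fopen(path, mode);
#endif
}
```

Callers would then use pcem_fopen_utf8(path, "rb") wherever fopen(path, "rb") is used today, which keeps the platform difference in one place.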

Any thoughts on the matter? Have I missed anything? Disagreements? Alternatives? Discuss below.

darksabre76 · Post by **darksabre76** » Mon 20 Aug, 2018 3:50 am

Would configuration files all have to be rewritten into UTF-8? It's been a while since I've seen a UTF-8 migration of any sort.

Otherwise, it sounds good that the project would be a little more futureproofed, especially if any translation work comes down the line.

shermanp · Post by **shermanp** » Mon 20 Aug, 2018 4:17 am

The config files would need to be UTF-8... however, valid ASCII is also valid UTF-8, so existing config files won't have to be rewritten, especially if we don't write the (entirely optional and unnecessary) UTF-8 BOM.
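
As a rough illustration of both points, the sketch below checks whether a string is pure ASCII (and therefore already valid UTF-8) and skips a BOM if some editor has added one. The names skip_utf8_bom and is_ascii are hypothetical helpers, not part of PCem's config code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return a pointer just past a leading UTF-8 BOM
 * (bytes EF BB BF) if one is present, otherwise return s unchanged. */
const char *skip_utf8_bom(const char *s)
{
    assert(s != NULL);
    if (strncmp(s, "\xEF\xBB\xBF", 3) == 0)
        return s + 3;
    return s;
}

/* Hypothetical helper: return 1 if every byte is 7-bit ASCII, else 0.
 * Text that passes this check is valid UTF-8 as-is. */
int is_ascii(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if ((unsigned char)*s > 0x7F)
            return 0;
    }
    return 1;
}
```

A config reader could call skip_utf8_bom() on the first line it reads, so files saved with or without a BOM parse the same way.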
