Let's discuss unicode

Discussion of development and patch submission.
shermanp
Posts: 175
Joined: Sat 18 Feb, 2017 2:09 am

Let's discuss unicode

Post by shermanp »

Hi folks,

I've been thinking about Unicode support lately (differencing support may be coming to MiniVHD...), and wanted to have a discussion about the best way to make PCem Unicode aware. My thoughts on the matter are as follows:
  • Use null-terminated UTF-8 internally. UTF-8 has the advantage that it is compatible with standard C strings (char*), is backwards compatible with ASCII, and the standard byte-oriented string functions still work. The only "downside" is that one can no longer assume 1 byte = 1 character, but I don't think PCem is doing much, if any, character counting anyway, so that seems irrelevant. Using UTF-16/UTF-32 or similar would require the use of wide character support, which would be a PITA to implement!
  • When getting text from outside PCem (e.g. file paths), the char buffer probably needs to be sized at 4 bytes per character, plus one for the null terminator, since a UTF-8 character can take up to 4 bytes. That ensures we don't overflow our buffers!
  • Explicitly convert C strings to/from wxStrings with UTF-8 encoding, as wxStrings may use different encodings on different platforms.
  • Create a wrapper function for fopen(). On Linux/OSX, call fopen() directly using the UTF-8 string. On Windows, convert the UTF-8 string to a compatible wchar_t* string, and open with _wfopen().
  • Make sure that the log and config files can read/write as UTF-8.
  • Use the platform defined maximum path constants when dealing with path lengths.
  • Continue using ASCII only for string literals in the source code.
Any thoughts on the matter? Have I missed anything? Disagreements? Alternatives? Discuss below.
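
To make the fopen() wrapper idea a bit more concrete, here is a rough sketch of what it might look like. The names pcem_fopen_utf8() and utf8_to_wide() are made up for illustration and are not existing PCem functions. The Windows branch assumes MultiByteToWideChar() with CP_UTF8 is an acceptable way to do the conversion, and that the mode string is converted the same way as the path.

```c
#include <stdio.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>

/* Hypothetical helper: convert a null-terminated UTF-8 string to a newly
   allocated wide string. Returns NULL on failure; the caller frees it. */
static wchar_t *utf8_to_wide(const char *s)
{
    /* With a source length of -1, the returned count includes the terminator. */
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, NULL, 0);
    wchar_t *w;

    if (n <= 0)
        return NULL;
    w = malloc((size_t)n * sizeof(wchar_t));
    if (w != NULL &&
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, w, n) <= 0) {
        free(w);
        w = NULL;
    }
    return w;
}
#endif

/* Hypothetical wrapper: open a file whose path is given as UTF-8. */
FILE *pcem_fopen_utf8(const char *path, const char *mode)
{
#ifdef _WIN32
    /* Windows: convert both arguments and use the wide-character API. */
    wchar_t *wpath = utf8_to_wide(path);
    wchar_t *wmode = utf8_to_wide(mode);
    FILE *f = NULL;

    if (wpath != NULL && wmode != NULL)
        f = _wfopen(wpath, wmode);
    free(wpath);
    free(wmode);
    return f;
#else
    /* Linux/OSX: the UTF-8 bytes are passed straight through to fopen(). */
    return fopen(path, mode);
#endif
}
```

Callers would then use pcem_fopen_utf8(path, "rb") wherever fopen(path, "rb") is used today, and the rest of the code keeps working with plain char* strings.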
darksabre76
Posts: 69
Joined: Tue 12 Sep, 2017 4:33 am
Location: Seattle, WA, USA

Re: Let's discuss unicode

Post by darksabre76 »

Would configuration files all have to be rewritten into UTF-8? It's been a while since I've seen a UTF-8 migration of any sort.

Otherwise, it sounds good that the project would be a little more futureproofed, especially if any translation work comes down the line.
shermanp
Posts: 175
Joined: Sat 18 Feb, 2017 2:09 am

Re: Let's discuss unicode

Post by shermanp »

The config files would need to be UTF-8... however, valid ASCII is also valid UTF-8, so the config files won't have to be rewritten. Especially if we don't set the (entirely optional and unnecessary) UTF-8 BOM.
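
If we wanted to be tolerant of config files that some editor has saved with a BOM anyway, the reader could skip it. A minimal sketch, assuming the stream is seekable and positioned at the start of the file; skip_utf8_bom() is a made-up name, not an existing PCem function:

```c
#include <stdio.h>

/* Hypothetical helper: skip a UTF-8 BOM (bytes EF BB BF) at the start of a
   stream. Returns 1 if a BOM was skipped, 0 otherwise. */
int skip_utf8_bom(FILE *f)
{
    unsigned char b[3];
    size_t n = fread(b, 1, sizeof b, f);

    if (n == 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return 1;

    /* No BOM (or a very short file): go back to the start so nothing is lost. */
    rewind(f);
    return 0;
}
```

A plain ASCII config file takes the second path and is then read exactly as before.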