I've been thinking about Unicode support lately (differencing support may be coming to MiniVHD...), and wanted to discuss the best way to make PCem Unicode-aware. My thoughts on the matter are as follows:
- Use null-terminated UTF-8 internally. UTF-8 has the advantage that it is compatible with standard C strings (char*), is backwards compatible with ASCII, and the byte-oriented string functions (strlen(), strcpy(), strcmp() and friends) still work. The only "downside" is that one can no longer assume 1 byte = 1 character (strlen() returns bytes, not characters), but I don't think PCem is doing much, if any, character counting anyway, so that seems irrelevant. Using UTF-16/UTF-32 or similar would require wide character support throughout, which would be a PITA to implement!
- When getting text from outside PCem (e.g. file paths), the char buffer probably needs to be sized at 4 bytes per character, plus 1 byte for the null terminator, since a single UTF-8 character can take up to 4 bytes. That ensures we don't overflow our buffers!
- Explicitly convert C strings to/from wxStrings with UTF-8 encoding (e.g. wxString::FromUTF8() and utf8_str()), as wxStrings may use different encodings on different platforms.
- Create a wrapper function for fopen(). On Linux/OSX, call fopen() directly using the UTF-8 string. On Windows, convert the UTF-8 path (and the mode string) to UTF-16 wchar_t* strings, and open with _wfopen().
- Make sure that the log and config files are read and written as UTF-8.
- Use the platform-defined maximum path constants (e.g. PATH_MAX on Linux/OSX, MAX_PATH on Windows) when dealing with path lengths.
- Continue using ASCII only for string literals in the source code.
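
To make the buffer sizing and fopen() wrapper points a bit more concrete, here is a rough sketch in C. The names `UTF8_BUF_SIZE`, `utf8_codepoints()` and `pcem_fopen_utf8()` are made up for illustration and are not existing PCem functions. It assumes the input strings are valid UTF-8; on Windows it uses `MultiByteToWideChar()` with `CP_UTF8` for the conversion.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: worst-case buffer size in bytes for a string of
 * nchars UTF-8 characters (up to 4 bytes each) plus the null terminator. */
#define UTF8_BUF_SIZE(nchars) ((size_t)(nchars) * 4 + 1)

/* Hypothetical helper: count characters (code points) in a UTF-8 string
 * by skipping continuation bytes, which all look like 10xxxxxx. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Hypothetical fopen() wrapper taking a UTF-8 path and mode. */
FILE *pcem_fopen_utf8(const char *path, const char *mode)
{
#ifdef _WIN32
    FILE *f = NULL;
    wchar_t *wpath = NULL, *wmode = NULL;

    /* First pass: ask how many wchar_t units are needed (incl. terminator). */
    int plen = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, path, -1, NULL, 0);
    int mlen = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, mode, -1, NULL, 0);
    if (plen <= 0 || mlen <= 0)
        return NULL; /* invalid UTF-8 or conversion failure */

    wpath = malloc((size_t)plen * sizeof(wchar_t));
    wmode = malloc((size_t)mlen * sizeof(wchar_t));

    /* Second pass: do the actual UTF-8 -> UTF-16 conversion, then open. */
    if (wpath && wmode &&
        MultiByteToWideChar(CP_UTF8, 0, path, -1, wpath, plen) > 0 &&
        MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, mlen) > 0)
        f = _wfopen(wpath, wmode);

    free(wpath);
    free(wmode);
    return f;
#else
    /* Linux/OSX: the C library takes the bytes as-is. */
    return fopen(path, mode);
#endif
}
```

Callers would then just swap `fopen(fn, "rb")` for `pcem_fopen_utf8(fn, "rb")` and never see a wchar_t. Allocating the wide strings on the heap avoids having to guess a fixed buffer size for the converted path.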