UTF8 - Simple Library for Internationalization

This library is designed to simplify the use of UTF-8 strings under Win32, following the strategy advocated by the UTF-8 Everywhere Manifesto.

While most of the computing world has standardized on UTF-8 encoding, Win32 has remained stuck with wide-character strings (encoded as UTF-16).

Content

The main function groups are:

There are also functions for:

In addition to those, there are wrappings for commonly used C and C++ functions:

Usage

General

Before using this library you might want to review the guidelines from the UTF-8 Everywhere Manifesto. In particular:

  • define UNICODE or _UNICODE in your program
  • for Visual Studio users, make sure the "Use Unicode Character Set" option is selected under the "General" > "Project Defaults" tab.
  • for Visual Studio users, add the /utf-8 option under "C/C++" > "All Options" > "Additional Options".
  • use only std::string and char* variables. Assume they all contain UTF-8 encoded strings.
  • use UTF-16 strings only in arguments to Windows API calls.

All functions and classes in this library are included in the utf8 namespace. It is a good idea not to have a using directive for this namespace. That makes it more evident in the code where UTF8-aware functions are used.

This is an example of a function call:

std::string dirname = "ελληνικό";
utf8::mkdir (dirname); //create a directory with a UTF8-encoded name

Most functions mimic the behavior of the standard C functions they wrap. A notable exception is access(): the standard library function returns 0 if the file can be accessed, while utf8::access() returns a bool value.
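
The difference in return conventions is easy to get wrong when porting code, so here is a minimal, portable sketch of the two styles. It assumes that the bool returned by utf8::access() is true on success. The names c_style_check and bool_style_check are hypothetical stand-ins, not library functions, and they only test whether a file can be opened for reading.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// C convention (as with the standard access()): 0 means success.
// Hypothetical stand-in that only checks whether the file can be opened.
int c_style_check(const char* path)
{
  std::FILE* f = std::fopen(path, "rb");
  if (!f)
    return -1;
  std::fclose(f);
  return 0;
}

// bool convention (assumed to match utf8::access()): true means success.
bool bool_style_check(const std::string& path)
{
  return c_style_check(path.c_str()) == 0;
}

void access_convention_example()
{
  const std::string missing = "no-such-dir/no-such-file.txt";
  assert(c_style_check(missing.c_str()) != 0); // failure is non-zero
  assert(!bool_style_check(missing));          // failure is false
}
```

With the C convention, `if (access(...))` is taken when the check fails; with the bool convention the same test is taken when it succeeds, so conditions have to be inverted when switching from one to the other.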

I/O Streams

Wrappers for C++ I/O streams can be used for file streams with UTF-8 encoded filenames. This is an example of a C++ stream with a weird name and content:

std::string filename = "ελληνικό";
std::string filetext{ "😃😎😛" };
utf8::ofstream u8strm(filename);
u8strm << filetext << std::endl;
u8strm.close ();

Conversion between UTF-8 and UTF-16 applies only to file names. C++ streams are agnostic about their content, so the text itself is not re-encoded.

Calls to Windows API functions can be handled with the generic widening and narrowing functions, as in this example:

std::string filename = "ελληνικό";
HANDLE f = CreateFile (utf8::widen (filename).c_str (), GENERIC_READ, 0,
NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
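
To show what widening involves, here is a minimal, portable sketch of a UTF-8 to UTF-16 conversion. It is not the library's utf8::widen implementation: the name widen_sketch is hypothetical, it returns std::u16string rather than std::wstring so that it behaves the same on platforms where wchar_t is 32 bits wide, and it performs only basic validation.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper, NOT the library's utf8::widen: decodes UTF-8 and
// re-encodes each code point as UTF-16 code units.
// Simplification: overlong forms, encoded surrogates and values above
// U+10FFFF are not rejected.
std::u16string widen_sketch(const std::string& s)
{
  std::u16string out;
  std::size_t i = 0;
  while (i < s.size()) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    char32_t cp = 0;
    std::size_t extra = 0; // number of continuation bytes expected
    if (lead < 0x80)                { cp = lead;        extra = 0; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
    else throw std::invalid_argument("invalid UTF-8 lead byte");

    if (i + extra >= s.size())
      throw std::invalid_argument("truncated UTF-8 sequence");

    for (std::size_t k = 1; k <= extra; ++k) {
      const unsigned char cont = static_cast<unsigned char>(s[i + k]);
      if ((cont & 0xC0) != 0x80)
        throw std::invalid_argument("invalid UTF-8 continuation byte");
      cp = (cp << 6) | (cont & 0x3F);
    }
    i += extra + 1;

    if (cp < 0x10000) {
      out.push_back(static_cast<char16_t>(cp));
    } else {
      cp -= 0x10000; // split into a surrogate pair
      out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
      out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
  }
  return out;
}

void widen_sketch_examples()
{
  assert(widen_sketch("abc") == u"abc");
  // U+03B5 (Greek small epsilon) is CE B5 in UTF-8 and one UTF-16 unit
  assert(widen_sketch("\xCE\xB5") == u"\u03B5");
  // U+1F603 is F0 9F 98 83 in UTF-8 and a surrogate pair in UTF-16
  assert(widen_sketch("\xF0\x9F\x98\x83") == u"\U0001F603");
}
```

Characters outside the Basic Multilingual Plane, such as the emoji used earlier, take four bytes in UTF-8 and two code units in UTF-16, which is why string lengths differ between the two encodings.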

INI File Handling

INI files, also called "profile files" in Microsoft parlance, are still widely used for storing application settings, either for compatibility reasons or because they are simple to work with.

The problem is that the basic Windows API calls for reading and writing INI files, GetPrivateProfileString and WritePrivateProfileString, combine both the file name and the information to be read or written in one API call. As an example, here is the signature of the GetPrivateProfileStringW function:

DWORD GetPrivateProfileStringW(
LPCWSTR lpAppName,
LPCWSTR lpKeyName,
LPCWSTR lpDefault,
LPWSTR lpReturnedString,
DWORD nSize,
LPCWSTR lpFileName
);

Using the utf8::widen function to convert all the UTF-8 strings, the data as well as the file name, would produce an INI file that contains UTF-16 characters.

The solution is to forget about the Windows API functions completely and roll a new implementation for accessing INI files. This is far from the only INI file implementation you can find out there; for a list of implementations you can check the Wikipedia page. This implementation strives to be as compatible as possible with the original Windows API.

The only changes compared to the Windows API are:

  • line length defaults to 1024 (the INI_BUFFER_SIZE value) while Windows limits it to 256 characters
  • files without a path are looked up in the current directory while Windows places them in the Windows folder
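
To illustrate why a custom reader sidesteps the encoding problem, here is a minimal sketch of an INI lookup that treats the file content as plain UTF-8 bytes. It is not the library's implementation: the name ini_get is hypothetical, the content is passed in as a string rather than read from a file, and section and key names are matched case-sensitively, which is a simplification compared with the Windows API.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

namespace {
// Remove leading and trailing whitespace.
std::string trim(const std::string& s)
{
  const char* ws = " \t\r\n";
  const std::size_t b = s.find_first_not_of(ws);
  if (b == std::string::npos)
    return "";
  const std::size_t e = s.find_last_not_of(ws);
  return s.substr(b, e - b + 1);
}
} // namespace

// Hypothetical helper: return the value of `key` in `[section]`, or
// `defval` if not found. All strings are UTF-8 and are never converted.
std::string ini_get(const std::string& text, const std::string& section,
                    const std::string& key, const std::string& defval)
{
  std::istringstream in(text);
  std::string line;
  bool in_section = false;
  while (std::getline(in, line)) {
    line = trim(line);
    if (line.empty() || line[0] == ';')
      continue; // blank line or comment
    if (line.front() == '[' && line.back() == ']') {
      in_section = (trim(line.substr(1, line.size() - 2)) == section);
      continue;
    }
    if (!in_section)
      continue;
    const std::size_t eq = line.find('=');
    if (eq == std::string::npos)
      continue; // not a key=value line
    if (trim(line.substr(0, eq)) == key)
      return trim(line.substr(eq + 1));
  }
  return defval;
}

void ini_get_examples()
{
  const std::string text =
    "; settings\n"
    "[display]\n"
    "language = \xCE\xB5\xCE\xBB\n" // two Greek letters, UTF-8 encoded
    "[other]\n"
    "language=en\n";
  assert(ini_get(text, "display", "language", "?") == "\xCE\xB5\xCE\xBB");
  assert(ini_get(text, "other", "language", "?") == "en");
  assert(ini_get(text, "display", "missing", "?") == "?");
}
```

Because the delimiters ('[', ']', '=' and ';') are all ASCII, the parser can work byte by byte on UTF-8 text without decoding it; only the file name would need widening before it reaches the operating system.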

Case Conversion

Case conversion in Unicode is more complicated than ASCII case conversion. This library uses the standard tables published by the Unicode Consortium to convert between upper case and lower case. There is also a function, utf8::icompare(), that performs case-insensitive string comparison.
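
The table-driven approach can be sketched as follows. This is an illustration only, not the library's code: the names lower_sketch and icompare_sketch are hypothetical, the table is a four-entry excerpt rather than the full Unicode data, and the functions work on already decoded code points (std::u32string) instead of UTF-8 strings.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <utility>

// Tiny illustrative excerpt of an upper-to-lower table, sorted by the
// upper-case code point. The real Unicode tables are far larger.
static const std::pair<char32_t, char32_t> lower_table[] = {
  {0x00C9, 0x00E9}, // Latin capital E with acute
  {0x0391, 0x03B1}, // Greek capital alpha
  {0x0395, 0x03B5}, // Greek capital epsilon
  {0x0416, 0x0436}, // Cyrillic capital zhe
};

// Hypothetical helper: lower-case one code point using the table above.
char32_t lower_sketch(char32_t c)
{
  if (c >= U'A' && c <= U'Z')
    return c + (U'a' - U'A'); // ASCII needs no table
  const auto it = std::lower_bound(
    std::begin(lower_table), std::end(lower_table), c,
    [](const std::pair<char32_t, char32_t>& e, char32_t v) {
      return e.first < v;
    });
  return (it != std::end(lower_table) && it->first == c) ? it->second : c;
}

// Hypothetical helper: compare two strings ignoring case.
// Returns a negative value, zero, or a positive value.
int icompare_sketch(const std::u32string& a, const std::u32string& b)
{
  const std::size_t n = std::min(a.size(), b.size());
  for (std::size_t i = 0; i < n; ++i) {
    const char32_t ca = lower_sketch(a[i]);
    const char32_t cb = lower_sketch(b[i]);
    if (ca != cb)
      return ca < cb ? -1 : 1;
  }
  if (a.size() == b.size())
    return 0;
  return a.size() < b.size() ? -1 : 1;
}

void icompare_sketch_examples()
{
  // Capital epsilon + lambda versus small epsilon + lambda
  assert(icompare_sketch(U"\u0395\u03BB", U"\u03B5\u03BB") == 0);
  assert(icompare_sketch(U"abc", U"ABD") < 0);
  assert(icompare_sketch(U"ab", U"A") > 0);
}
```

A binary search over a sorted table keeps each lookup logarithmic in the table size, which matters once the table holds the complete set of Unicode case mappings.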

Error Handling

Invalid characters or sequences can be handled in two different ways:

The function utf8::error_mode() selects the error handling strategy. The error handling strategy is thread-safe: each thread has its own strategy.
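
A per-thread strategy of this kind can be sketched with a thread_local variable. Everything in this sketch is an assumption made for illustration: the two strategies shown (substituting U+FFFD or throwing an exception), the default, and the names on_error, set_error_mode and on_invalid_sequence are hypothetical and are not the signatures of utf8::error_mode().

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical strategies, assumed for this sketch only.
enum class on_error { replace, raise };

namespace {
// thread_local gives every thread its own copy, so changing the mode in
// one thread does not affect the others and needs no locking.
thread_local on_error current_mode = on_error::replace;
} // namespace

// Select the strategy for the calling thread; returns the previous one.
on_error set_error_mode(on_error mode)
{
  const on_error previous = current_mode;
  current_mode = mode;
  return previous;
}

// What a decoder could call when it meets an invalid sequence.
char32_t on_invalid_sequence()
{
  if (current_mode == on_error::raise)
    throw std::runtime_error("invalid UTF-8 sequence");
  return 0xFFFD; // Unicode replacement character
}

void error_mode_examples()
{
  assert(on_invalid_sequence() == 0xFFFD); // default in this sketch

  const on_error previous = set_error_mode(on_error::raise);
  assert(previous == on_error::replace);

  bool thrown = false;
  try {
    on_invalid_sequence();
  } catch (const std::runtime_error&) {
    thrown = true;
  }
  assert(thrown);

  set_error_mode(on_error::replace); // restore the default
}
```

Returning the previous mode from the setter makes it easy to change the strategy temporarily and restore it afterwards.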

Building

The UTF8 library doesn't have any dependencies. The test program, however, uses the UTPP library.

The preferred method is to use the CPM - C/C++ Package Manager to fetch all dependent packages and build them. Download the CPM program and, from the root of the development tree, issue the cpm command:

cpm -u https://github.com/neacsum/utf8.git utf8

The Visual C++ projects are set to compile under C++17 rules. If you are using C++20 rules, you have to add the /Zc:char8_t- option, because in C++20 u8 string literals have the type char8_t and no longer convert to char strings.

You can build the library using CMake. From the utf8 directory:

cmake -S . -B build
cmake --build build

Alternatively, the BUILD.bat script will build the libraries and test programs.

License

The MIT License (MIT)

Copyright (c) 2014-2023 Mircea Neacsu

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.