Converting UTF-8 Text to C/C++ wide character strings

Anyone browsing the C++ Standard Library for a way to convert UTF-8 text to C wide character text (wchar_t[]) and back will be surprised to find that there is no built-in solution for such a common problem. It seems one has to fall back on the services of the operating system and write non-portable code, e.g. for the Objective-C runtime:

NSString* intrnl = [NSString stringWithContentsOfFile:path
                                             encoding:NSUTF8StringEncoding
                                                error:&e];
// Re-encode as UTF-32, which matches a 4-byte wchar_t (as on Mac OS X).
NSData* asData = [intrnl dataUsingEncoding:NSUTF32StringEncoding];
std::wstring wideTxt((const wchar_t*)[asData bytes],
                     [asData length] / sizeof(wchar_t));

Surprisingly, a web search does not turn up many light-weight and elegant alternatives either. Some, like UtfConverter, require two separate copies of the same text (one in UTF-8, one as wide characters); others, like utf8::ostream, work in one direction only; and still others, like Poco::UnicodeConverter, stop at UTF-16, which is the wide character encoding on Windows.
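To make concrete what such a conversion involves, here is a minimal hand-written sketch of both directions. It is not taken from any of the libraries mentioned above; the function names utf8_to_wide and wide_to_utf8 are made up for illustration. It assumes a wchar_t wide enough to hold a full code point (32 bits, as on Linux and Mac OS X), and it only checks the lead/continuation byte structure, so overlong forms and surrogates are not rejected.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into wide characters.
// Assumption: wchar_t can hold any code point (32-bit wchar_t).
std::wstring utf8_to_wide(const std::string& in)
{
    std::wstring out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i++]);
        unsigned long cp;   // code point being assembled
        int extra;          // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        for (; extra > 0; --extra) {
            if (i >= in.size())
                throw std::runtime_error("truncated UTF-8 sequence");
            unsigned char c = static_cast<unsigned char>(in[i++]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);   // append the 6 payload bits
        }
        out.push_back(static_cast<wchar_t>(cp));
    }
    return out;
}

// Hypothetical helper: encode wide characters as UTF-8 bytes.
// Each wchar_t is treated as one code point (no UTF-16 surrogate handling).
std::string wide_to_utf8(const std::wstring& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(in[i]);
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp <= 0x10FFFF) {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            throw std::runtime_error("code point out of range");
        }
    }
    return out;
}
```

Note that this variant also holds the text twice in memory (once as bytes, once as wide characters), which is exactly the drawback a stream-based conversion avoids.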

With the Boost C++ libraries, however, UTF-8 conversion takes just a single line of code added after creating your input or output streams, and you get to rely on a thoroughly tested code base.

Regrettably, the conversion code is not header-only and thus requires at least one Boost library (e.g. serialization) to be built on your platform. If you think that is overkill, you can just grab the libs/detail/utf8_codecvt_facet.cpp file and add it to the compiled sources of your target. This is what this post is about.

I found that with Boost 1.44 the original Hello World example published by Paul Dixon did not work as advertised: the linker kept complaining about a missing vtable for the utf8_codecvt_facet object, which turned out to be caused by a missing definition of one of its methods. My guess is that this in turn was the effect of non-matching namespaces in the header and implementation files, since the error went away when I removed all namespace macros from the original files as follows:

main.cpp:

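#include <iostream>
#include <fstream>
#include <locale>
#include <string>

// Define LINK_BOOST_SERIALIZATION_LIB to use the facet from the built
// Boost serialization library; otherwise the local copy is compiled in.
#if defined LINK_BOOST_SERIALIZATION_LIB
#include <boost/archive/detail/utf8_codecvt_facet.hpp>
#else
#include "utf8_codecvt_facet.hpp"
#endif
using namespace std;
int main (int argc, char * const argv[])
{
   wifstream inFile("utf8.txt");
   // Imbue the UTF-8 facet before the first read, so that the stream
   // decodes the file's bytes into wide characters.
#if defined LINK_BOOST_SERIALIZATION_LIB
   inFile.imbue(std::locale(std::locale(), new boost::archive::detail::utf8_codecvt_facet));
#else
   inFile.imbue(std::locale(std::locale(), new boost::utf8_codecvt_facet));
#endif
   wstring wideString;
   inFile >> wideString;
   // Use wcout throughout; mixing cout and wcout on the same stream is not reliable.
   wcout << L"wideString.length() = " << wideString.length() << endl;
   wstring line;
   while (getline(inFile, line)) {
      wcout << line << endl;
   }
   return 0;
}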
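To try the program, a utf8.txt with some non-ASCII content has to exist in the working directory. The following sketch writes one. The helper name write_sample_utf8_file and the sample text are my own assumptions, not part of the original example; the text is spelled out as raw bytes so that the result does not depend on the encoding of the source file.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: writes a small UTF-8 encoded sample file.
// Returns true if the file was written successfully.
bool write_sample_utf8_file(const std::string& path)
{
    // "Grüße Welt" and "Preis: 5 €" as raw UTF-8 bytes.
    // The literal is split after \x9F so that the following 'e'
    // is not parsed as part of the hex escape.
    static const char sample[] =
        "Gr\xC3\xBC\xC3\x9F" "e Welt\n"
        "Preis: 5 \xE2\x82\xAC\n";

    // Binary mode keeps the bytes exactly as given on every platform.
    std::ofstream out(path.c_str(), std::ios::binary);
    if (!out)
        return false;
    out.write(sample, sizeof(sample) - 1);   // exclude the trailing '\0'
    out.close();
    return !out.fail();
}
```

After calling write_sample_utf8_file("utf8.txt") once, the first token read by main.cpp is "Grüße", so the reported length should be 5 wide characters if the facet is active, whereas the same word occupies 7 bytes in the file.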

utf8_codecvt_facet.hpp:

utf8_codecvt_facet.cpp:
