Sane C++ Libraries
C++ Platform Abstraction Libraries
Strings

🟩 String formatting / conversion / manipulation (ASCII / UTF8 / UTF16)

The Strings library provides read-only and write string operations and UTF conversions.

Features

Class Description
SC::String A non-modifiable owning string with associated encoding.
SC::StringBuilder Builds String out of a sequence of StringView or formatting through StringFormat.
SC::StringConverter Converts String to a different encoding (UTF8, UTF16).
SC::StringIterator A position inside a fixed range [start, end) of UTF code points.
SC::StringIteratorASCII A string iterator for ASCII strings.
SC::StringIteratorUTF8 A string iterator for UTF8 strings.
SC::StringIteratorUTF16 A string iterator for UTF16 strings.
SC::StringView Non-owning view over a range of characters with UTF Encoding.
SC::StringAlgorithms Algorithms operating on strings (glob / wildcard).
SC::StringViewTokenizer Splits a StringView in tokens according to separators.
SC::StringFormat Formats String with a simple DSL embedded in the format string.
SC::Console Writes to console using SC::StringFormat.

Status

🟩 Usable
The library is usable and can be used to mix operations on strings that have different encodings.

Definition

StringView

Non-owning view over a range of characters with UTF Encoding. It additionally holds the SC::StringEncoding information (ASCII, UTF8 or UTF16). During construction the encoding information and the null-termination state must be specified. All methods are const because it's not possible to modify a string with it.
Example (Construct)

StringView s("asd");
SC_ASSERT_RELEASE(s.sizeInBytes() == 3);
SC_ASSERT_RELEASE(s.isNullTerminated());

Example (Construct from null terminated string)

const char* someString = "asdf";
// construct only "asd", not null terminated (as there is 'f' after 'd')
StringView s({someString, strlen(someString) - 1}, false, StringEncoding::Ascii);
SC_ASSERT_RELEASE(s.sizeInBytes() == 3);
SC_ASSERT_RELEASE(not s.isNullTerminated());
//
// ... or
StringView s2 = StringView::fromNullTerminated(someString, StringEncoding::Ascii); // s2 == "asdf"
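
As a rough illustration of the state described above, the following standalone sketch models a view as a pointer, a size in bytes, an encoding and a null-termination flag. `Encoding`, `MiniStringView`, `viewOfCString` and `viewOfPrefix` are hypothetical names invented for this example; they are not part of the SC library and say nothing about how SC::StringView is actually laid out.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for SC::StringEncoding.
enum class Encoding { Ascii, Utf8, Utf16 };

// Hypothetical non-owning view: it only points at memory owned by someone else.
struct MiniStringView
{
    const char* data;           // first byte of the range
    std::size_t sizeInBytes;    // length, terminator excluded
    bool        nullTerminated; // true if a terminator follows the range
    Encoding    encoding;       // how to interpret the bytes
};

// View over a whole C string: a terminator follows, but is not counted in the size.
inline MiniStringView viewOfCString(const char* text, Encoding encoding)
{
    return {text, std::strlen(text), true, encoding};
}

// View over the first `bytes` bytes: the caller declares that no terminator follows.
inline MiniStringView viewOfPrefix(const char* text, std::size_t bytes, Encoding encoding)
{
    return {text, bytes, false, encoding};
}
```

With these helpers, `viewOfPrefix("asdf", 3, Encoding::Ascii)` describes "asd" without a terminator, mirroring the second construction example.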

StringView::containsString

Check if StringView contains another StringView with compatible encoding.

Parameters
str: The other StringView to check with the current one
Returns
Returns true if this StringView contains str
Warning
This method will assert if the strings have incompatible encodings. Compatibility can be checked beforehand with StringView::hasCompatibleEncoding(str) == true.

Example:

StringView asd = "123 456";
SC_TRY(asd.containsString("123"));
SC_TRY(asd.containsString("456"));
SC_TRY(not asd.containsString("124"));
SC_TRY(not asd.containsString("4567"));

StringView::compare

Ordering comparison between non-normalized StringView (operates on code points, not on utf graphemes)

Parameters
other: The string being compared to the current one
Returns
Result of the comparison (smaller, equals or bigger)

Example:

// àèìòù (1 UTF16-LE sequence, 2 UTF8 sequence)
SC_ASSERT_RELEASE("\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9"_u8.compare(
"\xe0\x0\xe8\x0\xec\x0\xf2\x0\xf9\x0"_u16) == StringView::Comparison::Equals);
// 日本語語語 (1 UTF16-LE sequence, 3 UTF8 sequence)
StringView stringUtf8 = StringView("\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e\xe8\xaa\x9e\xe8\xaa\x9e"_u8);
StringView stringUtf16 = StringView("\xE5\x65\x2C\x67\x9E\x8a\x9E\x8a\x9E\x8a\x00"_u16); // LE
// Comparisons are on code points NOT grapheme clusters!!
SC_ASSERT_RELEASE(stringUtf8.compare(stringUtf16) == StringView::Comparison::Equals);
SC_ASSERT_RELEASE(stringUtf16.compare(stringUtf8) == StringView::Comparison::Equals);
SC_ASSERT_RELEASE(stringUtf8 == stringUtf16);
SC_ASSERT_RELEASE(stringUtf16 == stringUtf8);
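
To show what comparing on code points means across encodings, here is a standalone sketch that decodes UTF-8 and UTF-16LE bytes into code points and compares the resulting sequences. The functions `decodeUtf8`, `decodeUtf16LE` and `compareCodePoints` are hypothetical helpers written for this example; they assume well-formed input and are not how SC::StringView implements `compare`.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decodes well-formed UTF-8 bytes into code points (sketch, no validation).
inline std::vector<char32_t> decodeUtf8(const std::string& bytes)
{
    std::vector<char32_t> out;
    std::size_t           i = 0;
    while (i < bytes.size())
    {
        const unsigned char lead  = static_cast<unsigned char>(bytes[i]);
        std::size_t         extra = 0;    // number of continuation bytes
        char32_t            cp    = lead; // single byte (ASCII) case
        if (lead >= 0xF0)      { extra = 3; cp = lead & 0x07; }
        else if (lead >= 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { extra = 1; cp = lead & 0x1F; }
        for (std::size_t k = 1; k <= extra && i + k < bytes.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Reads one little-endian 16 bit code unit starting at byte index i.
inline char32_t readUnitLE(const std::string& bytes, std::size_t i)
{
    return static_cast<char32_t>(static_cast<unsigned char>(bytes[i]) |
                                 (static_cast<unsigned char>(bytes[i + 1]) << 8));
}

// Decodes well-formed UTF-16LE bytes into code points, joining surrogate pairs.
inline std::vector<char32_t> decodeUtf16LE(const std::string& bytes)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2)
    {
        char32_t unit = readUnitLE(bytes, i);
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 3 < bytes.size())
        {
            const char32_t low = readUnitLE(bytes, i + 2);
            unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
            i += 2; // the low surrogate has been consumed too
        }
        out.push_back(unit);
    }
    return out;
}

// Lexicographic ordering on code points: -1 (smaller), 0 (equal), 1 (bigger).
inline int compareCodePoints(const std::vector<char32_t>& a, const std::vector<char32_t>& b)
{
    if (a < b) return -1;
    if (b < a) return 1;
    return 0;
}
```

Because the comparison runs on decoded code points, two strings that look identical on screen but use different normalization forms still compare as different.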

StringView::fullyOverlaps

Check if this StringView is equal to other StringView (operates on code points, not on utf graphemes). Returns the number of code points that are the same in both StringView-s.

Parameters
other: The StringView to be compared to
commonOverlappingPoints: Number of equal code points in both StringViews
Returns
true if the two StringViews are equal

Example:

StringView asd = "123 456"_a8;
size_t overlapPoints = 0;
SC_TEST_EXPECT(not asd.fullyOverlaps("123___", overlapPoints) and overlapPoints == 3);
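
The example above is consistent with counting the length of the common prefix. A minimal standalone sketch of that idea, assuming ASCII input (one char per code point) and using the hypothetical name `fullyOverlapsAscii`, could look like this; it is not the library's implementation.

```cpp
#include <cstddef>
#include <string_view>

// Returns true when both views are identical. `common` receives the number of
// leading characters shared by the two views (ASCII only in this sketch).
inline bool fullyOverlapsAscii(std::string_view a, std::string_view b, std::size_t& common)
{
    common = 0;
    while (common < a.size() && common < b.size() && a[common] == b[common])
        ++common;
    return common == a.size() && common == b.size();
}
```

Reporting the count even on failure lets a caller locate the first position where the two strings diverge.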

StringView::startsWithAnyOf

Check if StringView starts with any utf code point in the given span.

Parameters
codePoints: The utf code points to check against
Returns
Returns true if this StringView starts with any code point inside codePoints

Example:

SC_TEST_EXPECT(StringView("123 456").startsWithAnyOf({'1', '8'})); // '1' will match

StringView::endsWithAnyOf

Check if StringView ends with any utf code point in the given span.

Parameters
codePoints: The utf code points to check against
Returns
Returns true if this StringView ends with any code point inside codePoints

Example:

SC_TEST_EXPECT(StringView("123 456").endsWithAnyOf({'a', '6'})); // '6' will match

StringView::startsWith

Check if StringView starts with another StringView.

Parameters
str: The other StringView to check with the current one
Returns
Returns true if this StringView starts with str

Example:

SC_TEST_EXPECT(StringView("123 456").startsWith("123"));

StringView::endsWith

Check if StringView ends with another StringView.

Parameters
str: The other StringView to check with the current one
Returns
Returns true if this StringView ends with str

Example:

SC_TEST_EXPECT(StringView("123 456").endsWith("456"));

StringView::containsCodePoint

Check if StringView contains given utf code point.

Parameters
c: The utf code point to check against
Returns
Returns true if this StringView contains code point c

StringView::sliceStartEnd

Get slice [start, end) starting at offset start and ending at end (measured in utf code points)

Parameters
start: The initial code point where the slice starts
end: One after the final code point where the slice ends
Returns
The [start, end) StringView slice

Example:

StringView str = "123_567";
SC_TEST_EXPECT(str.sliceStartEnd(0, 3) == "123");
SC_TEST_EXPECT(str.sliceStartEnd(4, 7) == "567");
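
Since offsets are measured in code points rather than bytes, slicing a UTF-8 string requires walking the bytes. The sketch below shows one way to do this on a `std::string_view`; `byteOffsetOfCodePoint` and `sliceStartEndUtf8` are hypothetical names, the input is assumed to be well-formed UTF-8, and out-of-range indices are simply clamped (how the library reacts to them is not stated here).

```cpp
#include <cstddef>
#include <string_view>

// Byte offset of the code point with the given index in well-formed UTF-8,
// or text.size() when the index is past the end.
inline std::size_t byteOffsetOfCodePoint(std::string_view text, std::size_t index)
{
    std::size_t offset = 0;
    while (offset < text.size() && index > 0)
    {
        ++offset; // skip the lead byte
        // skip continuation bytes, which all look like 10xxxxxx
        while (offset < text.size() && (static_cast<unsigned char>(text[offset]) & 0xC0) == 0x80)
            ++offset;
        --index;
    }
    return offset;
}

// Half-open slice [start, end) measured in code points.
inline std::string_view sliceStartEndUtf8(std::string_view text, std::size_t start, std::size_t end)
{
    if (end < start)
        return {};
    const std::size_t first = byteOffsetOfCodePoint(text, start);
    const std::size_t last  = byteOffsetOfCodePoint(text, end);
    return text.substr(first, last - first);
}
```

For ASCII text the code point index and the byte offset coincide, which is why an ASCII encoding can slice in constant time.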

StringView::sliceStartLength

Get slice [start, start+length) starting at offset start and spanning length code points.

Parameters
start: The initial code point where the slice starts
length: The number of code points in the slice
Returns
The [start, start+length) StringView slice

Example:

StringView str = "123_567";
SC_TEST_EXPECT(str.sliceStartLength(7, 0) == "");
SC_TEST_EXPECT(str.sliceStartLength(0, 3) == "123");

StringView::sliceStart

Get slice [offset, end) measured in utf code points.

Parameters
offset: The initial code point where the slice starts
Returns
The sliced StringView [offset, end)

Example:

StringView str = "123_567";
SC_TEST_EXPECT(str.sliceStart(4) == "567");

StringView::sliceEnd

Get slice [start, end-offset) measured in utf code points, that is, the view with its last offset code points removed.

Parameters
offset: The number of code points to drop from the end of the view
Returns
The sliced StringView [start, end-offset)

Example:

StringView str = "123_567";
SC_TEST_EXPECT(str.sliceEnd(4) == "123");

StringView::trimEndAnyOf

Returns a shortened StringView removing ending utf code points matching the codePoints span.

Parameters
codePoints: The span of utf code points to look for
Returns
The trimmed StringView

Example:

SC_TEST_EXPECT("myTest_\n__"_a8.trimEndAnyOf({'_', '\n'}) == "myTest");
SC_TEST_EXPECT("myTest"_a8.trimEndAnyOf({'_'}) == "myTest");

StringView::trimStartAnyOf

Returns a shortened StringView removing starting utf code points matching the codePoints span.

Parameters
codePoints: The span of utf code points to look for
Returns
The trimmed StringView

Example:

SC_TEST_EXPECT("__\n_myTest"_a8.trimStartAnyOf({'_', '\n'}) == "myTest");
SC_TEST_EXPECT("_myTest"_a8.trimStartAnyOf({'_'}) == "myTest");
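
Trimming can be pictured as shrinking the view from one side while the boundary character belongs to the given set. The following standalone sketch does this for ASCII text on `std::string_view`; `isAnyOf`, `trimEndAnyOfAscii` and `trimStartAnyOfAscii` are hypothetical names and only approximate the documented behavior.

```cpp
#include <initializer_list>
#include <string_view>

// True if c is one of the characters in the set.
inline bool isAnyOf(char c, std::initializer_list<char> set)
{
    for (char candidate : set)
        if (candidate == c)
            return true;
    return false;
}

// Drops trailing characters while they belong to the set.
inline std::string_view trimEndAnyOfAscii(std::string_view text, std::initializer_list<char> set)
{
    while (!text.empty() && isAnyOf(text.back(), set))
        text.remove_suffix(1);
    return text;
}

// Drops leading characters while they belong to the set.
inline std::string_view trimStartAnyOfAscii(std::string_view text, std::initializer_list<char> set)
{
    while (!text.empty() && isAnyOf(text.front(), set))
        text.remove_prefix(1);
    return text;
}
```

No bytes are copied: the result is just a narrower view over the same memory, so it stays valid only as long as the original buffer does.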

StringViewTokenizer

Splits a StringView in tokens according to separators.

StringViewTokenizer::tokenizeNext

Splits the string along a list of separators.

Parameters
separators: List of separators
options: Whether to skip empty tokens or not
Returns
true if there are additional tokens to parse

Example:

StringViewTokenizer tokenizer("bring,me,the,horizon");
while (tokenizer.tokenizeNext(',', StringViewTokenizer::SkipEmpty))
{
    console.printLine(tokenizer.component);
}
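
The loop above relies on the tokenizer keeping track of the unconsumed part of the input and exposing the last token as `component`. A minimal standalone sketch of that pattern, using a single `char` separator and the hypothetical type `MiniTokenizer` (not the SC class), might be:

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical tokenizer over ASCII text with a single separator character.
struct MiniTokenizer
{
    std::string_view remaining; // part of the input not yet consumed
    std::string_view component; // last token produced by tokenizeNext

    explicit MiniTokenizer(std::string_view text) : remaining(text) {}

    // Extracts the next token into `component`. When skipEmpty is true,
    // empty tokens (two adjacent separators) are skipped.
    bool tokenizeNext(char separator, bool skipEmpty)
    {
        while (!remaining.empty())
        {
            const std::size_t pos = remaining.find(separator);
            component = remaining.substr(0, pos); // npos means "up to the end"
            remaining = pos == std::string_view::npos ? std::string_view() : remaining.substr(pos + 1);
            if (!(skipEmpty && component.empty()))
                return true;
        }
        return false;
    }
};
```

In this sketch an empty token after a trailing separator is never reported; whether the library reports it is not something the text above specifies.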

StringViewTokenizer::countTokens

Count the number of tokens that exist in the string view passed in the constructor, when split along the given separators.

Parameters
separators: Separators to split the original string with
Returns
Current StringViewTokenizer to inspect SC::StringViewTokenizer::numSplitsNonEmpty or SC::StringViewTokenizer::numSplitsTotal.
Example:
SC_TEST_EXPECT(StringViewTokenizer("___").countTokens('_').numSplitsNonEmpty == 0);
SC_TEST_EXPECT(StringViewTokenizer("___").countTokens('_').numSplitsTotal == 3);

StringBuilder

Builds String out of a sequence of StringView or formatting through StringFormat. The output can be a SC::Vector (or a SC::SmallVector, see Containers)

StringBuilder::format

Uses StringFormat to format the given StringView against args, replacing destination contents.

Template Parameters
Types: Type of Args
Parameters
fmt: The format string
args: Arguments to format
Returns
true if format succeeded
Example:
String buffer(StringEncoding::Ascii); // Or SmallString<N>
StringBuilder builder(buffer);
SC_TRY(builder.format("[{1}-{0}]", "Storia", "Bella"));
SC_ASSERT_RELEASE(builder.view() == "[Bella-Storia]");

StringBuilder::append

Uses StringFormat to format the given StringView against args, appending to destination contents.

Template Parameters
Types: Type of Args
Parameters
fmt: The format string
args: Arguments to format
Returns
true if format succeeded
Example:
String buffer(StringEncoding::Ascii); // Or SmallString<N>
StringBuilder builder(buffer);
SC_TRY(builder.append("Salve"));
SC_TRY(builder.append(" {1} {0}!!!", "tutti", "a"));
SC_ASSERT_RELEASE(builder.view() == "Salve a tutti!!!");

StringBuilder::appendReplaceAll

Appends source to the destination buffer, replacing every occurrence of the occurrencesOf StringView with the with StringView.

Parameters
source: The StringView to be appended
occurrencesOf: The StringView to be searched inside source
with: The replacement StringView to be written in destination buffer
Returns
true if append succeeded

Example:

String buffer(StringEncoding::Ascii);
StringBuilder builder(buffer);
SC_TEST_EXPECT(builder.appendReplaceAll("123 456 123 10", "123", "1234"));
SC_TEST_EXPECT(buffer == "1234 456 1234 10");
buffer = String();
SC_TEST_EXPECT(builder.appendReplaceAll("088123", "123", "1"));
SC_TEST_EXPECT(buffer == "0881");
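
The search and replace loop behind such an operation can be sketched with standard types. `appendReplaceAllSketch` below is a hypothetical free function working on `std::string`, written only to illustrate the technique; rejecting an empty search pattern is a choice made for this example.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Appends `source` to `destination`, replacing each occurrence of
// `occurrencesOf` with `with`. Returns false for an empty search pattern.
inline bool appendReplaceAllSketch(std::string& destination, std::string_view source,
                                   std::string_view occurrencesOf, std::string_view with)
{
    if (occurrencesOf.empty())
        return false; // an empty pattern would match at every position
    std::size_t cursor = 0;
    for (;;)
    {
        const std::size_t found = source.find(occurrencesOf, cursor);
        if (found == std::string_view::npos)
            break;
        destination.append(source.substr(cursor, found - cursor)); // text before the match
        destination.append(with);                                  // the replacement
        cursor = found + occurrencesOf.size();                     // continue after the match
    }
    destination.append(source.substr(cursor)); // remaining tail
    return true;
}
```

Copying the unmatched stretches in bulk keeps the work proportional to the source length plus the inserted replacements.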

StringBuilder::appendReplaceMultiple

Appends source to the destination buffer, applying multiple substitution pairs.

Parameters
source: The StringView to be appended
substitutions: For each substitution in the span, the first element is searched for and replaced with the second.
Returns
true if append succeeded

Example:

String buffer(StringEncoding::Utf8);
StringBuilder sb(buffer);
SC_TEST_EXPECT(sb.appendReplaceMultiple("asd\\salve\\bas"_u8, {{"asd", "un"}, {"bas", "a_tutti"}, {"\\", "/"}}));
SC_TEST_EXPECT(buffer == "un/salve/a_tutti");
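
One plausible way to apply several substitutions is a single left-to-right pass that tries every pair at each position. The sketch below takes that approach with standard types; `Substitution` and `appendReplaceMultipleSketch` are hypothetical names, and whether the library uses a single pass or applies the pairs one after another is an assumption not confirmed by the text above.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical pair: first is the text to search for, second its replacement.
using Substitution = std::pair<std::string_view, std::string_view>;

// Single pass over `source`: at each position the first matching pair wins,
// otherwise the current character is copied unchanged.
inline void appendReplaceMultipleSketch(std::string& destination, std::string_view source,
                                        const std::vector<Substitution>& substitutions)
{
    std::size_t i = 0;
    while (i < source.size())
    {
        bool replaced = false;
        for (const Substitution& s : substitutions)
        {
            if (!s.first.empty() && source.compare(i, s.first.size(), s.first) == 0)
            {
                destination.append(s.second);
                i += s.first.size();
                replaced = true;
                break;
            }
        }
        if (!replaced)
            destination.push_back(source[i++]);
    }
}
```

With a single pass, replacement text is never scanned again, so one substitution cannot accidentally feed another.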

StringBuilder::appendHex

Appends given binary data escaping it as hexadecimal ASCII characters.

Parameters
data: Binary data to append to destination buffer
casing: Specifies if it should be appended using upper case or lower case
Returns
true if append succeeded

Example:

uint8_t bytes[4] = {0x12, 0x34, 0x56, 0x78};
String buffer;
StringBuilder builder(buffer);
SC_TEST_EXPECT(builder.appendHex({bytes, sizeof(bytes)}, StringBuilder::AppendHexCase::UpperCase));
SC_TEST_EXPECT(buffer.view() == "12345678");
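
Hexadecimal escaping writes two ASCII digits per input byte, one per nibble. A standalone sketch of that conversion, using the hypothetical function `appendHexSketch` and a plain `bool` in place of the casing enum, is shown below.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Appends each byte of `data` as two hexadecimal ASCII digits.
inline void appendHexSketch(std::string& destination, const std::uint8_t* data, std::size_t size,
                            bool upperCase)
{
    const char* digits = upperCase ? "0123456789ABCDEF" : "0123456789abcdef";
    for (std::size_t i = 0; i < size; ++i)
    {
        destination.push_back(digits[data[i] >> 4]);   // high nibble
        destination.push_back(digits[data[i] & 0x0F]); // low nibble
    }
}
```

The casing only matters for the digits A to F, which is why the documented example with bytes 0x12 to 0x78 prints the same text in both modes.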

String

A non-modifiable owning string with associated encoding. SC::String is (currently) implemented as a SC::Vector with the associated string encoding. A SC::StringView can be obtained from it by calling the SC::String::view method, but it's up to the user to make sure that the usage of such SC::StringView doesn't exceed the lifetime of the SC::String it originated from (thankfully Address Sanitizer will catch the issue if it goes unnoticed).

StringIterator

A position inside a fixed range [start, end) of UTF code points. It's a range of bytes (start and end pointers) with a current pointer pointing at a specific code point of the range. There are three classes derived from it (SC::StringIteratorASCII, SC::StringIteratorUTF8 and SC::StringIteratorUTF16) and they allow doing operations along the string view in UTF code points.

Note
Code points are not the same as perceived characters (that would be grapheme clusters). Invariants: start <= end and it >= start and it <= end.
Template Parameters
CharIterator: StringIteratorASCII, StringIteratorUTF8 or StringIteratorUTF16
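
The three-pointer layout and its invariants (start <= it <= end) can be illustrated with a small UTF-8 iterator that moves one code point at a time in either direction. `Utf8IteratorSketch` is a hypothetical type for this example only, it assumes well-formed UTF-8, and its member names are not taken from the SC classes.

```cpp
// Hypothetical iterator over a fixed byte range [start, end) of UTF-8 text.
struct Utf8IteratorSketch
{
    const char* start; // first byte of the range
    const char* it;    // current position, always on a code point boundary
    const char* end;   // one past the last byte

    Utf8IteratorSketch(const char* s, const char* e) : start(s), it(s), end(e) {}

    // Continuation bytes have the bit pattern 10xxxxxx.
    static bool isContinuation(char c) { return (static_cast<unsigned char>(c) & 0xC0) == 0x80; }

    // Moves to the next code point; returns false when already at end.
    bool stepForward()
    {
        if (it == end)
            return false;
        ++it;
        while (it != end && isContinuation(*it))
            ++it;
        return true;
    }

    // Moves to the previous code point; returns false when already at start.
    bool stepBackward()
    {
        if (it == start)
            return false;
        --it;
        while (it != start && isContinuation(*it))
            --it;
        return true;
    }
};
```

An ASCII iterator would reduce both steps to a single pointer increment or decrement, which is the efficiency argument for keeping a dedicated ASCII variant.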

StringFormat

Formats String with a simple DSL embedded in the format string. This is a small implementation to format using a minimal string-based DSL, but good enough for simple usages. It uses {} placeholders and supports positional arguments.
StringFormat::format(output, "{1} {0}", "World", "Hello") is formatted as "Hello World".
Inside the {}, after a colon (:), a specification string can be used to indicate how to format the given value. As the backend for actual number-to-string formatting is snprintf, such specification strings are the same as what would be given to snprintf. For example, "{:.2}" applied to a floating point value is passed to snprintf as "%.2f".
A { is escaped when it is immediately followed by another {. In other words format("{{") will print a single {.

Example:

String buffer(StringEncoding::Ascii);
StringBuilder builder(buffer);
SC_TEST_EXPECT(builder.format("{1}_{0}_{1}", 1, 0));
SC_TEST_EXPECT(buffer == "0_1_0");
SC_TEST_EXPECT(builder.format("{0:.2}_{1}_{0:.4}", 1.2222, "salve"));
SC_TEST_EXPECT(buffer == "1.22_salve_1.2222");
Note
It's not convenient to use SC::StringFormat directly, as you should probably use SC::StringBuilder
Template Parameters
RangeIterator: Type of the specific StringIterator used
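
The mechanics described in this section (positional placeholders, a specification forwarded to snprintf, and the {{ escape) can be sketched in a few lines. `formatDoublesSketch` below is a hypothetical function that handles only double arguments and requires an explicit position in every placeholder; it illustrates the idea and is not SC::StringFormat.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>
#include <vector>

// Sketch: "{N}" selects argument N, "{N:spec}" forwards spec to snprintf as
// "%<spec>f", and "{{" emits a literal '{'. Only doubles are supported.
inline bool formatDoublesSketch(std::string& out, std::string_view fmt, const std::vector<double>& args)
{
    out.clear(); // replace previous contents
    std::size_t i = 0;
    while (i < fmt.size())
    {
        if (fmt[i] != '{')
        {
            out.push_back(fmt[i++]);
            continue;
        }
        if (i + 1 < fmt.size() && fmt[i + 1] == '{')
        {
            out.push_back('{'); // escaped brace
            i += 2;
            continue;
        }
        const std::size_t close = fmt.find('}', i);
        if (close == std::string_view::npos)
            return false;
        const std::string_view body   = fmt.substr(i + 1, close - i - 1);
        const std::size_t      colon  = body.find(':');
        const std::string_view digits = body.substr(0, colon);
        if (digits.empty())
            return false; // this sketch requires an explicit position
        std::size_t index = 0;
        for (char d : digits)
        {
            if (d < '0' || d > '9')
                return false;
            index = index * 10 + static_cast<std::size_t>(d - '0');
        }
        if (index >= args.size())
            return false;
        std::string spec = "%";
        if (colon != std::string_view::npos)
        {
            for (char s : body.substr(colon + 1))
            {
                if (s != '.' && (s < '0' || s > '9'))
                    return false; // only digits and '.' reach snprintf
                spec.push_back(s);
            }
        }
        spec.push_back('f');
        char      buffer[64];
        const int written = std::snprintf(buffer, sizeof(buffer), spec.c_str(), args[index]);
        if (written < 0 || static_cast<std::size_t>(written) >= sizeof(buffer))
            return false;
        out.append(buffer, static_cast<std::size_t>(written));
        i = close + 1;
    }
    return true;
}
```

Restricting the specification to digits and dots keeps user-supplied text from injecting extra conversion specifiers into the snprintf format string.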

StringConverter

Converts String to a different encoding (UTF8, UTF16). SC::StringConverter converts strings between different UTF encodings and can add null-terminator if requested. When the SC::StringView is already null-terminated, the class just forwards the original SC::StringView.

Example:

const char utf8String1[] = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E"; // "日本語" in UTF-8
const char utf16String1[] = "\xE5\x65\x2C\x67\x9E\x8a"; // "日本語" in UTF-16LE
SmallVector<char, 255> buffer;
StringView input, output, expected;
input = StringView({utf8String1, sizeof(utf8String1) - 1}, false, StringEncoding::Utf8);
expected = StringView({utf16String1, sizeof(utf16String1) - 1}, false, StringEncoding::Utf16);
buffer.clear();
SC_TEST_EXPECT(StringConverter::convertEncodingToUTF16(input, buffer, &output, StringConverter::AddZeroTerminator));
SC_TEST_EXPECT(output == expected);
input = StringView({utf16String1, sizeof(utf16String1) - 1}, false, StringEncoding::Utf16);
expected = StringView({utf8String1, sizeof(utf8String1) - 1}, false, StringEncoding::Utf8);
buffer.clear();
SC_TEST_EXPECT(StringConverter::convertEncodingToUTF8(input, buffer, &output, StringConverter::DoNotAddZeroTerminator));
SC_TEST_EXPECT(output == expected);
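
The encoding step of such a conversion can be sketched with standard string types. The hypothetical function `encodeUtf16Sketch` below turns already decoded code points into UTF-16 code units, emitting surrogate pairs where needed and optionally appending a zero terminator; it works on `std::u32string` and `std::u16string` rather than on the byte buffers used by SC::StringConverter.

```cpp
#include <string>

// Encodes code points as UTF-16 code units. Code points above U+FFFF become a
// surrogate pair. A zero code unit is appended when addZeroTerminator is true.
inline std::u16string encodeUtf16Sketch(const std::u32string& codePoints, bool addZeroTerminator)
{
    std::u16string out;
    for (char32_t cp : codePoints)
    {
        if (cp >= 0x10000)
        {
            const char32_t v = cp - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high surrogate
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
        }
        else
        {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    if (addZeroTerminator)
        out.push_back(u'\0');
    return out;
}
```

Making the terminator optional matters because a converted string is often passed straight to an operating system API that expects one, while intermediate buffers usually do not need it.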

StringAlgorithms

Algorithms operating on strings (glob / wildcard).
Example:

SC_ASSERT(StringAlgorithms::matchWildcard("", ""));
SC_ASSERT(StringAlgorithms::matchWildcard("1?3", "123"));
SC_ASSERT(StringAlgorithms::matchWildcard("1*3", "12223"));
SC_ASSERT(StringAlgorithms::matchWildcard("*2", "12"));
SC_ASSERT(not StringAlgorithms::matchWildcard("*1", "12"));
SC_ASSERT(not StringAlgorithms::matchWildcard("*1", "112"));
SC_ASSERT(not StringAlgorithms::matchWildcard("**1", "112"));
SC_ASSERT(not StringAlgorithms::matchWildcard("*?1", "112"));
SC_ASSERT(StringAlgorithms::matchWildcard("1*", "12123"));
SC_ASSERT(StringAlgorithms::matchWildcard("*/myString", "myString/myString/myString"));
SC_ASSERT(StringAlgorithms::matchWildcard("**/myString", "myString/myString/myString"));
SC_ASSERT(not StringAlgorithms::matchWildcard("*/String", "myString/myString/myString"));
SC_ASSERT(StringAlgorithms::matchWildcard("*/Directory/File.cpp", "/Root/Directory/File.cpp"));
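
A common way to implement this kind of matching is a linear scan that remembers the last `*` and backtracks to it on a mismatch. The sketch below follows that approach for ASCII text, where `?` matches exactly one character and `*` matches any run of characters; `matchWildcardSketch` is a hypothetical function and not necessarily the algorithm the library uses.

```cpp
#include <cstddef>
#include <string_view>

// Glob-style matching: '?' matches one character, '*' matches zero or more.
inline bool matchWildcardSketch(std::string_view pattern, std::string_view text)
{
    std::size_t p     = 0;
    std::size_t t     = 0;
    std::size_t starP = std::string_view::npos; // pattern position right after the last '*'
    std::size_t starT = 0;                      // text position where that '*' currently ends
    while (t < text.size())
    {
        if (p < pattern.size() && pattern[p] == '*')
        {
            starP = ++p; // remember the star, initially matching nothing
            starT = t;
        }
        else if (p < pattern.size() && (pattern[p] == '?' || pattern[p] == text[t]))
        {
            ++p;
            ++t;
        }
        else if (starP != std::string_view::npos)
        {
            p = starP;   // mismatch: let the last '*' swallow one more character
            t = ++starT;
        }
        else
        {
            return false;
        }
    }
    while (p < pattern.size() && pattern[p] == '*')
        ++p; // trailing stars can match the empty string
    return p == pattern.size();
}
```

Keeping only the most recent star is enough because an earlier star never needs to grow once a later one has been reached.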

Console

Writes to console using SC::StringFormat.

Example:

// Create a buffer used for UTF conversions (if necessary)
SmallVector<char, 512 * sizeof(native_char_t)> consoleConversionBuffer;
// Construct console with the buffer (assumed constructor signature)
Console console(consoleConversionBuffer);
String str = StringView("Test Test\n");
// Have fun printing
console.print(str.view());

Implementation

A design choice of the library is that strings cannot be modified. Strings are either read-only (SC::StringView) or they need to be built from scratch with SC::StringBuilder. Another design choice is to support different encodings (ASCII, UTF8 or UTF16). The reason is that ASCII is efficient when it's known that the strings being manipulated have code points made of a single byte. UTF8 is useful on POSIX platforms and UTF16 is needed because that's the default encoding used by the Win32 API. All functions interacting with the filesystem, for example the ones in FileSystem or FileSystemIterator, return strings in the operating system's native encoding. This means that on Windows they will be UTF16 strings and on Apple devices (or Linux) they will be UTF8.

Roadmap

We need to understand if we want to allow iterating grapheme clusters (perceived end-user 'characters') or advanced capabilities like normalization and uppercase / lowercase conversions. As doing these operations from scratch is non-trivial, we will investigate whether there are OS functions that provide this functionality.

🟦 Complete Features:

  • UTF Normalization
  • UTF Case Conversion

💡 Unplanned Features:

  • UTF word breaking
  • Grapheme Cluster iteration