Efficiently escaping strings using Cow in Rust

This is a handy pattern to use for efficiently escaping text, and it's also a good demonstration of Rust's Cow 🐄 type.

Basically, if we need to escape some text (to include in HTML or a CSV file or whatever) we might need to return a new String which has all of the special characters in the original replaced with the appropriate escape sequences. But probably most of our input strings won't actually need to be escaped, so how can we avoid allocating a bunch of strings unnecessarily in those cases?

If the string needs escaping we want to return a brand new String, but if it doesn't we want to return a reference to the original (&str) and avoid any memory allocations altogether. std::borrow::Cow is handy for this. It's just a simple enum of either some borrowed data (&str in our case), or the owned equivalent of that data (String).
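
To make the two variants concrete, here is a minimal sketch, unrelated to escaping, of a function that only allocates when it has to. The names expand_tabs and byte_len are hypothetical and made up purely for illustration.

```rust
use std::borrow::Cow;

// Hypothetical helper, not part of the escaping code: expands tabs to
// four spaces, allocating only when the input actually contains a tab.
fn expand_tabs(input: &str) -> Cow<'_, str> {
    if input.contains('\t') {
        // Something has to change, so build and return an owned `String`.
        Cow::Owned(input.replace('\t', "    "))
    } else {
        // Nothing to change: hand back the original slice, no allocation.
        Cow::Borrowed(input)
    }
}

// Hypothetical caller: `Cow<str>` dereferences to `str`, so either
// variant can be used wherever a `&str` method is needed.
fn byte_len(text: &Cow<'_, str>) -> usize {
    text.len()
}
```

A caller that really needs a String can call into_owned() on the result, which only copies the data if it was borrowed.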

The gist of our algorithm is this:

  • Iterate through the characters of the string until you find one that needs escaping.
  • If you don't find any special characters just return the pre-existing string with Cow::Borrowed(input) (no new allocations needed 🎉).
  • If you do find a special character then:
    • Copy the characters up to that point into a new mutable String (we know the characters before the first special character are safe because we have already checked them).
    • Iterate through the remaining characters, escaping them as needed, and then appending them to our new String.
    • Return the new string using Cow::Owned(escaped_string).

Here's an example escaping HTML characters:

use std::borrow::Cow;

pub fn html_escape(input: &str) -> Cow<str> {
    // Iterate through the characters, checking if each one needs escaping.
    // `char_indices()` yields byte offsets, which is what string slicing
    // expects; a character count from `chars().enumerate()` can make the
    // slices below panic when multi-byte characters come first.
    for (i, ch) in input.char_indices() {
        if html_escape_char(ch).is_some() {
            // At least one char needs escaping, so we need to return a brand
            // new `String` rather than the original

            let mut escaped_string = String::with_capacity(input.len());
            // Calling `String::with_capacity()` instead of `String::new()` is
            // a slight optimisation to reduce the number of allocations we
            // need to do.
            //
            // We know that the escaped string is always at least as long as
            // the unescaped version so we can preallocate at least that much
            // space.

            // We already checked that the characters before byte offset `i`
            // don't need escaping, so we can just copy them straight in
            escaped_string.push_str(&input[..i]);

            // Escape the remaining characters if they need it and add them to
            // our escaped string
            for ch in input[i..].chars() {
                match html_escape_char(ch) {
                    Some(escaped_char) => escaped_string.push_str(escaped_char),
                    None => escaped_string.push(ch),
                };
            }

            return Cow::Owned(escaped_string);
        }
    }

    // We've iterated through all of `input` and didn't find any special
    // characters, so it's safe to just return the original string
    Cow::Borrowed(input)
}

fn html_escape_char(ch: char) -> Option<&'static str> {
    match ch {
        '&' => Some("&amp;"),
        '<' => Some("&lt;"),
        '>' => Some("&gt;"),
        '"' => Some("&quot;"),
        '\'' => Some("&#x27;"),
        _ => None,
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn returns_text_that_does_not_need_escaping_as_is() {
        let input = "This is a safe string!";

        let escaped = html_escape(input);

        assert_eq!(escaped, Cow::Borrowed(input));
    }

    #[test]
    fn escapes_text_containing_html_special_characters() {
        let input = "This is a <script>alert('nasty');</script> string";

        let expected: Cow<str> = Cow::Owned(
            "This is a &lt;script&gt;alert(&#x27;nasty&#x27;);&lt;/script&gt; string".to_string(),
        );

        let escaped = html_escape(input);

        assert_eq!(escaped, expected);
    }
}
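
One detail worth keeping in mind with this pattern is that slicing a &str works on byte offsets, not character counts, which matters as soon as the input contains non-ASCII text. As a rough sketch of one way to sidestep that, the same steps can be written with str::find, which returns the byte offset of the first matching character. The names html_escape_find, escape_char and allocated are hypothetical and only exist for this illustration.

```rust
use std::borrow::Cow;

// Same mapping as `html_escape_char`, repeated here so that this sketch
// compiles on its own.
fn escape_char(ch: char) -> Option<&'static str> {
    match ch {
        '&' => Some("&amp;"),
        '<' => Some("&lt;"),
        '>' => Some("&gt;"),
        '"' => Some("&quot;"),
        '\'' => Some("&#x27;"),
        _ => None,
    }
}

// Hypothetical variant of `html_escape` built on `str::find`.
fn html_escape_find(input: &str) -> Cow<'_, str> {
    // Byte offset of the first character that needs escaping, if any.
    let first = match input.find(|ch: char| escape_char(ch).is_some()) {
        Some(offset) => offset,
        // Nothing to escape: return the original slice untouched.
        None => return Cow::Borrowed(input),
    };

    let mut escaped = String::with_capacity(input.len());
    // Everything before `first` is known to be safe, so copy it verbatim.
    escaped.push_str(&input[..first]);
    // Escape the rest character by character.
    for ch in input[first..].chars() {
        match escape_char(ch) {
            Some(replacement) => escaped.push_str(replacement),
            None => escaped.push(ch),
        }
    }
    Cow::Owned(escaped)
}

// Hypothetical caller-side check: did escaping have to allocate?
fn allocated(text: &Cow<'_, str>) -> bool {
    matches!(text, Cow::Owned(_))
}
```

Because the result dereferences to str, callers can usually pass it along as a &str without caring which variant they got; a check like allocated is mostly useful in tests or when measuring how often the fast path is taken.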