20 Aug 2016

The re2r package is part of the Google Summer of Code 2016 Project.

The Regular Expression Denial of Service

The regular expression denial of service (ReDoS) is an algorithmic complexity attack that produces a denial-of-service by providing a regular expression that takes a very long time to evaluate.

On July 20, Stack Overflow experienced a service outage . This blog post explained the direct cause of the outage. A regular expression, which is used to trim Unicode space from start and end of a line, consume high CPU on the web servers.

Regular expression is a big part of the text analysis in R. Do we need to worry about ReDoS?

stack

Regular Expressions in R

R provides two types of regular expressions in base package, extended regular expressions (the default) with TRE and Perl-like regular expressions used by perl = TRUE with PCRE.

PCRE includes useful features, such as named capture, but it uses a backtracking algorithm, so it is easy to take exponential time or arbitrary stack depth for certain regular expressions. Using PCRE in the service backend would have left it open to easy denial of service attacks.

TRE has a polynomial time complexity but does not include named capture.

stringi is an R package use the regular expression engine from the ICU library, which has an exponential time complexity. The stringi package does not support named capture yet because it is still considered as experimental in ICU.

About re2r

RE2 is a primarily DFA based regular expression engine from Google that is fast at matching large amounts of text with named capture. Users can build fast and scalable service backend with RE2 library. RE2 library has fewer syntaxes support than the ICU library. It does not support backtracking, but it gains speed and safety.

re2r package is the interface to the RE2 library in R. There are mainly 10 methods you can use with regular expression. These methods are similar to the popular stringi and stringr. If you are familiar with this two package, it will be easy to learn this package. These methods are detect, match, extract, count, replace, split, locate, quote, show, and compile.

With the help of the stringi package, re2r handle UTF-8 string very well. All input strings will be checked by the stringi package to ensure the strings are valid UTF-8 encoded.

re2r Performance

re2r package provides the R community the first regular expression package with both named capture group feature and polynomial time complexity. If you need to use regexes in R for web APIs, this package will provide a safe regular expressions interface for your services. For example, this package will be useful for Shiny application on the internet.

To reproduce the stackoverflow.com outage example, we run a benchmark on these difference kind of regular expression packages. The source code of the benchmark can be view on here.

stack

The regular expression was: ^[\s\u200c]+|[\s\u200c]+$ Which is intended to trim Unicode space from start and end of a line. A simplified version of the Regex that exposes the same issue would be \s+$ which to a human looks easy, all of the spaces at the end of the string, but which means quite some work for a simple backtracking regex engine.

In this benchmark, ICU library and PCRE library has very bad performance when the string size increase.

If you want to know more about the package performance, you can find more benchmarks on the re2r benchmark page.

Learn re2r

If you are familiar with regular expressions, you can check out the vignettes and package help page o learn how to use the R functions. These functions are similar to stringr and stringi packages.

If you are new to regular expressions, please check out Learn Regular Expression with re2r on GitHub. You can read the tutorial or use swirl package to learn by doing.

Here are the total links to the package resources:

This package is not on CRAN for now, because CRAN now uses GCC 6.1.1 that contains bug and it can not compile RE2 library for now. More detail on the RE2 issue tracker. This package will be published to CRAN, when CRAN updates the GCC compiler.

If you find any issue about this package, please let me know on GitHub.

In the last part of the post, I want to thank Toby Dylan Hocking and Marek Gagolewski for their ideas and help for developing this package. And I also want to thank Google for providing this event for me to participate in the R open source community.

RE2 rstats

12 Aug 2016

CRAN submission can be hard for the first time CRAN contributor.

Here is todo list CRAN submission, based on What I learn from my experience and the resourses on the Internet.

Title case - Package title should be in title case.

Authors - Make sure you list all of the authors about the code you use on your package.

Run R CMD check with --as-cran.

If your package contains compiled code, you should check every platform that you can reach to ensure the package can compile on CRAN.

Write CRAN comments about all the note you find on R CMD check.

Test your packge with valgrind and Undefined Behavior Sanitizer.

CRAN now tests package with valgrind, UBSAN and ASAN now and then. Make use your package pass these tests before they are on CRAN, this will reduce the trouble you face, when the package is on CRAN.

re2r CRAN rstats