re2r package is part of the Google Summer of Code 2016 Project.
The Regular Expression Denial of Service
The regular expression denial of service (ReDoS) is an algorithmic complexity attack that produces a denial-of-service by providing a regular expression that takes a very long time to evaluate.
On July 20, Stack Overflow experienced a service outage . This blog post explained the direct cause of the outage. A regular expression, which is used to trim Unicode space from start and end of a line, consume high CPU on the web servers.
Regular expression is a big part of the text analysis in R. Do we need to worry about ReDoS?
Regular Expressions in R
R provides two types of regular expressions in base package, extended regular expressions (the default) with TRE and Perl-like regular expressions used by
perl = TRUE with PCRE.
PCRE includes useful features, such as named capture, but it uses a backtracking algorithm, so it is easy to take exponential time or arbitrary stack depth for certain regular expressions. Using PCRE in the service backend would have left it open to easy denial of service attacks.
TRE has a polynomial time complexity but does not include named capture.
stringi is an R package use the regular expression engine from the ICU library, which has an exponential time complexity. The stringi package does not support named capture yet because it is still considered as experimental in ICU.
RE2 is a primarily DFA based regular expression engine from Google that is fast at matching large amounts of text with named capture. Users can build fast and scalable service backend with RE2 library. RE2 library has fewer syntaxes support than the ICU library. It does not support backtracking, but it gains speed and safety.
re2r package is the interface to the RE2 library in R. There are mainly 10 methods you can use with regular expression. These methods are similar to the popular
stringr. If you are familiar with this two package, it will be easy to learn this package. These methods are
With the help of the
re2r handle UTF-8 string very well. All input strings will be checked by the stringi package to ensure the strings are valid UTF-8 encoded.
re2r package provides the R community the first regular expression package with both named capture group feature and polynomial time complexity. If you need to use regexes in R for web APIs, this package will provide a safe regular expressions interface for your services. For example, this package will be useful for Shiny application on the internet.
To reproduce the stackoverflow.com outage example, we run a benchmark on these difference kind of regular expression packages. The source code of the benchmark can be view on here.
The regular expression was:
^[\s\u200c]+|[\s\u200c]+$ Which is intended to trim Unicode space from start and end of a line. A simplified version of the Regex that exposes the same issue would be
\s+$ which to a human looks easy, all of the spaces at the end of the string, but which means quite some work for a simple backtracking regex engine.
In this benchmark, ICU library and PCRE library has very bad performance when the string size increase.
If you want to know more about the package performance, you can find more benchmarks on the
re2r benchmark page.
If you are familiar with regular expressions, you can check out the vignettes and package help page o learn how to use the R functions. These functions are similar to
If you are new to regular expressions, please check out Learn Regular Expression with re2r on GitHub. You can read the tutorial or use
swirl package to learn by doing.
Here are the total links to the package resources:
This package is not on CRAN for now, because CRAN now uses GCC 6.1.1 that contains bug and it can not compile RE2 library for now. More detail on the RE2 issue tracker. This package will be published to CRAN, when CRAN updates the GCC compiler.
- GitHub Repo - https://github.com/qinwf/re2r
- Introduction to re2r - Package Vignettes.
- Learn Regular Expression with re2r - Tutorial for people who is new to regular expression.
- Syntax Table - List all of the syntax supported by RE2 library.
- Benchmarks - Benchmark with other R packages.
If you find any issue about this package, please let me know on GitHub.
In the last part of the post, I want to thank Toby Dylan Hocking and Marek Gagolewski for their ideas and help for developing this package. And I also want to thank Google for providing this event for me to participate in the R open source community.RE2 rstats