---
output: hugodown::hugo_document

slug: nanoparquet-0-4-0
title: nanoparquet 0.4.0
date: 2025-01-28
author: Gábor Csárdi
description: >
  nanoparquet 0.4.0 comes with a new and much faster `read_parquet()`,
  configurable type mappings in `write_parquet()`, and a new
  `append_parquet()`.

photo:
  url: https://www.pexels.com/photo/person-running-in-the-hallway-796545/
  author: Michael Foster

# one of: "deep-dive", "learn", "package", "programming", "roundup", or "other"
categories: [package]
tags: [parquet]
---

<!--
TODO:
* [x] Look over / edit the post's title in the yaml
* [x] Edit (or delete) the description; note this appears in the Twitter card
* [x] Pick category and tags (see existing with `hugodown::tidy_show_meta()`)
* [x] Find photo & update yaml metadata
* [x] Create `thumbnail-sq.jpg`; height and width should be equal
* [x] Create `thumbnail-wd.jpg`; width should be >5x height
* [x] `hugodown::use_tidy_thumbnails()`
* [x] Add intro sentence, e.g. the standard tagline for the package
* [x] `usethis::use_tidy_thanks()`
-->

We're thrilled to announce the release of
[nanoparquet](https://nanoparquet.r-lib.org/) 0.4.0. nanoparquet is an R
package that reads and writes Parquet files.

You can install it from CRAN with:

```r
install.packages("nanoparquet")
```

This blog post shows the most important new features of nanoparquet
0.4.0. You can see the full list of changes in the [release notes](
  https://nanoparquet.r-lib.org/news/index.html#nanoparquet-040).

## Brand new `read_parquet()`

nanoparquet 0.4.0 comes with a completely rewritten Parquet reader.
The new version has an architecture that is easier to embed into R, and
it facilitates fantastic new features, like `append_parquet()` and the
new `col_select` argument. (More to come!) The new reader is also much
faster; see the "Benchmarks" section below.

## Read a subset of columns

`read_parquet()` now has a new argument called `col_select`, which lets
you read a subset of the columns from a Parquet file. Unlike with
row-oriented file formats like CSV, this means that the reader never
needs to touch the columns that are not selected. The time required to
read a subset of columns is independent of how many other columns the
Parquet file has!

You can either use column indices or column names to specify the columns
to read. Here is an example.

```{r setup}
library(nanoparquet)
library(pillar)
```

```{r include = FALSE}
if (!file.exists("flights.parquet")) {
  write_parquet(nycflights13::flights, "flights.parquet")
}
```

This is the `nycflights13::flights` data set:

```{r col_select}
read_parquet(
  "flights.parquet",
  col_select = c("dep_time", "arr_time", "carrier")
)
```

Use `read_parquet_schema()` if you want to see the structure of the
Parquet file first:

```{r read_parquet_schema}
read_parquet_schema("flights.parquet")
```

The output of `read_parquet_schema()` also shows you the R type that
nanoparquet will use for each column.

## Appending to Parquet files

The new `append_parquet()` function makes it easy to append new data to
a Parquet file, without first reading the whole file into memory.
The schema of the file and the schema of the new data must match, of
course. Let's merge `nycflights13::flights` and `nycflights23::flights`:

```{r append_parquet}
file.copy("flights.parquet", "allflights.parquet", overwrite = TRUE)
append_parquet(nycflights23::flights, "allflights.parquet")
```

`read_parquet_info()` returns the most basic information about a Parquet
file:

```{r read_parquet_info}
read_parquet_info("flights.parquet")
read_parquet_info("allflights.parquet")
```

Note that you should probably still create a backup copy of the original
file when using `append_parquet()`. If the appending process is
interrupted, e.g. by a power failure, you might end up with an incomplete
and invalid Parquet file.

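One way to automate that advice is to wrap the append in a small helper
that restores the backup if anything goes wrong. This is just a sketch,
not part of nanoparquet; `append_with_backup()` is a hypothetical helper
built from base R and `append_parquet()`:

```r
# Hypothetical helper (not part of nanoparquet): back up the file,
# append, and restore the backup if append_parquet() errors out.
append_with_backup <- function(df, path) {
  backup <- paste0(path, ".bak")
  file.copy(path, backup, overwrite = TRUE)
  tryCatch(
    nanoparquet::append_parquet(df, path),
    error = function(e) {
      # Put the original file back before re-throwing the error
      file.copy(backup, path, overwrite = TRUE)
      stop(e)
    }
  )
  unlink(backup)
  invisible(path)
}
```
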
## Schemas and type conversions

In nanoparquet 0.4.0 `write_parquet()` takes a `schema` argument that
can customize the R to Parquet type mappings. For example, by default
`write_parquet()` writes an R character vector as a `STRING` Parquet
type. If you'd like to write a certain character column as an `ENUM`
type^[A Parquet `ENUM` type is very similar to a factor in R.]
instead, you'll need to specify that in `schema`:


```{r schema}
write_parquet(
  nycflights13::flights,
  "newflights.parquet",
  schema = parquet_schema(carrier = "ENUM")
)
read_parquet_schema("newflights.parquet")
```

Here we wrote the `carrier` column as `ENUM`, and left the other
columns to use the default type mappings.

See the [`?nanoparquet-types`](
  https://nanoparquet.r-lib.org/reference/nanoparquet-types.html#r-s-data-types
) manual page for the possible type mappings (lots of new ones!) and also
for the default ones.

## Encodings

It is now also possible to customize the encoding of each column in
`write_parquet()`, via the `encoding` argument. By default
`write_parquet()` tries to choose a good encoding based on the type and
the values of a column. For example, it checks a small sample for
repeated values to decide whether it is worth using dictionary encoding
(`RLE_DICTIONARY`).

If `write_parquet()` gets it wrong, use the `encoding` argument to force
an encoding. The following forces the `PLAIN` encoding for all columns.
This encoding is very fast to write, but creates a larger file. You can
also specify different encodings for different columns; see the
[`write_parquet()` manual page](
  https://nanoparquet.r-lib.org/reference/write_parquet.html).

```{r encoding}
write_parquet(
  nycflights13::flights,
  "plainflights.parquet",
  encoding = "PLAIN"
)
file.size("flights.parquet")
file.size("plainflights.parquet")
```

See more about the implemented encodings and how the defaults are
selected in the [`parquet-encodings` manual page](
  https://nanoparquet.r-lib.org/reference/parquet-encodings.html).

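As a sketch of per-column encodings: our reading of the
`write_parquet()` manual page is that `encoding` can be a named
character vector, with an unnamed element giving the default. Please
verify the exact form on the manual page before relying on it:

```r
# Hypothetical mix: dictionary-encode `carrier`, PLAIN for the rest.
write_parquet(
  nycflights13::flights,
  "mixedflights.parquet",
  encoding = c(carrier = "RLE_DICTIONARY", "PLAIN")
)
```
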
## API changes

Some nanoparquet functions have new, better names in nanoparquet 0.4.0.
In particular, all functions that read from a Parquet file have a
`read_parquet` prefix now. The old functions still work, with a warning.

Also, the `parquet_schema()` function is now for creating a new Parquet
schema from scratch, and not for inferring a schema from a data frame
(use `infer_parquet_schema()`) or for reading the schema from a Parquet
file (use `read_parquet_schema()`). `parquet_schema()` falls back to the
old behaviour when called with a file name, with a warning, so this is
not a breaking change (yet), and old code still works.

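As a sketch of the renaming pattern (we're assuming here that
`parquet_info()` and `parquet_metadata()` were among the renamed
functions; check the Changelog for the full mapping):

```r
# Old names still work, but warn:
# parquet_info("flights.parquet")
# parquet_metadata("flights.parquet")

# New names in 0.4.0, all with a `read_parquet` prefix:
read_parquet_info("flights.parquet")
read_parquet_metadata("flights.parquet")
read_parquet_schema("flights.parquet")

# `parquet_schema()` now creates a schema from scratch:
parquet_schema(carrier = "ENUM", dep_time = "INT32")
```
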
See the complete list of API changes in the [Changelog](
  https://nanoparquet.r-lib.org/news/index.html).

## Benchmarks

We are very excited about the performance of the new Parquet reader, and
the Parquet writer was always quite speedy, so we ran a simple benchmark.

We compared nanoparquet to the Parquet implementations in Apache Arrow
and DuckDB, and also to CSV readers and writers, on a real data set, for
samples of 330k, 6.7 million and 67.4 million rows (40MB, 800MB and 8GB
in memory). For these data nanoparquet is indeed very competitive with
both Arrow and DuckDB.

You can see the full results [on the website](
  https://nanoparquet.r-lib.org/articles/benchmarks.html).

## Other changes

Other important changes in nanoparquet 0.4.0 include:

* `write_parquet()` can now write multiple row groups. By default it puts
  at most 10 million rows in a single row group. See the
  [`parquet_options()` manual page](
    https://nanoparquet.r-lib.org/reference/parquet_options.html
  ) on how to change the default.

* `write_parquet()` now writes minimum and maximum statistics (by default)
  for most Parquet types. See the [`parquet_options()` manual page](
    https://nanoparquet.r-lib.org/reference/parquet_options.html
  ) on how to turn this off, which will probably make the writer faster.

* `write_parquet()` can now write version 2 data pages. The default is
  still version 1, but it might change in the future.

* New `compression_level` option to select the compression level manually.

* `read_parquet()` can now read from an R connection.

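Several of these behaviours can be tuned through `parquet_options()`.
The sketch below reflects our reading of the `?parquet_options` manual
page; the exact option names (`num_rows_per_row_group`,
`write_minmax_values`, `write_data_page_version`) should be verified
there:

```r
write_parquet(
  nycflights13::flights,
  "tunedflights.parquet",
  compression = "zstd",
  options = parquet_options(
    num_rows_per_row_group = 1000000, # smaller row groups than the default
    compression_level = 10,           # manual compression level
    write_minmax_values = FALSE,      # skip min/max statistics
    write_data_page_version = 2       # version 2 data pages
  )
)
```
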
## Acknowledgements

[@alvarocombo](https://github.com/alvarocombo), [@D3SL](https://github.com/D3SL), [@gaborcsardi](https://github.com/gaborcsardi), and [@RealTYPICAL](https://github.com/RealTYPICAL).