---
output: hugodown::hugo_document

slug: nanoparquet-0-4-0
title: nanoparquet 0.4.0
date: 2025-01-28
author: Gábor Csárdi
description: >
    nanoparquet 0.4.0 comes with a new and much faster `read_parquet()`,
    configurable type mappings in `write_parquet()`, and a new
    `append_parquet()`.

photo:
  url: https://www.pexels.com/photo/person-running-in-the-hallway-796545/
  author: Michael Foster

# one of: "deep-dive", "learn", "package", "programming", "roundup", or "other"
categories: [package]
tags: [parquet]
---

<!--
TODO:
* [x] Look over / edit the post's title in the yaml
* [x] Edit (or delete) the description; note this appears in the Twitter card
* [x] Pick category and tags (see existing with `hugodown::tidy_show_meta()`)
* [x] Find photo & update yaml metadata
* [x] Create `thumbnail-sq.jpg`; height and width should be equal
* [x] Create `thumbnail-wd.jpg`; width should be >5x height
* [x] `hugodown::use_tidy_thumbnails()`
* [x] Add intro sentence, e.g. the standard tagline for the package
* [x] `usethis::use_tidy_thanks()`
-->

We're thrilled to announce the release of [nanoparquet](
https://nanoparquet.r-lib.org/) 0.4.0. nanoparquet is an R package that
reads and writes Parquet files.

You can install it from CRAN with:

```r
install.packages("nanoparquet")
```

This blog post will show the most important new features of nanoparquet
0.4.0. You can see the full list of changes in the [release notes](
https://nanoparquet.r-lib.org/news/index.html#nanoparquet-040).

## Brand new `read_parquet()`

nanoparquet 0.4.0 comes with a completely rewritten Parquet reader.
The new version has an architecture that is easier to embed into R, and it
facilitates fantastic new features, like `append_parquet()` and the new
`col_select` argument. (More to come!) The new reader is also much faster;
see the "Benchmarks" section below.

## Read a subset of columns

`read_parquet()` now has a new argument called `col_select` that lets you
read a subset of the columns from the Parquet file. Unlike with
row-oriented file formats like CSV, this means that the reader never needs
to touch the columns that are not needed. The time required for reading a
subset of columns is independent of how many more columns the Parquet file
might have!

You can either use column indices or column names to specify the columns
to read. Here is an example.

```{r setup}
library(nanoparquet)
library(pillar)
```

```{r include = FALSE}
if (!file.exists("flights.parquet")) {
  write_parquet(nycflights13::flights, "flights.parquet")
}
```

This is the `nycflights13::flights` data set:

```{r col_select}
read_parquet(
  "flights.parquet",
  col_select = c("dep_time", "arr_time", "carrier")
)
```

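Column indices work the same way. A minimal sketch (the positions below
assume the usual `nycflights13::flights` column order, in which
`dep_time`, `arr_time` and `carrier` are columns 4, 7 and 10):

```r
read_parquet(
  "flights.parquet",
  col_select = c(4, 7, 10)  # dep_time, arr_time, carrier
)
```
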
Use `read_parquet_schema()` if you want to see the structure of the Parquet
file first:

```{r read_parquet_schema}
read_parquet_schema("flights.parquet")
```

The output of `read_parquet_schema()` also shows you the R type that
nanoparquet will use for each column.

## Appending to Parquet files

The new `append_parquet()` function makes it easy to append new data to
a Parquet file, without first reading the whole file into memory.
The schema of the file and the schema of the new data must match, of
course. Let's merge `nycflights13::flights` and `nycflights23::flights`:

```{r append_parquet}
file.copy("flights.parquet", "allflights.parquet", overwrite = TRUE)
append_parquet(nycflights23::flights, "allflights.parquet")
```

`read_parquet_info()` returns the most basic information about a Parquet
file:

```{r read_parquet_info}
read_parquet_info("flights.parquet")
read_parquet_info("allflights.parquet")
```

Note that you should probably still create a backup copy of the original
file when using `append_parquet()`. If the appending process is interrupted
by a power failure, then you might end up with an incomplete and invalid
Parquet file.

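A minimal sketch of that workflow, using base R file operations (the
backup file name is just an illustration):

```r
# keep a backup copy until the append has completed successfully
file.copy("allflights.parquet", "allflights-backup.parquet")
append_parquet(nycflights23::flights, "allflights.parquet")
unlink("allflights-backup.parquet")
```
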
## Schemas and type conversions

In nanoparquet 0.4.0 `write_parquet()` takes a `schema` argument that
can customize the R to Parquet type mappings. For example, by default
`write_parquet()` writes an R character vector as a `STRING` Parquet type.
If you'd like to write a certain character column as an `ENUM`
type^[A Parquet `ENUM` type is very similar to a factor in R.]
instead, you'll need to specify that in `schema`:

```{r schema}
write_parquet(
  nycflights13::flights,
  "newflights.parquet",
  schema = parquet_schema(carrier = "ENUM")
)
read_parquet_schema("newflights.parquet")
```

Here we wrote the `carrier` column as `ENUM`, and left the other
columns to use the default type mappings.

See the [`?nanoparquet-types`](
https://nanoparquet.r-lib.org/reference/nanoparquet-types.html#r-s-data-types
) manual page for the possible type mappings (lots of new ones!) and also
for the default ones.

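Since a Parquet `ENUM` is very similar to an R factor, reading the file
back should give you a factor column. A quick check (assuming the
ENUM-to-factor mapping listed in the `?nanoparquet-types` manual page):

```r
newflights <- read_parquet("newflights.parquet")
class(newflights$carrier)
```
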
## Encodings

It is now also possible to customize the encoding of each column in
`write_parquet()`, via the `encoding` argument. By default
`write_parquet()` tries to choose a good encoding based on the type and the
values of a column. For example, it checks a small sample for repeated
values to decide if it is worth using a dictionary encoding
(`RLE_DICTIONARY`).

If `write_parquet()` gets it wrong, use the `encoding` argument to force an
encoding. The following forces the `PLAIN` encoding for all columns. This
encoding is very fast to write, but creates a larger file. You can also
specify different encodings for different columns; see the
[`write_parquet()` manual page](
https://nanoparquet.r-lib.org/reference/write_parquet.html).

```{r encoding}
write_parquet(
  nycflights13::flights,
  "plainflights.parquet",
  encoding = "PLAIN"
)
file.size("flights.parquet")
file.size("plainflights.parquet")
```

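To set encodings per column, `encoding` can name individual columns. A
hedged sketch (treat the exact syntax as an assumption here and check the
[`write_parquet()` manual page](
https://nanoparquet.r-lib.org/reference/write_parquet.html) for the
authoritative form):

```r
write_parquet(
  nycflights13::flights,
  "mixedflights.parquet",
  # illustrative only: force a dictionary encoding for carrier and leave
  # the remaining columns on the automatic encoding selection
  encoding = c(carrier = "RLE_DICTIONARY")
)
```
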
See more about the implemented encodings and how the defaults are
selected in the [`parquet-encodings` manual page](
https://nanoparquet.r-lib.org/reference/parquet-encodings.html).

## API changes

Some nanoparquet functions have new, better names in nanoparquet 0.4.0.
In particular, all functions that read from a Parquet file have a
`read_parquet` prefix now. The old functions still work, with a warning.

Also, the `parquet_schema()` function is now for creating a new Parquet
schema from scratch, and not for inferring a schema from a data frame
(use `infer_parquet_schema()`) or for reading the schema from a Parquet
file (use `read_parquet_schema()`). `parquet_schema()` falls back to the
old behaviour when called with a file name, with a warning, so this is not
a breaking change (yet), and old code still works.

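In short (all three functions below are part of nanoparquet 0.4.0, as
described above):

```r
# create a Parquet schema from scratch
parquet_schema(carrier = "ENUM")

# infer a schema from a data frame
infer_parquet_schema(nycflights13::flights)

# read the schema of an existing Parquet file
read_parquet_schema("flights.parquet")
```
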
See the complete list of API changes in the [Changelog](
https://nanoparquet.r-lib.org/news/index.html).

## Benchmarks

We are very excited about the performance of the new Parquet reader, and
the Parquet writer has always been quite speedy, so we ran a simple
benchmark.

We compared nanoparquet to the Parquet implementations in Apache Arrow and
DuckDB, and also to CSV readers and writers, on a real data set, for
samples of 330k, 6.7 million and 67.4 million rows (40MB, 800MB and 8GB in
memory). For these data sets nanoparquet is indeed very competitive with
both Arrow and DuckDB.

You can see the full results [on the website](
https://nanoparquet.r-lib.org/articles/benchmarks.html).

## Other changes

Other important changes in nanoparquet 0.4.0 include:

* `write_parquet()` can now write multiple row groups. By default it puts
  at most 10 million rows in a single row group. See the
  [`parquet_options()` manual page](
  https://nanoparquet.r-lib.org/reference/parquet_options.html
  ) on how to change the default, and the sketch after this list.

* `write_parquet()` now writes minimum and maximum statistics (by default)
  for most Parquet types. See the [`parquet_options()` manual page](
  https://nanoparquet.r-lib.org/reference/parquet_options.html
  ) on how to turn this off, which will probably make the writer faster.

* `write_parquet()` can now write version 2 data pages. The default is
  still version 1, but it might change in the future.

* New `compression_level` option to select the compression level manually.

* `read_parquet()` can now read from an R connection.

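Several of these are controlled via `parquet_options()`. A minimal sketch
(the option names below are taken from the [`parquet_options()` manual
page](https://nanoparquet.r-lib.org/reference/parquet_options.html);
double-check them there, as they are assumptions here rather than quotes
from this post):

```r
write_parquet(
  nycflights13::flights,
  "tunedflights.parquet",
  options = parquet_options(
    num_rows_per_row_group = 1000000, # smaller row groups than the default
    write_minmax_values = FALSE,      # skip min/max statistics
    compression_level = 5             # set the compression level manually
  )
)

# read_parquet() now also accepts a connection; assuming a binary-mode
# file connection here
con <- file("tunedflights.parquet", open = "rb")
tuned <- read_parquet(con)
```
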
## Acknowledgements

[&#x0040;alvarocombo](https://github.com/alvarocombo), [&#x0040;D3SL](https://github.com/D3SL), [&#x0040;gaborcsardi](https://github.com/gaborcsardi), and [&#x0040;RealTYPICAL](https://github.com/RealTYPICAL).
