After the mid-term evaluation, things slowed down a bit in terms of merged PRs — but not in terms of learning and contribution. These weeks were heavy on technical discussions, design exploration, and one major feature PR that’s still in progress. Here’s what I worked on:
1. Variable Scoping in by with Character Vectors (#3178)
This discussion started from an observation:
v5 = "z"
data.table(v1 = 1)[, .N, by = "v5"]
Currently returns:
   v5 N
1:  z 1
even though “v5” isn’t a column of the table. The string “v5” is resolved as a variable from the parent scope, which can cause subtle bugs: if a column with the same name is added later, the same call silently changes meaning.
We explored whether we should:
- Keep this behavior for backward compatibility
- Add a warning for future releases to give downstream packages time to adapt
- Extend the warning to cases where symbols in .(...) also unexpectedly scope to parent variables
We decided the warnings are worth implementing, but they’ll likely land in a future major release to avoid surprises.
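The fall-through described above can be reproduced in base R, without data.table. This is only a minimal sketch of the general lookup order (columns first, then the enclosing environment), not data.table's actual implementation; lookup_by is a hypothetical helper made up for illustration.

```r
# Hypothetical helper (not part of data.table): resolve a name against a
# list of columns first, then fall back to the enclosing environment.
lookup_by <- function(name, columns, env = parent.frame()) {
  force(env)  # pin the caller's environment before evaluating anything
  eval(as.name(name), envir = columns, enclos = env)
}

v5 <- "z"

# No column called v5: the lookup silently falls through to the variable above.
no_col <- lookup_by("v5", list(v1 = 1))

# Once a column called v5 exists, the same call resolves to the column instead.
with_col <- lookup_by("v5", list(v1 = 1, v5 = "a"))

print(no_col)
print(with_col)
```

The two calls are textually identical apart from the data, which is why a warning on the fall-through case seems useful: nothing else tells the user which of the two meanings they got.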
2. Auto-Naming in shift() with give.names = TRUE (#3905)
The current shift() output names look like:
V1_lead_1 V1_lag_1 V2_lead_1 V2_lag_1
But when you pass named inputs like .(latitude, longitude), it feels more natural to expect:
latitude_lead_1 latitude_lag_1 longitude_lead_1 longitude_lag_1
We discussed a generic fix:
- Preprocess calls like .(x, y) into .(x = x, y = y) early in evaluation
- This would benefit shift() and other functions without per-function hacks
I suggested implementing this in replace_dot_alias(), but maintainers decided to hold off, as it would be a breaking change late in the release cycle.
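To make the preprocessing idea concrete, here is a rough base-R sketch of rewriting .(x, y) into .(x = x, y = y) on an unevaluated call. The helper name name_symbol_args is hypothetical, and this is not the code discussed for replace_dot_alias(); it only illustrates the transformation, naming bare symbols and leaving already-named or non-symbol arguments alone.

```r
# Hypothetical helper: give every unnamed bare-symbol argument of a call
# its own name, so .(x, y) becomes .(x = x, y = y).
name_symbol_args <- function(call) {
  nms <- names(call)
  if (is.null(nms)) nms <- rep("", length(call))
  # Position 1 is the function being called, so start at 2.
  for (i in seq_along(call)[-1L]) {
    if (nms[i] == "" && is.name(call[[i]])) {
      nms[i] <- as.character(call[[i]])
    }
  }
  names(call) <- nms
  call
}

named <- name_symbol_args(quote(.(latitude, longitude)))
print(named)

# Existing names and non-symbol arguments are left untouched.
mixed <- name_symbol_args(quote(.(lat = latitude, longitude, longitude + 1)))
print(mixed)
```

Because the rewrite happens on the call itself, any function that later builds output names from argument names would pick them up without function-specific handling, which is the appeal of the generic fix.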
3. fwrite() Gains a select Parameter (In Progress) (#4177 / #7236)
This was the most hands-on work of the period. The issue was:
fwrite(DT[, .(a, c)]) # creates a full in-memory copy of the subset
On huge datasets, that’s expensive or impossible.
I proposed adding a select parameter so you can write specific columns without creating a temporary object:
fwrite(DT, "file.csv", select = c("a", "c"))
For data.table inputs, the implementation uses .shallow() to create a shallow copy referencing only the selected columns, with no data duplication. For other inputs (data.frame, list, matrix), we subset directly.
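The "subset directly" branch for non-data.table inputs might look roughly like the sketch below. It uses base R only, with write.csv standing in for fwrite(), since the select argument belongs to an open PR and may not exist in an installed data.table. The helper select_columns is hypothetical and is not the PR's code.

```r
# Hypothetical helper: pick columns by name before writing, per input type.
select_columns <- function(x, select) {
  if (is.matrix(x)) {
    available <- colnames(x)
  } else if (is.list(x)) {        # covers data.frame and plain lists
    available <- names(x)
  } else {
    stop("unsupported input type")
  }
  missing_cols <- setdiff(select, available)
  if (length(missing_cols) > 0L) {
    stop("unknown column(s): ", paste(missing_cols, collapse = ", "))
  }
  if (is.matrix(x)) {
    x[, select, drop = FALSE]     # keep the result two-dimensional
  } else {
    x[select]                     # select list elements / columns by name
  }
}

df <- data.frame(a = 1:3, b = letters[1:3], c = c(2.5, 3.5, 4.5))
out <- tempfile(fileext = ".csv")
write.csv(select_columns(df, c("a", "c")), out, row.names = FALSE)
written <- read.csv(out)
print(names(written))

m <- matrix(1:6, ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
print(colnames(select_columns(m, c("a", "c"))))
```

Validating the requested names up front keeps the error message the same across input types, instead of surfacing whatever each subsetting method happens to report.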
The PR includes:
- Unit tests for all supported input types
- Benchmarks showing reduced memory use
- Documentation + NEWS entry
The maintainers and I have been refining the approach, so this is still open.
4. Selective Column Drops in CJ() (#5061)
This one was more of a conceptual discussion. The idea: allow CJ() (cross join) to drop or select certain columns without forcing the user to do a full subset afterward.
We explored possible approaches, trade-offs, and whether it should be part of CJ() or handled via downstream subsetting. While we didn’t finalize an implementation, it gave me a better grasp of where certain logic belongs in the codebase vs. user land.
Wrapping Up Weeks 7–9
These three weeks might not look flashy in terms of merged PRs, but they were packed with discussions, breaking-change considerations, and one feature PR that’s already showing real promise.
For me, some of the most valuable progress happens in discussions where I question existing behaviors, weigh backward compatibility against usability, and refine the smallest details to make sure a change is both correct and worth shipping.
— Mukul