7th-9th Week of GSoC | R-GSoC 2025 blog

After the mid-term evaluation, things slowed down a bit in terms of merged PRs — but not in terms of learning and contribution. These weeks were heavy on technical discussions, design exploration, and one major feature PR that’s still in progress. Here’s what I worked on:

1. Variable Scoping in `by` with Character Vectors (#3178)

This discussion started from an observation:

v5 = "z"
data.table(v1 = 1)[, .N, by = "v5"]

Currently returns:

	v5 N
1:   z 1

even though “v5” isn’t a column. The string “v5” is resolved as the variable v5 from the calling scope, which can cause subtle bugs if a column with the same name is added later.

We explored whether we should:

Keep this behavior for backward compatibility
Add a warning for future releases to give downstream packages time to adapt
Extend the warning to cases where symbols in .(...) also unexpectedly scope to parent variables

We concluded that the warnings are worth implementing, but they’ll likely land in a future major release to avoid surprises.
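Until such a warning exists, calling code can guard against the fallback itself. The following is a minimal sketch, not data.table behavior: `count_by_cols()` is a hypothetical helper, made up for illustration, that refuses to group when a name in a character `by` is not a column.

```r
library(data.table)

# Hypothetical helper (not part of data.table): refuse to group when a
# character `by` contains a name that is not a column of DT.
count_by_cols <- function(DT, by_cols) {
  stopifnot(is.data.table(DT), is.character(by_cols))
  unknown <- setdiff(by_cols, names(DT))
  if (length(unknown) > 0L) {
    stop("not a column of DT: ", paste(unknown, collapse = ", "))
  }
  DT[, .N, by = by_cols]
}

DT <- data.table(v1 = c(1, 1, 2))
count_by_cols(DT, "v1")      # groups by the real column v1
# count_by_cols(DT, "v5")    # stops instead of looking up v5 elsewhere
```

The check costs one `setdiff()` per call, which seems a fair price for not depending on whatever happens to be defined in the calling scope.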

2. Auto-Naming in `shift()` with `give.names = TRUE` (#3905)

The current shift() output names look like:

V1_lead_1  V1_lag_1  V2_lead_1  V2_lag_1

But when you pass named inputs like .(latitude, longitude), it feels more natural to expect:

latitude_lead_1  latitude_lag_1  longitude_lead_1  longitude_lag_1

We discussed a generic fix:

Preprocess calls like .(x, y) into .(x = x, y = y) early in evaluation
This would benefit shift() and other functions without per-function hacks

I suggested implementing this in replace_dot_alias(), but maintainers decided to hold off, as it would be a breaking change late in the release cycle.
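To make the preprocessing idea concrete, here is a rough base-R sketch of the rewrite on an unevaluated call. It is only an illustration under my own assumptions: `name_dot_args()` is a hypothetical function, and this is not how replace_dot_alias() is written.

```r
# Hypothetical helper: give every unnamed bare-symbol argument of a call
# its own name, so .(x, y) becomes .(x = x, y = y).
name_dot_args <- function(call) {
  stopifnot(is.call(call))
  args <- as.list(call)[-1L]
  nms <- names(args)
  if (is.null(nms)) nms <- rep("", length(args))
  for (i in seq_along(args)) {
    # Only bare symbols get auto-named; expressions such as x + 1 are left alone.
    if (!nzchar(nms[i]) && is.name(args[[i]])) {
      nms[i] <- as.character(args[[i]])
    }
  }
  names(args) <- nms
  as.call(c(list(call[[1L]]), args))
}

named <- name_dot_args(quote(.(latitude, longitude)))
names(named)  # first entry belongs to the function position and stays empty
```

Arguments that already carry a name keep it, so a call like .(lat = latitude, longitude) would only gain a name for the second argument.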

3. `fwrite()` Gains a `select` Parameter: In Progress (#4177 / #7236)

This was the most hands-on work of the period. The issue was:

fwrite(DT[, .(a, c)])  # creates a full in-memory copy of the subset

On huge datasets, that’s expensive or impossible.

I proposed adding a select parameter so you can write specific columns without creating a temporary object:

fwrite(DT, "file.csv", select = c("a", "c"))

For data.table inputs, the implementation uses .shallow() to create a shallow copy referencing only the selected columns — no data duplication. For other inputs (data.frame, list, matrix), we subset directly.
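Since `select` is not in a released fwrite() yet, the idea can be approximated in user code today. The sketch below is not the PR's implementation (which uses .shallow()); `fwrite_select()` is a hypothetical wrapper that picks the columns from the underlying list, so only the list of column references is rebuilt and the column vectors themselves are reused. It assumes a list-like input (data.table, data.frame, or list) and character column names.

```r
library(data.table)

# Hypothetical wrapper (not the PR's code): select columns as a plain list
# and hand that list to fwrite(), avoiding DT[, .(a, c)].
fwrite_select <- function(x, file, select, ...) {
  stopifnot(is.list(x), is.character(select), all(select %in% names(x)))
  cols <- unclass(x)[select]  # plain named list that reuses the column vectors
  fwrite(cols, file, ...)
}

DT <- data.table(a = 1:3, b = letters[1:3], c = c(1.5, 2.5, 3.5))
out_file <- tempfile(fileext = ".csv")
fwrite_select(DT, out_file, select = c("a", "c"))
```

Matrix inputs are left out of this sketch on purpose, since they are not lists and would need their own subsetting path.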

The PR includes:

Unit tests for all supported input types
Benchmarks showing reduced memory use
Documentation + NEWS entry

The maintainers and I have been refining the approach, so this is still open.

4. Selective Column Drops in `CJ()` (#5061)

This one was more of a conceptual discussion. The idea: allow CJ() (cross join) to drop or select certain columns without forcing the user to do a full subset afterward.

We explored possible approaches, their trade-offs, and whether it should be part of CJ() or handled via downstream subsetting. While we didn’t finalize an implementation, it gave me a better grasp of which logic belongs in the codebase and which belongs in user land.
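For reference, the downstream-subsetting route looks roughly like this today. The column names are made up for illustration, and this is only my reading of the user-land option, not a proposed API.

```r
library(data.table)

# Build the full cross join first: 2 * 2 * 2 = 8 rows, three columns.
grid <- CJ(x = 1:2, y = c("a", "b"), z = c(TRUE, FALSE))

# Then drop the unwanted column afterward, in user land.
slim <- copy(grid)[, z := NULL]
names(slim)
```

The full grid is materialized before anything is dropped, which is exactly the extra step the discussion was about.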

Wrapping Up Weeks 7–9

These three weeks might not look flashy in terms of merged PRs, but they were packed with discussions, breaking-change considerations, and one feature PR that’s already showing real promise.

For me, some of the most valuable progress happens in discussions where I question existing behaviors, weigh backward compatibility against usability, and refine the smallest details to make sure a change is both correct and worth shipping.

— Mukul

Tags: R GSoC Google data.table
