10th–11th Weeks of GSoC

By Mukul Kumar

Weeks 10–11 Progress

As the project nears its conclusion, I focused on two main tracks: refining my earlier fwrite(select=) PR and contributing consistency fixes in melt() and dcast() as part of Issue #6629.

1. Continued Work on fwrite(select=) (PR #7236 — Open)

In previous weeks, I introduced a new select argument to fwrite() that allows writing only specific columns without creating temporary objects. This avoids copying the selected columns into an intermediate data.table before writing, which matters most for large datasets.
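To make the motivation concrete, here is a minimal sketch of the pattern that select is meant to replace: subset first, then write. The toy table and column names are illustrative assumptions, and the proposed call appears only as a comment because PR #7236 is still open and its final interface may differ.

```r
library(data.table)

# Toy data; the column names are made up for illustration.
DT = data.table(id = 1:3, x = c(0.5, 1.5, 2.5), y = letters[1:3])
cols = c("id", "y")
out = tempfile(fileext = ".csv")

# Current approach: DT[, ..cols] materialises a temporary copy of the
# selected columns, which is then passed to fwrite().
fwrite(DT[, ..cols], out)

# Proposed in PR #7236 (open; shown as an assumption, not a released API):
# fwrite(DT, out, select = cols)

# Read the file back to confirm that only the chosen columns were written.
written = fread(out)
print(names(written))
```

On a small table the temporary copy is negligible; the saving is expected to matter when the selected columns are themselves large.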

During Weeks 10–11, I refined this PR based on performance discussions with maintainers. Using atime::atime_versions, we benchmarked fwrite() across different scenarios (varying rows vs. varying columns) on both Windows and Ubuntu.

Key findings:

  • When select is provided as character names, name lookup costs O(cols), so time and memory grow linearly with the number of columns, which becomes noticeable when the table is very wide.
  • Passing numeric indices avoids this overhead, enabling near-constant time and memory in such cases.
  • A fix was proposed to bypass unnecessary shallow views and name resolution for numeric selections, reducing overhead.
  • Platform-specific differences (e.g., Windows allocator thresholds) were observed and explained as expected runtime behavior rather than correctness issues.
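The difference between the first two findings can be illustrated with a small sketch. It does not reproduce fwrite() internals; it only shows that character names must first be matched against the full vector of column names, while integer indices can be used as they are. The column names and sizes are made up for illustration.

```r
library(data.table)

# A hypothetical very wide table, represented here by its column names only.
col_names = paste0("V", seq_len(1e5))

# Character selection: each name has to be resolved against all column names.
wanted = c("V10", "V99999")
idx_from_names = chmatch(wanted, col_names)

# Numeric selection: the indices are already known, so no lookup is needed.
idx_direct = c(10L, 99999L)

# Both routes identify the same columns.
print(identical(idx_from_names, idx_direct))
```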

The PR remains open, with ongoing refinements around edge cases and the benchmarking harness.

2. Breaking Change in dcast() — PR #7260 (Merged)

Issue #6629 identified two inconsistencies in melt() and dcast(). The first concerned dcast(), where fun.aggregate must return a scalar. Previously:

  • If fill = NULL, a non-scalar return correctly raised an error.
  • If fill was non-NULL, a non-scalar return only triggered a warning, and the result was undefined.

PR #7260 makes behavior consistent:

  • fun.aggregate must always return length 1, regardless of fill.
  • The transitional warning was removed and undefined behavior eliminated.

This was merged, completing one half of Issue #6629.
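The length-1 requirement can be sketched as follows. The toy data and the deliberately invalid aggregate are assumptions for illustration. The failing call uses the default fill = NULL, which, as described above, already raised an error before the PR, so the example does not depend on having the merged change installed.

```r
library(data.table)

# Toy data; id 1 has two values in group "a", so aggregation is required.
DT = data.table(id = c(1, 1, 2), grp = c("a", "a", "b"), value = c(10, 20, 30))

# Valid: sum() returns a single value for every id/grp combination.
wide = dcast(DT, id ~ grp, value.var = "value", fun.aggregate = sum)
print(wide)

# Invalid: a hypothetical aggregate that always returns two values.
res = tryCatch(
  dcast(DT, id ~ grp, value.var = "value",
        fun.aggregate = function(x) c(0, 0)),
  error = function(e) e
)

# TRUE: the non-scalar aggregate is rejected with an error.
print(inherits(res, "error"))
```

If a summary genuinely needs several values per cell, one option is to pass several scalar functions to fun.aggregate as a list rather than one function that returns a vector.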

3. Breaking Change in melt() — PR #7257 (In Progress)

The second inconsistency was in melt(). The documentation states that when measure.vars is provided as a list, the variable column contains integer indices. However, when the list had length 1, variable held the column name as a character value instead.

PR #7257 makes this consistent:

  • For all list-based measure.vars (including length 1), variable now contains integer indices.
  • For character or integer vector measure.vars, behavior is unchanged (names are returned).
  • With variable.factor = TRUE, levels are '1', '2', … for list-based cases.
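The distinction between list-based and vector-based measure.vars can be seen in a short sketch. The table and column names are made up, and the example deliberately uses a list of length 2, whose behavior does not depend on PR #7257; the length-1 case is left out because its output depends on whether the change is installed.

```r
library(data.table)

# Toy wide table with two groups of measure columns.
DT = data.table(id = 1:2,
                a1 = c(1, 2), a2 = c(3, 4),
                b1 = c(5, 6), b2 = c(7, 8))

# List-based measure.vars: each list element becomes one value column, and
# variable records the position within each element, not a column name.
long_list = melt(DT, id.vars = "id",
                 measure.vars = list(c("a1", "a2"), c("b1", "b2")),
                 value.name = c("a", "b"))
print(levels(long_list$variable))

# Vector-based measure.vars: variable holds the original column names.
long_vec = melt(DT, id.vars = "id", measure.vars = c("a1", "a2"))
print(levels(long_vec$variable))
```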

I initially changed code outside the list path; after mentor feedback, I reverted non-list edits and followed the approach of PR #5247, ensuring only the intended branch is modified.

This PR is under review and being refined with further feedback.

Closing Thoughts

The last few weeks were less about building new features and more about refining, debugging, and aligning behavior with documentation. This phase pushed me deeper into benchmarking, performance analysis, and subtle consistency fixes, all areas I wasn’t as familiar with at the start of GSoC.

From implementing new features like fwrite(select=) to tackling long-standing issues like melt()/dcast() inconsistencies, this work has been both challenging and rewarding.

As GSoC wraps up, I feel more confident navigating large OSS codebases, contributing to design discussions, and learning from maintainer feedback. I hope to continue contributing to data.table beyond GSoC. Thanks for reading and following along!

Mukul
