Weeks 10–11 Progress
As the project nears its conclusion, I focused on two main tracks: refining my earlier fwrite(select=) PR and contributing consistency fixes in melt() and dcast() as part of Issue #6629.
1. Continued Work on fwrite(select=) (PR #7236, Open)
In previous weeks, I introduced a new select argument to fwrite() that allows writing only specific columns without creating temporary objects. For large datasets, this avoids the memory cost of materializing an intermediate subset before writing.
During Weeks 10–11, I refined this PR based on performance discussions with maintainers. Using atime::atime_versions, we benchmarked fwrite() across different scenarios (varying rows vs. varying columns) on both Windows and Ubuntu.
Key findings:
- When select is provided as character names, the cost of name lookup is O(cols), leading to linear time/memory when the number of columns is very large.
- Passing numeric indices avoids this overhead, enabling near-constant time and memory in such cases.
- A fix was proposed to bypass unnecessary shallow views and name resolution for numeric selections, reducing overhead.
- Platform-specific differences (e.g., Windows allocator thresholds) were observed and explained as expected runtime behavior rather than correctness issues.
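To make the first two findings concrete, here is a minimal sketch of the difference between character and numeric selection. The table, column names, and output file are made up for illustration, and the select= call is left commented out because it comes from PR #7236 and is not assumed to exist in a released data.table. The match() call stands in for name resolution in general; it is not a copy of what fwrite() does internally.

```r
library(data.table)

# Hypothetical wide table: 2 rows, 1000 integer columns named V1..V1000.
DT <- as.data.table(matrix(0L, nrow = 2L, ncol = 1000L))
setnames(DT, paste0("V", seq_len(ncol(DT))))

sel_names <- c("V3", "V500")

# Character selection: each name has to be resolved against names(DT),
# so the work grows with the number of columns.
sel_idx <- match(sel_names, names(DT))
stopifnot(identical(sel_idx, c(3L, 500L)))

# Released-API workaround: materialize the subset, then write it.
# This intermediate object is what select= is meant to avoid.
out <- tempfile(fileext = ".csv")
fwrite(DT[, sel_idx, with = FALSE], out)

# Proposed in PR #7236 (assumed unavailable in released versions):
# fwrite(DT, out, select = sel_idx)

# Read the header back to confirm that only the chosen columns were written.
header <- names(fread(out))
stopifnot(identical(header, sel_names))
```

With integer positions already in hand, no scan over the column names is needed, which is the case the proposed fix targets.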
The PR remains open, with ongoing refinements around edge cases and the benchmarking harness.
2. Breaking Change in dcast(): PR #7260 (Merged)
Issue #6629 identified two inconsistencies in melt() and dcast(). The first concerned dcast(), where fun.aggregate must return a scalar. Previously:
- If fill = NULL, non-scalar returns correctly errored.
- If fill was non-NULL, it only warned but still produced undefined results.
PR #7260 makes behavior consistent:
- fun.aggregate must always return length 1, regardless of fill.
- The transitional warning was removed and undefined behavior eliminated.
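As a small illustration of the scalar requirement, the sketch below uses a made-up three-row table. The aggregator sum() returns one value per cell, so it is valid with or without fill. The non-scalar call is left commented out, since its outcome (warning or error) depends on whether the installed data.table includes PR #7260.

```r
library(data.table)

# Hypothetical input: two rows fall into the (id = 1, grp = "a") cell.
DT <- data.table(
  id    = c(1L, 1L, 2L),
  grp   = c("a", "a", "b"),
  value = c(10, 20, 30)
)

# sum() returns exactly one value per (id, grp) cell, so it is a valid aggregator.
# Cells with no rows are filled with 0.
wide <- dcast(DT, id ~ grp, value.var = "value", fun.aggregate = sum, fill = 0)
print(wide)

stopifnot(identical(names(wide), c("id", "a", "b")))
stopifnot(all.equal(wide$a, c(30, 0)))
stopifnot(all.equal(wide$b, c(0, 30)))

# range() returns two values per cell. With PR #7260 this is an error
# regardless of fill; older versions only warned when fill was non-NULL.
# dcast(DT, id ~ grp, value.var = "value", fun.aggregate = range, fill = 0)
```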
This was merged, completing one half of Issue #6629.
3. Breaking Change in melt(): PR #7257 (In Progress)
The second inconsistency was with melt(). Docs stated that when measure.vars is provided as a list, the variable column should contain integer indices. However, when the list length was 1, it returned a character name instead.
PR #7257 makes this consistent:
- For all list-based measure.vars (including length 1), variable now contains integer indices.
- For character or integer vector measure.vars, behavior is unchanged (names are returned).
- With variable.factor = TRUE, levels are '1', '2', … for list-based cases.
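The list-based case can be sketched with a made-up table holding two groups of measure columns. Only the two-element list is checked here, since that behavior is the documented one. The length-1 call is printed without any assertion, because its result depends on whether the installed version includes PR #7257.

```r
library(data.table)

# Hypothetical wide table with two measure groups: a1/a2 and b1/b2.
DT <- data.table(
  id = 1:2,
  a1 = c(1, 2), a2 = c(3, 4),
  b1 = c(5, 6), b2 = c(7, 8)
)

# List-based measure.vars with two groups: variable holds the position
# within each group (as factor levels "1", "2"), not the column names.
long <- melt(
  DT,
  id.vars      = "id",
  measure.vars = list(c("a1", "a2"), c("b1", "b2")),
  value.name   = c("a", "b")
)
print(long)

stopifnot(nrow(long) == 4L)
stopifnot(identical(levels(long$variable), c("1", "2")))
stopifnot(all.equal(long$a, c(1, 2, 3, 4)))

# Length-1 list: the case addressed by PR #7257. Versions without the PR
# return column names in variable; no result is asserted here.
long1 <- melt(DT, id.vars = "id", measure.vars = list(c("a1", "a2")))
print(long1)
```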
I initially changed code outside the list path; after mentor feedback, I reverted non-list edits and followed the approach of PR #5247, ensuring only the intended branch is modified.
This PR is under review and being refined with further feedback.
Closing Thoughts
The last few weeks were less about building new features and more about refining, debugging, and aligning behavior with documentation. This phase pushed me deeper into benchmarking, performance analysis, and subtle consistency fixes, all areas I wasn’t as familiar with at the start of GSoC.
From implementing new features like fwrite(select=) to tackling long-standing issues like melt()/dcast() inconsistencies, this work has been both challenging and rewarding.
As GSoC wraps up, I feel more confident navigating large OSS codebases, contributing to design discussions, and learning from maintainer feedback. I hope to continue contributing to data.table beyond GSoC.
Thanks for reading and following along!
Mukul