Streaming CSV processor with type coercion and validation
Add to your Gemfile:
gem "philiprehberger-csv_kit"
Or install directly:
gem install philiprehberger-csv_kit
require "philiprehberger/csv_kit"
rows = Philiprehberger::CsvKit.to_hashes("data.csv")
# => [{name: "Alice", age: "30"}, ...]
names = Philiprehberger::CsvKit.pluck("data.csv", :name, :city)
# => [{name: "Alice", city: "Berlin"}, ...]
Philiprehberger::CsvKit.headers("data.csv")
# => [:name, :age, :city]
Philiprehberger::CsvKit.count("data.csv")
# => 1000
Iterate rows with constant memory. Returns an Enumerator if no block is given:
Philiprehberger::CsvKit.each_hash("large.csv") do |row|
  puts row[:name]
end
# Or compose with Enumerator methods:
adults = Philiprehberger::CsvKit.each_hash("data.csv")
  .select { |r| r[:age].to_i >= 18 }
  .first(10)
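The block-or-Enumerator behaviour is a common Ruby idiom. The following is a minimal sketch of how such a method could be built on the standard `csv` library. `StreamSketch.each_hash` is a hypothetical name, not part of csv_kit, and the gem's actual implementation may differ.

```ruby
require "csv"
require "stringio"

# Hypothetical helper (not the gem's code): streams rows as symbol-keyed hashes.
module StreamSketch
  def self.each_hash(io)
    # Without a block, return an Enumerator that calls this method on demand.
    return enum_for(:each_hash, io) unless block_given?

    CSV.new(io, headers: true, header_converters: :symbol).each do |row|
      yield row.to_h # one parsed row is handed to the caller at a time
    end
  end
end

io = StringIO.new("name,age\nAlice,30\nBob,17\nCarol,45\n")
adults = StreamSketch.each_hash(io).select { |r| r[:age].to_i >= 18 }
```

Because the Enumerator only pulls rows as they are consumed, chaining `select` or `first` does not require loading the whole file first.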
Return n randomly sampled rows in a single pass using reservoir sampling (Algorithm R, as described by Knuth), keeping O(n) rows in memory. If the file has fewer than n rows, all rows are returned:
rows = Philiprehberger::CsvKit.sample("large.csv", 100)
# => [{name: "Alice", age: "30"}, ...]
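For readers unfamiliar with Algorithm R, here is a minimal sketch over any enumerable. `reservoir_sample` and its `random:` parameter are illustrative names, not the gem's API.

```ruby
# Reservoir sampling (Algorithm R): a single pass that keeps at most n items.
# Illustrative only; the gem's own implementation is not shown in this README.
def reservoir_sample(rows, n, random: Random.new)
  reservoir = []
  rows.each_with_index do |row, i|
    if i < n
      reservoir << row # the first n rows fill the reservoir
    else
      j = random.rand(i + 1) # uniform integer in 0..i
      reservoir[j] = row if j < n # replace a slot with probability n / (i + 1)
    end
  end
  reservoir
end

picked = reservoir_sample(1..1000, 5, random: Random.new(42))
small = reservoir_sample(%w[a b c], 10) # fewer than n items: all are returned
```

Memory stays proportional to n because only the reservoir is retained, whatever the length of the input.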
Return the first row that matches a predicate, streaming and stopping on the first hit:
user = Philiprehberger::CsvKit.find("users.csv") { |row| row[:email] == "a@b.com" }
# => {email: "a@b.com", name: "Alice"} or nil
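The early exit can be illustrated with the standard `csv` library alone, since `CSV` includes `Enumerable` and `Enumerable#find` stops at the first truthy block result. This is a sketch of the technique, not the gem's code, and the `rows_read` counter exists only to make the short-circuit visible.

```ruby
require "csv"
require "stringio"

io = StringIO.new("email,name\na@b.com,Alice\nc@d.com,Bob\ne@f.com,Carol\n")
csv = CSV.new(io, headers: true, header_converters: :symbol)

rows_read = 0
match = csv.find do |row|
  rows_read += 1 # counts how many rows the predicate actually saw
  row[:email] == "c@d.com"
end

user = match&.to_h # nil when no row matches
```

Here the predicate runs for two rows only; the third row is never yielded.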
csv_string = Philiprehberger::CsvKit.filter("data.csv") do |row|
  row[:age].to_i >= 30
end
rows = Philiprehberger::CsvKit.process("data.csv") do |p|
  p.transform(:age) { |v| v.to_i }
  p.validate(:age) { |v| v.to_i.positive? }
  p.reject { |row| row[:city] == "Unknown" }
  p.each { |row| puts row[:name] }
end
Fill nil or empty-string cells with a default value before any type coercion runs:
Philiprehberger::CsvKit.process("users.csv") do |p|
  p.default(:country, "US")
  p.type(:age, :integer)
end
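The ordering matters because a strict coercion would fail on an empty cell. A minimal sketch with plain hashes follows; `fill_default` and `coerce_integer` are hypothetical helpers, and the use of `Integer()` for `:integer` is an assumption rather than documented gem behaviour.

```ruby
# Hypothetical helpers showing the order: defaults first, coercion second.
def fill_default(row, key, value)
  cell = row[key]
  cell.nil? || cell == "" ? row.merge(key => value) : row
end

def coerce_integer(row, key)
  row.merge(key => Integer(row[key])) # Integer("") would raise ArgumentError
end

raw = { name: "Alice", age: "", country: nil }
row = fill_default(raw, :country, "US")
row = fill_default(row, :age, "0") # default applied before coercion
row = coerce_integer(row, :age)
```

Running the coercion before the default would raise on the empty `age` cell instead of producing `0`.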
rows = Philiprehberger::CsvKit.process("data.csv") do |p|
  p.type(:birthday, :date)
  p.type(:created_at, :datetime, format: "%Y-%m-%dT%H:%M:%S")
end
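As a rough illustration of what `:date` and `:datetime` coercion with an optional `format:` could look like using Ruby's `date` library: `coerce_temporal` is a hypothetical helper, and which parser csv_kit actually calls is not stated here.

```ruby
require "date"

# Hypothetical coercion helper: strptime when a format is given, parse otherwise.
def coerce_temporal(value, type, format: nil)
  klass = type == :datetime ? DateTime : Date
  format ? klass.strptime(value, format) : klass.parse(value)
end

birthday = coerce_temporal("1990-05-17", :date)
created_at = coerce_temporal("2024-01-02T03:04:05", :datetime,
                             format: "%Y-%m-%dT%H:%M:%S")
```

An explicit format is stricter than free-form parsing: input that does not match the pattern raises instead of being guessed at.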
rows = Philiprehberger::CsvKit.to_hashes("data.csv", dialect: :excel)
rows = Philiprehberger::CsvKit.process("data.csv", dialect: { delimiter: ";", quote: "'" }) do |p|
  p.transform(:age, &:to_i)
end
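A custom dialect hash like the one above plausibly maps onto the standard `csv` library's `col_sep` and `quote_char` options. The mapping below is an assumption made for illustration, and `csv_options` is a hypothetical helper.

```ruby
require "csv"

# Assumed mapping from the dialect keys shown above to stdlib CSV options.
DIALECT_TO_CSV = { delimiter: :col_sep, quote: :quote_char }.freeze

def csv_options(dialect)
  dialect.to_h { |key, value| [DIALECT_TO_CSV.fetch(key), value] }
end

data = "name;age\n'Smith; Jr';30\n"
opts = csv_options({ delimiter: ";", quote: "'" })
rows = CSV.parse(data, headers: true, header_converters: :symbol, **opts).map(&:to_h)
```

The quoted field keeps its embedded semicolon, which is the main reason to declare the quote character along with the delimiter.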
Inverse of to_hashes. Serialize an array of hashes to a CSV string. Headers default to the keys of the first row:
csv = Philiprehberger::CsvKit.to_csv([
  { name: "Alice", age: 30 },
  { name: "Bob", age: 25 }
])
# => "name,age\nAlice,30\nBob,25\n"
# Control column order / subset with explicit headers
Philiprehberger::CsvKit.to_csv(rows, headers: [:name])
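The header-defaulting rule can be sketched with `CSV.generate` from the standard library. `to_csv_sketch` is a hypothetical stand-in, and returning an empty string for an empty input is an assumption, not documented behaviour.

```ruby
require "csv"

# Hypothetical stand-in for to_csv: headers default to the first row's keys.
def to_csv_sketch(rows, headers: nil)
  return "" if rows.empty? # assumed behaviour for empty input

  headers ||= rows.first.keys
  CSV.generate do |csv|
    csv << headers.map(&:to_s)
    # Only the listed headers are emitted, in the order given.
    rows.each { |row| csv << headers.map { |h| row[h] } }
  end
end

people = [{ name: "Alice", age: 30 }, { name: "Bob", age: 25 }]
full = to_csv_sketch(people)
names_only = to_csv_sketch(people, headers: [:name])
```

Passing `headers:` both selects and orders columns, because every row is projected through the header list.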
writer = Philiprehberger::CsvKit::Writer.new(headers: [:name, :age])
csv_string = writer.write([{ name: "Alice", age: 30 }, { name: "Bob", age: 25 }])
File.open("output.csv", "w") do |f|
  writer.write_to([{ name: "Alice", age: 30 }], f)
end
File.open("output.csv", "w") do |f|
  Philiprehberger::CsvKit::Writer.stream(f, headers: [:name, :age]) do |w|
    w << { name: "Alice", age: 30 }
    w << { name: "Bob", age: 25 }
  end
end
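To show what incremental writing means in practice, here is a small sketch of a streaming writer built on `CSV.new(io)`. `StreamWriterSketch` is a hypothetical class written for this example and is not the gem's `Writer`.

```ruby
require "csv"
require "stringio"

# Hypothetical streaming writer: each row is written to the IO as it arrives.
class StreamWriterSketch
  def self.stream(io, headers:)
    writer = new(CSV.new(io), headers)
    writer.write_header
    yield writer
    io
  end

  def initialize(csv, headers)
    @csv = csv
    @headers = headers
  end

  def write_header
    @csv << @headers.map(&:to_s)
  end

  # Project the hash through the header list so column order stays fixed.
  def <<(row)
    @csv << @headers.map { |h| row[h] }
    self
  end
end

out = StringIO.new
StreamWriterSketch.stream(out, headers: [:name, :age]) do |w|
  w << { name: "Alice", age: 30 }
  w << { name: "Bob", age: 25 }
end
```

Since no array of rows is built up, the writer's memory use does not grow with the number of rows written.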
rows = Philiprehberger::CsvKit.process("data.csv") do |p|
  p.on_error { |row, err| :skip }
  p.transform(:age) { |v| Integer(v) }
end
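The `:skip` / `:abort` contract can be sketched as a per-row rescue whose handler decides whether to continue. `process_with_handler` is a hypothetical function, and the exact semantics inside csv_kit may differ.

```ruby
# Hypothetical dispatcher: the handler's return value decides what happens next.
def process_with_handler(rows, handler)
  results = []
  errors = []
  rows.each do |row|
    results << yield(row)
  rescue StandardError => e
    errors << e
    break if handler.call(row, e) == :abort # :skip simply moves to the next row
  end
  [results, errors]
end

rows = [{ age: "30" }, { age: "n/a" }, { age: "25" }]
to_int = ->(row) { row.merge(age: Integer(row[:age])) }

skipped, skip_errors = process_with_handler(rows, ->(_row, _err) { :skip }, &to_int)
aborted, abort_errors = process_with_handler(rows, ->(_row, _err) { :abort }, &to_int)
```

With `:skip` the bad row is dropped and processing continues; with `:abort` everything after the first failure is left unprocessed.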
rows = Philiprehberger::CsvKit.process("data.csv") do |p|
  p.skip(10) # skip first 10 rows
  p.limit(50) # stop after 50 rows
end
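Skip-then-limit over a stream behaves like `drop` followed by `first` on a lazy enumerator. The sketch below assumes the limit is counted after the skipped rows, which this README does not state explicitly; the `generated` counter is only there to show that reading stops early.

```ruby
generated = 0
rows = (1..1_000).lazy.map do |i|
  generated += 1 # counts how many rows were actually produced
  { id: i }
end

# Drop the first 10 rows, then take the next 50 and stop reading.
page = rows.drop(10).first(50)
```

Only 60 of the 1,000 source rows are produced, so the cost depends on `skip + limit` rather than on the size of the input.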
rows = Philiprehberger::CsvKit.process("data.csv") do |p|
  p.rename(:raw_col, :clean_col)
end
delimiter = Philiprehberger::CsvKit::Detector.detect("data.tsv")
# => "\t"
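One simple way to detect a delimiter is to count candidate separators on the first few lines and prefer one that appears consistently. The heuristic below is an illustration only: `detect_delimiter`, the candidate list, and the scoring are assumptions, and the gem's `Detector` may work differently.

```ruby
DELIMITER_CANDIDATES = [",", ";", "\t", "|"].freeze

# Hypothetical heuristic: pick the separator with a consistent, non-zero count
# on every sampled line. Falls back to "," when nothing qualifies.
def detect_delimiter(text, sample_lines: 5)
  lines = text.each_line.first(sample_lines)
  DELIMITER_CANDIDATES.max_by do |sep|
    counts = lines.map { |line| line.count(sep) }
    counts.min.to_i.positive? && counts.uniq.size == 1 ? counts.first : 0
  end
end

tab = detect_delimiter("name\tage\nAlice\t30\n")
semi = detect_delimiter("a;b;c\n1;2;3\n")
```

A counting heuristic like this ignores quoting, so a quoted field containing the separator can skew the result; a production detector would need to account for that.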
require 'philiprehberger/csv_kit'
# users.csv:
# name,age
# Alice,30
# Bob,25
Philiprehberger::CsvKit.transpose('users.csv')
# => { name: ['Alice', 'Bob'], age: ['30', '25'] }
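The row-to-column conversion can be sketched with plain hashes. `transpose_rows` is a hypothetical helper operating on already-parsed rows, not the gem's method.

```ruby
# Hypothetical helper: turn an array of row hashes into a hash of columns.
def transpose_rows(rows)
  columns = {}
  rows.each do |row|
    row.each { |key, value| (columns[key] ||= []) << value }
  end
  columns
end

rows = [{ name: "Alice", age: "30" }, { name: "Bob", age: "25" }]
columns = transpose_rows(rows)
```

Unlike the streaming helpers, a column-oriented result has to hold every value, so memory grows with the size of the file.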
| Method / Class | Description |
|---|---|
| `CsvKit.to_hashes(path_or_io, dialect:)` | Load CSV into array of symbolized hashes |
| `CsvKit.transpose(path_or_io, dialect:)` | Returns a column-oriented hash mapping each header to its column of values |
| `CsvKit.to_csv(rows, headers:, dialect:)` | Serialize an array of hashes to a CSV string |
| `CsvKit.sample(path_or_io, n, dialect:)` | Return n randomly sampled rows using reservoir sampling (Algorithm R) |
| `CsvKit.pluck(path_or_io, *keys, dialect:)` | Extract specific columns |
| `CsvKit.filter(path_or_io, dialect:, &block)` | Filter rows, return CSV string |
| `CsvKit.find(path_or_io, dialect:, &block)` | Return the first row matching the predicate, or nil |
| `CsvKit.headers(path_or_io, dialect:)` | Return header row as array of symbols |
| `CsvKit.count(path_or_io, dialect:)` | Count data rows without loading into memory |
| `CsvKit.each_hash(path_or_io, dialect:, &block)` | Stream rows as symbolized hashes; returns Enumerator if no block |
| `CsvKit.process(path_or_io, dialect:, &block)` | Streaming DSL with transforms and validations |
| `Processor#headers(*names)` | Override header names |
| `Processor#transform(key, &block)` | Register column transform |
| `Processor#type(key, type, **opts)` | Register built-in type coercion (`:integer`, `:float`, `:string`, `:date`, `:datetime`) |
| `Processor#default(key, value)` | Fill nil or empty cells at key with value (runs before type coercion) |
| `Processor#validate(key, &block)` | Register column validation (skip invalid) |
| `Processor#skip(n)` | Skip the first N data rows |
| `Processor#limit(n)` | Stop after processing N rows |
| `Processor#reject(&block)` | Reject rows matching predicate |
| `Processor#each(&block)` | Callback for each processed row |
| `Processor#on_error(&block)` | Per-row error handler (return `:skip` or `:abort`) |
| `Processor#max_errors(n)` | Stop after N errors |
| `Processor#errors` | Collected errors from last run |
| `Processor#rename(from, to)` | Rename column during processing |
| `Processor#after_each(&block)` | Callback after each row is fully processed |
| `Writer.new(headers:)` | Create a CSV writer with given headers |
| `Writer#write(rows)` | Generate CSV string from rows |
| `Writer#write_to(rows, io)` | Write CSV to an IO object |
| `Writer.stream(io, headers:, dialect:)` | Stream CSV rows incrementally to an IO |
| `Dialect.new(name_or_hash)` | Create a dialect from preset or custom hash |
| `Detector.detect(path_or_io)` | Auto-detect CSV delimiter |
| `Row#[](key)` | Access value by symbol key |
| `Row#keys` | Column names as array of symbols |
| `Row#values` | Column values as array |
| `Row#size` | Number of columns |
| `Row#each { \|k, v\| }` | Iterate key-value pairs (Enumerable) |
| `Row#merge(other)` | Return new Row with merged data |
| `Row#to_h` | Convert row to plain hash |
bundle install
bundle exec rspec
bundle exec rubocop