Module nimdata

NimData's core data type is a generic DataFrame[T]. The methods of a data frame can be categorized into generalizations of the Map/Reduce concept:

  • Transformations: Operations like map or filter transform one data frame into another. Transformations are lazy and can be chained. They will only be executed once an action is called.
  • Actions: Operations like count, min, max, sum, reduce, fold, collect, or show perform an aggregation of a data frame, and trigger the processing pipeline.
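
The difference between the two categories can be sketched as follows. This is a minimal example, assuming the nimdata package is installed and compiles with the Nim version in use; the variable names and values are purely illustrative.

```nim
import nimdata

# Transformations: these calls only describe the pipeline, nothing runs yet.
let numbers = DF.fromRange(0, 10)
let evens = numbers.filter(proc (x: int): bool = x mod 2 == 0)
let squares = evens.map(proc (x: int): int = x * x)

# Actions: each call below runs the pipeline and returns a result.
let collected = squares.collect()   # seq[int]
let total = squares.sum()           # int

echo collected
echo total
```

Because no result is stored between actions, the two actions above each run the filter and map steps again.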

NimData is structured into several submodules. This main module re-exports some of their symbols for convenience, so that import nimdata is sufficient in most cases.

Types

DataFrame[T] = ref object of RootObj
DataFrameContext = object
FileType = enum
  Auto, RawText, GZip

Lets

DF = DataFrameContext()
Currently this value is used purely for scoping, making it possible to write expressions like DF.fromFile(...) or DF.fromSeq(...). Eventually it might be used to store general context configuration.

Procs

proc map[U, T](df: DataFrame[U]; f: proc (x: U): T): DataFrame[T]
Transforms a DataFrame[U] into a DataFrame[T] by applying a mapping function f.
proc mapWithIndex[U, T](df: DataFrame[U]; f: proc (i: int; x: U): T): DataFrame[T]
Transforms a DataFrame[U] into a DataFrame[T] by applying a mapping function f, which receives the index i of each element in addition to the element itself.
proc filter[T](df: DataFrame[T]; f: proc (x: T): bool): DataFrame[T]
Filters a data frame by applying a filter function f.
proc filterWithIndex[T](df: DataFrame[T]; f: proc (i: int; x: T): bool): DataFrame[T]
Filters a data frame by applying a filter function f, which receives the index i of each element in addition to the element itself.
proc take[T](df: DataFrame[T]; n: int): DataFrame[T]
Selects the first n rows of a data frame.
proc drop[T](df: DataFrame[T]; n: int): DataFrame[T]
Discards the first n rows of a data frame.
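
As a small sketch of how take and drop combine, the following selects a window of rows from a range. It assumes nimdata is installed; the concrete numbers are illustrative only.

```nim
import nimdata

let numbers = DF.fromRange(0, 10)          # 0, 1, ..., 9

# Discard the first two rows, then keep the next three.
let window = numbers.drop(2).take(3).collect()

echo window
```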
proc sample[T](df: DataFrame[T]; probability: float): DataFrame[T]
Filters a data frame by applying Bernoulli sampling with the specified sampling probability.
proc flatMap[U, T](df: DataFrame[U]; f: proc (x: U): seq[T]): DataFrame[T]
Transforms a DataFrame[U] into a DataFrame[T] by applying f to each element of the input data frame, and inserting the elements of the output seq[T] into the result data frame.
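
A minimal sketch of the seq-based flatMap, assuming nimdata is installed. The mapping function here is a made-up example that emits two output elements per input element.

```nim
import nimdata

let df = DF.fromSeq(@[1, 2, 3])

# Each input element x produces the two elements x and x * 10,
# which are inserted one after the other into the result data frame.
let expanded = df.flatMap(proc (x: int): seq[int] = @[x, x * 10]).collect()

echo expanded
```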
proc flatMap[U, T](df: DataFrame[U]; fIter: proc (x: U): (iterator (): T)): DataFrame[T]
Transforms a DataFrame[U] into a DataFrame[T] by applying fIter to each element of the input data frame, and inserting the elements yielded by the returned iterator into the result data frame.
proc unique[T](df: DataFrame[T]): DataFrame[T]
Returns a data frame consisting of the unique values of the input data frame. Note that the memory requirement is linear in the number of unique values, so use with care. Type T must provide a hash function with signature hash(x: T): Hash (see hashes documentation).
proc valueCounts[T](df: DataFrame[T]): DataFrame[tuple[key: T, count: int]]
Returns a data frame consisting of the unique values and their respective counts. Thus, the element type of the resulting data frame is a tuple of (key: T, count: int). Note that the memory requirement is linear in the number of unique values, so use with care. Type T must provide a hash function with signature hash(x: T): Hash (see hashes documentation).
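
The following sketch shows unique and valueCounts on a small data frame of strings, for which the standard library already provides a hash function. It assumes nimdata is installed; the input values are illustrative.

```nim
import nimdata

let letters = DF.fromSeq(@["a", "b", "a", "c", "a", "b"])

# unique keeps one copy of each distinct value.
let distinctCount = letters.unique().count()

# valueCounts yields (key, count) tuples. The order of the rows is not
# stated in the description above, so this sketch does not rely on it.
let counts = letters.valueCounts().collect()
for row in counts:
  echo row.key, ": ", row.count
```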
proc sort[T, U](df: DataFrame[T]; f: proc (x: T): U;
              order: SortOrder = SortOrder.Ascending): DataFrame[T]
Returns a sorted data frame, where f defines the sort key. Note: The current implementation does not yet spill to disk, so the data frame must fit into memory.
proc sort[T](df: DataFrame[T]; order: SortOrder = SortOrder.Ascending): DataFrame[T]
Returns a sorted data frame. Note: The current implementation does not yet spill to disk, so the data frame must fit into memory.
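
Both sort variants can be sketched as follows. The example assumes nimdata is installed and that SortOrder is the enum from the standard algorithm module, which is imported explicitly here in case nimdata does not re-export it.

```nim
import algorithm   # assumed source of SortOrder
import nimdata

let numbers = DF.fromSeq(@[3, 1, 2])

# Sort by the elements themselves, in both directions.
let ascending = numbers.sort().collect()
let descending = numbers.sort(order = SortOrder.Descending).collect()

# Sort by a key function, here the string length.
let words = DF.fromSeq(@["ccc", "a", "bb"])
let byLength = words.sort(proc (s: string): int = s.len).collect()

echo ascending
echo descending
echo byLength
```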
proc groupBy[T, K, U](df: DataFrame[T]; keyFunc: proc (x: T): K;
                   reduceFunc: proc (key: K; df: DataFrame[T]): U): DataFrame[U]
Groups a data frame according to keyFunc and applies reduceFunc to each group.
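
As a sketch of the two callbacks, the following groups integers by parity and sums each group. It assumes nimdata is installed; the type ParitySum and the procs parityKey and sumGroup are hypothetical helpers introduced only for this example.

```nim
import nimdata

# Hypothetical result type: one row per group.
type ParitySum = tuple[even: bool, total: int]

# keyFunc: assigns each element to a group.
proc parityKey(x: int): bool = x mod 2 == 0

# reduceFunc: receives the key and a data frame holding the group's rows.
proc sumGroup(key: bool; group: DataFrame[int]): ParitySum =
  (even: key, total: group.sum())

let sums = DF.fromRange(1, 7).groupBy(parityKey, sumGroup).collect()

# The order of the groups is not assumed here.
for row in sums:
  echo row.even, ": ", row.total
```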
proc join[A, B, C](dfA: DataFrame[A]; dfB: DataFrame[B]; cmpFunc: (a: A, b: B) -> bool;
                projectFunc: (a: A, b: B) -> C): DataFrame[C]
Performs an inner join of two data frames based on the given cmpFunc. The result can be arbitrarily merged using the projectFunc. When working with named tuples, the macro mergeTuple can be used as a convenient way to merge the fields of tuple A and B. The current implementation caches dfB internally. Thus, when joining a large and a small data frame, make sure that the left (dfA) is the large one and the right (dfB) is the smaller one.
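
A minimal join sketch, assuming nimdata is installed. The tuple types and the procs sameName and toJoined are hypothetical and exist only for this example; the projection is written out by hand rather than with mergeTuple.

```nim
import nimdata

type
  Person = tuple[name: string, age: int]
  Home = tuple[name: string, city: string]
  Joined = tuple[name: string, age: int, city: string]

# cmpFunc: decides whether a pair of rows matches.
proc sameName(a: Person, b: Home): bool = a.name == b.name

# projectFunc: builds the output row from a matching pair.
proc toJoined(a: Person, b: Home): Joined =
  (name: a.name, age: a.age, city: b.city)

let personRows: seq[Person] = @[
  (name: "Bob", age: 37), (name: "Alice", age: 29), (name: "Jon", age: 22)]
let homeRows: seq[Home] = @[
  (name: "Alice", city: "Berlin"), (name: "Bob", city: "Oslo")]

# The larger data frame goes on the left, the smaller (cached) one on the right.
let joined = DF.fromSeq(personRows).join(DF.fromSeq(homeRows), sameName, toJoined).collect()

for row in joined:
  echo row.name, " ", row.age, " ", row.city
```

Since this is an inner join, the row for "Jon" has no match on the right side and does not appear in the result.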
proc count[T](df: DataFrame[T]): int
Iterates over a data frame and returns its length.
proc reduce[T](df: DataFrame[T]; f: proc (a, b: T): T): T
Applies a reduce function f to the data frame following the pattern f( ... f(f(f(x[0], x[1]), x[2]), x[3]) ...).
proc fold[U, T](df: DataFrame[U]; init: T; f: proc (a: T; b: U): T): T
Applies a fold/aggregation function f to the data frame following the pattern f( ... f(f(f(init, x[0]), x[1]), x[2]) ...).
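
The evaluation patterns of reduce and fold can be illustrated with a non-commutative operation. This sketch assumes nimdata is installed; the values are illustrative.

```nim
import nimdata

let df = DF.fromSeq(@[1, 2, 3, 4])

# reduce: ((1 - 2) - 3) - 4
let difference = df.reduce(proc (a, b: int): int = a - b)

# fold: the accumulator type (string) may differ from the element type (int).
# Evaluates as ((("" & "1") & "2") & "3") & "4".
let digits = df.fold("", proc (acc: string; x: int): string = acc & $x)

echo difference
echo digits
```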
proc cache[T](df: DataFrame[T]): DataFrame[T]
Executes all chained operations on a data frame and returns a new data frame which is cached in memory. This will speed up subsequent operations on the data frame, and is useful when you have to perform multiple operations on the same data. However, make sure that you have enough memory to cache the input data.
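
A short sketch of when cache pays off, assuming nimdata is installed. The map step stands in for any expensive transformation.

```nim
import nimdata

# Without cache, every action below would re-run the map step.
let squares = DF.fromRange(0, 1000).map(proc (x: int): int = x * x)

# cache runs the pipeline once and keeps the results in memory.
let cached = squares.cache()

# Several actions on the same cached data.
let n = cached.count()
let largest = cached.max()
let average = cached.mean()

echo n, " ", largest, " ", average
```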
proc forEach[T](df: DataFrame[T]; f: proc (x: T): void)
Applies a function f to all elements of a data frame.
proc echoGeneric[T](x: T) {.procvar.}
Convenience proc that allows writing df.forEach(echoGeneric).
proc show[T: not tuple](df: DataFrame[T]; s: Stream = newFileStream(stdout))
Prints the content of the data frame using generic to string conversion. If no stream is specified, the output is written to stdout.
proc show[T: tuple](df: DataFrame[T]; s: Stream = newFileStream(stdout))
Prints the content of the data frame in the form of an ASCII table. If no stream is specified, the output is written to stdout.
proc sum[T](df: DataFrame[T]): T
Computes the sum of a data frame of numerical type T.
proc mean[T](df: DataFrame[T]): float
Computes the mean of a data frame of numerical type T.
proc min[T](df: DataFrame[T]): T
Computes the minimum of a data frame of numerical type T.
proc max[T](df: DataFrame[T]): T
Computes the maximum of a data frame of numerical type T.
proc toCsv[T: tuple | object](df: DataFrame[T]; filename: string; sep: char = ';')
Stores the data frame in a CSV file, using sep as the field separator.
proc toHtml[T: tuple | object](df: DataFrame[T]; filename: string)
Stores the data frame in an HTML file providing a simple table view. The current implementation uses simple static HTML, so make sure that your data frame is filtered down to a reasonable size.
proc openInBrowser[T: tuple | object](df: DataFrame[T])
Opens a table view of the data frame in the default browser.
proc fromSeq[T](dfc: DataFrameContext; data: seq[T]): DataFrame[T]
Constructs a data frame from a sequence.
proc fromRange(dfc: DataFrameContext; indexFrom: int; indexUpto: int): DataFrame[int] {.raises: [], tags: [].}
Constructs a DataFrame[int] which iterates over the interval [indexFrom, indexUpto), i.e., from indexFrom (inclusive) up to indexUpto (exclusive).
proc fromRange(dfc: DataFrameContext; indexUpto: int): DataFrame[int] {.raises: [], tags: [].}
Constructs a DataFrame[int] which iterates over the interval [0, indexUpto), i.e., from 0 (inclusive) up to indexUpto (exclusive).
proc fromFile(dfc: DataFrameContext; filename: string;
              fileType: FileType = FileType.Auto; hasHeader: bool = true): DataFrame[string] {.raises: [], tags: [].}
Constructs a data frame from a file, iterating the file line by line. By default the file type is inferred from the file name, but it can also be specified explicitly.

Methods

method iter[T](df: DataFrame[T]): (iterator (): T) {.base.}
method iter[T](df: CachedDataFrame[T]): (iterator (): T)
method iter[T, U](df: MappedDataFrame[T, U]): (iterator (): U)
method iter[T, U](df: MappedIndexDataFrame[T, U]): (iterator (): U)
method iter[T](df: FilteredDataFrame[T]): (iterator (): T)
method iter[T](df: FilteredIndexDataFrame[T]): (iterator (): T)
method iter[T, U](df: FlatMappedSeqDataFrame[T, U]): (iterator (): U)
method iter[T, U](df: FlatMappedDataFrame[T, U]): (iterator (): U)
method iter[T](df: UniqueDataFrame[T]): (iterator (): T)
method iter[T](df: ValueCountsDataFrame[T]): (iterator (): tuple[key: T, count: int])
method iter[T, U](df: SortDataFrame[T, U]): (iterator (): T)
method iter[T, K, U](df: GroupByReduceDataFrame[T, K, U]): (iterator (): U)
method iter[A, B, C](df: JoinThetaDataFrame[A, B, C]): (iterator (): C)
method collect[T](df: DataFrame[T]): seq[T] {.base.}
Collects the content of a DataFrame[T] and returns it as seq[T].
method collect[T](df: CachedDataFrame[T]): seq[T]
Specialized implementation of collect for a CachedDataFrame[T].
method iter(df: RangeDataFrame): (iterator (): int) {.raises: [], tags: [].}
method iter(df: FileRowsDataFrame): (iterator (): string) {.raises: [], tags: [].}
method iter(df: FileRowsGZipDataFrame): (iterator (): string) {.raises: [], tags: [].}