sparsity package

Submodules

sparsity.indexing.get_indexers_list()
class sparsity.io.LocalFileSystem

Bases: object

open()

Open file and return a stream. Raise IOError upon failure.

file is either a text or byte string giving the name (and the path if the file isn’t in the current working directory) of the file to be opened or an integer file descriptor of the file to be wrapped. (If a file descriptor is given, it is closed when the returned I/O object is closed, unless closefd is set to False.)

mode is an optional string that specifies the mode in which the file is opened. It defaults to ‘r’ which means open for reading in text mode. Other common values are ‘w’ for writing (truncating the file if it already exists), ‘x’ for creating and writing to a new file, and ‘a’ for appending (which on some Unix systems, means that all writes append to the end of the file regardless of the current seek position). In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.) The available modes are:

Character  Meaning
‘r’        open for reading (default)
‘w’        open for writing, truncating the file first
‘x’        create a new file and open it for writing
‘a’        open for writing, appending to the end of the file if it exists
‘b’        binary mode
‘t’        text mode (default)
‘+’        open a disk file for updating (reading and writing)
‘U’        universal newline mode (deprecated)

The default mode is ‘rt’ (open for reading text). For binary random access, the mode ‘w+b’ opens and truncates the file to 0 bytes, while ‘r+b’ opens the file without truncation. The ‘x’ mode implies ‘w’ and raises a FileExistsError if the file already exists.
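The mode characters above are those of Python's built-in open(). The following is a minimal sketch that calls the built-in open() directly, not LocalFileSystem (whose exact call signature is not shown in this reference), and works in a throwaway temporary directory. The variable names are illustrative only.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")

    # 'x' creates the file and fails if it already exists.
    with open(path, "x") as fh:
        fh.write("first\n")
    try:
        open(path, "x")
    except FileExistsError:
        exclusive_failed = True
    else:
        exclusive_failed = False

    # 'a' appends to the existing content.
    with open(path, "a") as fh:
        fh.write("second\n")
    with open(path) as fh:  # the default mode is 'rt'
        appended = fh.read()

    # 'r+b' opens for binary updating without truncating ...
    with open(path, "r+b") as fh:
        kept = fh.read()
    # ... while 'w+b' truncates the file to 0 bytes first.
    with open(path, "w+b") as fh:
        truncated = fh.read()
```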

Python distinguishes between files opened in binary and text modes, even when the underlying operating system doesn’t. Files opened in binary mode (appending ‘b’ to the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when ‘t’ is appended to the mode argument), the contents of the file are returned as strings, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.

‘U’ mode is deprecated and will raise an exception in future versions of Python. It has no effect in Python 3. Use newline to control universal newlines mode.

buffering is an optional integer used to set the buffering policy. Pass 0 to switch buffering off (only allowed in binary mode), 1 to select line buffering (only usable in text mode), and an integer > 1 to indicate the size of a fixed-size chunk buffer. When no buffering argument is given, the default buffering policy works as follows:

  • Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device’s “block size” and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long.
  • “Interactive” text files (files for which isatty() returns True) use line buffering. Other text files use the policy described above for binary files.

encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent, but any encoding supported by Python can be passed. See the codecs module for the list of supported encodings.

errors is an optional string that specifies how encoding errors are to be handled—this argument should not be used in binary mode. Pass ‘strict’ to raise a ValueError exception if there is an encoding error (the default of None has the same effect), or pass ‘ignore’ to ignore errors. (Note that ignoring encoding errors can lead to data loss.) See the documentation for codecs.register or run ‘help(codecs.Codec)’ for a list of the permitted encoding error strings.

newline controls how universal newlines works (it only applies to text mode). It can be None, ‘’, ‘\n’, ‘\r’, and ‘\r\n’. It works as follows:

  • On input, if newline is None, universal newlines mode is enabled. Lines in the input can end in ‘\n’, ‘\r’, or ‘\r\n’, and these are translated into ‘\n’ before being returned to the caller. If it is ‘’, universal newline mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.
  • On output, if newline is None, any ‘\n’ characters written are translated to the system default line separator, os.linesep. If newline is ‘’ or ‘\n’, no translation takes place. If newline is any of the other legal values, any ‘\n’ characters written are translated to the given string.
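The input-side rules can be checked with a small sketch. It uses the built-in open() rather than LocalFileSystem, and writes the test file in binary mode so that no translation happens on output. File and variable names are illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "endings.txt")
    # Write raw bytes containing all three line endings.
    with open(path, "wb") as fh:
        fh.write(b"a\r\nb\rc\n")

    # newline=None (default): every ending is translated to '\n'.
    with open(path, "r", newline=None) as fh:
        translated = fh.read()

    # newline='': lines end at any of the three endings, but untranslated.
    with open(path, "r", newline="") as fh:
        untranslated = fh.readlines()

    # newline='\r\n': only '\r\n' terminates a line.
    with open(path, "r", newline="\r\n") as fh:
        crlf_only = fh.readlines()
```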

If closefd is False, the underlying file descriptor will be kept open when the file is closed. This does not work when a file name is given and must be True in that case.

A custom opener can be used by passing a callable as opener. The underlying file descriptor for the file object is then obtained by calling opener with (file, flags). opener must return an open file descriptor (passing os.open as opener results in functionality similar to passing None).

open() returns a file object whose type depends on the mode, and through which the standard file operations such as reading and writing are performed. When open() is used to open a file in a text mode (‘w’, ‘r’, ‘wt’, ‘rt’, etc.), it returns a TextIOWrapper. When used to open a file in a binary mode, the returned class varies: in read binary mode, it returns a BufferedReader; in write binary and append binary modes, it returns a BufferedWriter, and in read/write mode, it returns a BufferedRandom.

It is also possible to use a string or bytearray as a file for both reading and writing. For strings StringIO can be used like a file opened in a text mode, and for bytes a BytesIO can be used like a file opened in a binary mode.
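As a rough illustration of the return types listed above, the sketch below opens one temporary file in several modes with the built-in open() and records the class of each returned object. It then reads from in-memory StringIO and BytesIO buffers. All names are illustrative.

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "types.bin")
    returned = {}
    # 'w' comes first so that the file exists for the later modes.
    for mode in ("w", "rb", "ab", "r+b"):
        with open(path, mode) as fh:
            returned[mode] = type(fh).__name__

# In-memory stand-ins for text-mode and binary-mode files.
text_buffer = io.StringIO("in-memory text")
byte_buffer = io.BytesIO(b"in-memory bytes")
first_word = text_buffer.read(9)
first_bytes = byte_buffer.read(9)
```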

sparsity.io.path2str(arg)

Convert arg into its string representation.

This is only done if arg is a subclass of PurePath.
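The behaviour described can be sketched as follows. path2str_sketch is a hypothetical stand-in written for this example, not the library's implementation. It assumes that arguments other than PurePath instances are returned unchanged, which the description implies but does not state.

```python
from pathlib import PurePath, PurePosixPath


def path2str_sketch(arg):
    """Hypothetical stand-in: convert path objects to str, pass the rest through."""
    if isinstance(arg, PurePath):
        return str(arg)
    # Assumption: non-path arguments are returned unchanged.
    return arg


converted = path2str_sketch(PurePosixPath("data") / "frame.npz")
unchanged = path2str_sketch("s3://bucket/frame.npz")
```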

sparsity.io.read_npz(filename, storage_options=None)

Read from a npz file.

Parameters:
  • filename (str) – path to file.
  • storage_options (dict) – (optional) storage options for external filesystems.
Returns:

sf

Return type:

sp.SparseFrame

sparsity.io.to_npz(sf, filename, block_size=None, storage_options=None)

Write to npz file format.

Parameters:
  • sf (sp.SparseFrame) – sparse frame to store.
  • filename (str) – path to write to.
  • block_size (int) – block size in bytes when sending data to external filesystem. Default is 100MB.
  • storage_options (dict) – (optional) storage options for external filesystems.
Returns:

sf

Return type:

SparseFrame

class sparsity.sparse_frame.SparseFrame(data, index=None, columns=None, **kwargs)

Bases: object

Two-dimensional, size-mutable, homogeneous tabular data structure with labeled axes (rows and columns). It adds pandas indexing abilities to a compressed row sparse frame based on scipy.sparse.csr_matrix. This makes indexing along the first axis extremely efficient and cheap. Indexing along the second axis should be avoided if possible though.

For a distributed implementation see sparsity.dask.SparseFrame.
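Why the first axis is cheap and the second is not follows from the compressed sparse row layout. The sketch below uses plain Python lists named after the three arrays of scipy.sparse.csr_matrix (data, indices, indptr). The helper functions get_row and get_column are illustrative and not part of the package.

```python
# Dense equivalent of the matrix stored below:
#   [[1, 0, 2],
#    [0, 0, 3],
#    [4, 5, 0]]
data = [1, 2, 3, 4, 5]     # explicitly stored values, row by row
indices = [0, 2, 2, 0, 1]  # column of each stored value
indptr = [0, 2, 3, 5]      # row i occupies data[indptr[i]:indptr[i + 1]]


def get_row(i):
    """Row access is a single contiguous slice of the stored values."""
    start, stop = indptr[i], indptr[i + 1]
    return dict(zip(indices[start:stop], data[start:stop]))


def get_column(j):
    """Column access has to scan the stored entries of every row."""
    column = {}
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            if indices[k] == j:
                column[i] = data[k]
    return column
```

Here get_row(2) touches only two stored entries, while get_column(2) visits all five.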

add(other, how='outer', fill_value=0, **kwargs)

Aligned addition. Adds two tables by aligning them first.

Parameters:
  • other (sparsity.SparseFrame) – Another SparseFrame.
  • how (str) – How to join frames along their indexes. Default is ‘outer’ which makes the result contain labels from both frames.
  • fill_value (float) – Fill value if other frame is not exactly the same shape. For sparse data the only sensible fill value is 0. Passing any other value will result in a ValueError.
Returns:

added

Return type:

sparsity.SparseFrame
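Conceptually, aligned addition first matches rows by index label and then adds them, treating rows missing from one side as zeros. The sketch below shows this on plain dictionaries. add_sketch is a hypothetical helper, the sorted label order is an assumption, and only the ‘outer’ and ‘inner’ join types are covered.

```python
def add_sketch(left, right, how="outer"):
    """Add two tables given as {index label: {column: value}} mappings."""
    if how == "outer":
        labels = sorted(set(left) | set(right))
    elif how == "inner":
        labels = sorted(set(left) & set(right))
    else:
        raise ValueError("this sketch only covers 'outer' and 'inner'")

    result = {}
    for label in labels:
        # A row missing from one table contributes zeros (fill_value=0).
        row = dict(left.get(label, {}))
        for col, value in right.get(label, {}).items():
            row[col] = row.get(col, 0) + value
        result[label] = row
    return result


left = {"a": {0: 1}}
right = {"a": {0: 2}, "b": {1: 5}}
outer = add_sketch(left, right, how="outer")
inner = add_sketch(left, right, how="inner")
```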

assign(**kwargs)

Assign new columns.

Parameters: kwargs (dict) – Mapping from column name to values. Values must be of correct shape to be inserted successfully.
Returns: assigned
Return type: SparseFrame
axes
columns

Return column labels.

Returns: index
Return type: pd.Index
classmethod concat(tables, axis=0)

Concat a collection of SparseFrames along given axis.

Uses join internally so it might not be very efficient.

Parameters:
  • tables (list) – a list of SparseFrames.
  • axis – which axis to concatenate along.
copy(*args, deep=True, **kwargs)

Copy frame

Parameters:
  • args – are passed to the copy methods of the indices and values
  • deep (bool) – if true (default) data will be copied as well.
  • kwargs – are passed to the copy methods of the indices and values
Returns:

copy

Return type:

SparseFrame

data

Return data matrix.

Returns: data
Return type: scipy.sparse.csr_matrix
drop(labels, axis=1)

Drop label(s) from given axis.

Currently works only for columns.

Parameters:
  • labels (array-like) – labels to drop from the columns
  • axis (int) – only columns are supported at the moment.
Returns:

df

Return type:

SparseFrame

drop_duplicate_idx(**kwargs)

Drop rows with duplicated index.

Parameters: kwargs – keyword arguments are passed to pd.Index.duplicated
Returns: dropped
Return type: SparseFrame
dropna()

Drop nans from index.

fillna(value)

Replace NaN values in explicitly stored data with value.

Parameters: value (scalar) – Value to use to fill holes. value must be of the same dtype as the underlying SparseFrame’s data. If 0 is chosen, these values are eliminated from the new matrix.
Returns: filled
Return type: SparseFrame
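Only explicitly stored values can be NaN in a sparse matrix, so filling works on the data array alone. The sketch below shows this on CSR arrays held in plain lists. fillna_sketch is a hypothetical helper, and dropping all stored zeros when value is 0 is an assumption modelled on how scipy eliminates explicit zeros.

```python
import math


def fillna_sketch(data, indices, indptr, value):
    """Replace NaN among the stored values; drop stored zeros if value is 0."""
    new_data, new_indices, new_indptr = [], [], [0]
    for row in range(len(indptr) - 1):
        for k in range(indptr[row], indptr[row + 1]):
            filled = value if math.isnan(data[k]) else data[k]
            if value == 0 and filled == 0:
                continue  # a zero does not need to be stored explicitly
            new_data.append(filled)
            new_indices.append(indices[k])
        new_indptr.append(len(new_data))
    return new_data, new_indices, new_indptr


data = [1.0, float("nan"), 2.0]
indices = [0, 1, 1]
indptr = [0, 2, 3]
with_zero = fillna_sketch(data, indices, indptr, 0)
with_nine = fillna_sketch(data, indices, indptr, 9.0)
```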
groupby_agg(by=None, level=None, agg_func=None)

Aggregate data using callable.

The by and level arguments are mutually exclusive.

Parameters:
  • by (array-like, string) – grouping array or grouping column name
  • level (int) – which level from index to use if multiindex
  • agg_func (callable) – Function which will be applied to groups. Must accept a SparseFrame and needs to return a vector of shape (1, n_cols).
Returns:

sf – aggregated result

Return type:

SparseFrame

groupby_sum(by=None, level=0)

Optimized sparse groupby sum aggregation.

Simple operation using sparse matrix multiplication. Expects result to be sparse as well.

The by and level arguments are mutually exclusive.

Parameters:
  • by (np.ndarray (optional)) – Alternative index.
  • level (int) – Level of (multi-)index to group on.
Returns:

df – Grouped by and summed SparseFrame.

Return type:

sparsity.SparseFrame
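The matrix-multiplication idea can be sketched without any sparse library. If G is an indicator matrix with G[g, i] = 1 whenever row i belongs to group g, then the grouped sums are G @ X. groupby_sum_sketch below is a hypothetical pure-Python illustration of that product, with rows given as {column: value} dictionaries. It is not the package's implementation.

```python
def groupby_sum_sketch(rows, by):
    """Sum rows that share a group label, computed as the product G @ X."""
    groups = sorted(set(by))
    position = {label: g for g, label in enumerate(groups)}
    # Indicator matrix G in coordinate form: one stored 1 per input row.
    indicator = [(position[label], i) for i, label in enumerate(by)]

    summed = [dict() for _ in groups]
    for g, i in indicator:  # G @ X, touching stored values only
        for col, value in rows[i].items():
            summed[g][col] = summed[g].get(col, 0) + value
    return groups, summed


rows = [{0: 1.0}, {1: 2.0}, {0: 3.0, 2: 4.0}]
groups, summed = groupby_sum_sketch(rows, by=["a", "b", "a"])
```

Because G holds exactly one stored value per input row, the product stays sparse whenever the input is.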

head(n=1)

Return rows from the top of the table.

Parameters: n (int) – how many rows to return, default is 1
Returns: head
Return type: SparseFrame
iloc


index

Return index labels.

Returns: index
Return type: pd.Index
join(other, axis=1, how='outer', level=None)

Join two tables along their indices.

Parameters:
  • other (sparsity.SparseFrame) – another SparseFrame
  • axis (int) – along which axis to join
  • how (str) – one of ‘inner’, ‘outer’, ‘left’, ‘right’
  • level (int) – if axis is MultiIndex, join using this level
Returns:

joined

Return type:

sparsity.SparseFrame

loc


max(*args, **kwargs)

Find maximum element(s).

mean(*args, **kwargs)

Calculate mean(s).

min(*args, **kwargs)

Find minimum element(s).

multiply(other, axis='columns')

Multiply SparseFrame row-wise or column-wise.

Parameters:
  • other (array-like) – Vector of numbers to multiply columns/rows by.
  • axis (int | str) –
    • 1 or ‘columns’ to multiply column-wise (default)
    • 0 or ‘index’ to multiply row-wise
nnz()

Get the count of explicitly stored values (nonzeros).

classmethod read_npz(filename, storage_options=None)

Read from numpy npz format.

Reads the sparse frame from a npz archive. Supports reading npz archives from remote locations with GCSFS and S3FS.

Parameters:
  • filename (str) – path or uri to location
  • storage_options (dict) – further options for the underlying filesystem
Returns:

sf

Return type:

SparseFrame

reindex(labels=None, index=None, columns=None, axis=None, *args, **kwargs)

Conform SparseFrame to new index.

Missing values will be filled with zeroes.

Parameters:
  • labels (array-like) – New labels / index to conform the axis specified by ‘axis’ to.
  • index, columns – New labels / index to conform to. Preferably an Index object to avoid duplicating data.
  • axis (int) – Axis to target. Can be either (0, 1).
  • args, kwargs – Will be passed to reindex_axis.
Returns:

reindexed

Return type:

SparseFrame

reindex_axis(labels, axis=0, method=None, level=None, copy=True, limit=None, fill_value=0)

Conform SparseFrame to new index.

Missing values will be filled with zeros.

Parameters:
  • labels (array-like) – New labels / index to conform the axis specified by ‘axis’ to.
  • axis (int) – Axis to target. Can be either (0, 1).
  • method (None) – unsupported
  • level (None) – unsupported
  • copy (None) – unsupported
  • limit (None) – unsupported
  • fill_value (None) – unsupported
Returns:

reindexed

Return type:

SparseFrame

rename(columns, inplace=False)

Rename columns by applying a callable to every column name.

Parameters:
  • columns (callable) – a callable that accepts a column label and returns the new column label.
  • inplace (bool) – if true, the operation is executed in place.
Returns:

renamed

Return type:

SparseFrame | None

set_index(column=None, idx=None, level=None, inplace=False)

Set index from array, column or existing multi-index level.

Parameters:
  • column (str) – set index from existing column in data.
  • idx (pd.Index, np.array) – Set the index directly with a pandas index object or array
  • level (int) – set index from a multiindex level. useful for groupbys.
  • inplace (bool) – perform data transformation inplace
Returns:

sf – the transformed sparse frame or None if inplace was True

Return type:

sp.SparseFrame | None

sort_index()

Sort table along index.

Returns: sorted
Return type: sparsity.SparseFrame
sum(*args, **kwargs)

Sum elements.

take(idx, axis=0, **kwargs)

Return data at integer locations.

Parameters:
  • idx (array-like | int) – array of integer locations
  • axis – which axis to index
  • kwargs – not used
Returns:

indexed – reindexed sparse frame

Return type:

SparseFrame

to_npz(filename, block_size=None, storage_options=None)

Save to numpy npz format.

Parameters:
  • filename (str) – path to a local file or an S3 path starting with s3://
  • block_size (int) – block size in bytes; only has an effect when writing to remote storage. If set to None, defaults to 100MB.
  • storage_options (dict) – additional parameters to pass to FileSystem class; only useful when writing to remote storages
toarray()

Return dense np.array representation.

todense(pandas=True)

Return dense representation.

Parameters: pandas (bool) – If true, returns a pandas DataFrame (default); otherwise a numpy array is returned.
Returns: dense – dense representation
Return type: pd.DataFrame | np.ndarray
values

CSR matrix representation of the frame.

classmethod vstack(frames)

Vertical stacking given collection of SparseFrames.

sparsity.sparse_frame.sparse_one_hot(df, column=None, categories=None, dtype='f8', index_col=None, order=None, prefixes=False, ignore_cat_order_mismatch=False)

One-hot encode specified columns of a pandas.DataFrame. Returns a SparseFrame.

See the documentation of sparsity.dask.reshape.one_hot_encode().
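One-hot encoding maps naturally onto the CSR layout, since every row stores at most a single 1. The sketch below shows the idea for one column of values. one_hot_sketch is a hypothetical helper, and giving values outside categories an all-zero row is an assumption of this sketch, not documented behaviour of sparse_one_hot.

```python
def one_hot_sketch(values, categories=None):
    """Return (categories, data, indices, indptr) describing a CSR matrix."""
    if categories is None:
        categories = sorted(set(values))
    column_of = {cat: j for j, cat in enumerate(categories)}

    data, indices, indptr = [], [], [0]
    for value in values:
        if value in column_of:  # assumption: unknown values give an empty row
            data.append(1.0)
            indices.append(column_of[value])
        indptr.append(len(data))
    return categories, data, indices, indptr


cats, data, indices, indptr = one_hot_sketch(["red", "blue", "red"])
```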