bw_processing.unique_fields

Functions

`as_unique_attributes`(data[, exclude, include, raise_error])	Format `data` as unique set of attributes and values for use in `create_processed_datapackage`.
`as_unique_attributes_dataframe`(df[, exclude, include, ...])
`greedy_set_cover`(data[, exclude, raise_error])	Find unique set of attributes that uniquely identifies each element in `data`.

Module Contents

bw_processing.unique_fields.as_unique_attributes(data, exclude=None, include=None, raise_error=False)[source]

Format data as unique set of attributes and values for use in create_processed_datapackage.

Each element in data must have the attribute id, and it must be unique. However, the field “id” is not used in selecting the unique set of attributes.

If no set of attributes uniquely identifies all features, all fields are used. To raise an error in this case instead, pass raise_error=True.

data = [
    {},
]

Parameters:

data (iterable) – List of dictionaries with the same fields.
exclude (iterable) – Fields to exclude during search for uniqueness. id is always excluded.
include (iterable) – Fields to include when returning, even if not unique.

Returns:

(list of field names as strings, dictionary of data ids to values for given field names)

Raises:

InconsistentFields – Not all features provide all fields.
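
The input requirements described above (every element has a unique id, and all elements provide the same fields) can be checked before calling the function. The following is a minimal sketch, not code from bw_processing: the helper name `check_preconditions_sketch` is hypothetical, and it raises a plain `ValueError` as a stand-in for the library's own exceptions.

```python
def check_preconditions_sketch(data):
    """Illustrative check of the documented input requirements.

    Hypothetical helper; it is not part of bw_processing.
    """
    rows = list(data)

    # Every element must carry an "id" (a KeyError here means one is missing).
    ids = [row["id"] for row in rows]
    if len(set(ids)) != len(ids):
        raise ValueError("id values must be unique")

    # All elements must provide exactly the same set of fields.
    field_sets = {frozenset(row) for row in rows}
    if len(field_sets) > 1:
        raise ValueError("not all elements provide the same fields")

    return True
```

Running a check like this first turns a malformed input into an early, readable error rather than a failure partway through the search.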

bw_processing.unique_fields.as_unique_attributes_dataframe(df, exclude=None, include=None, raise_error=False)[source]

bw_processing.unique_fields.greedy_set_cover(data, exclude=None, raise_error=True)[source]

Find unique set of attributes that uniquely identifies each element in data.

Feature selection is a well-known problem, and is analogous to the set cover problem, for which there is a well-known greedy heuristic.

Example:

data = [
    {'a': 1, 'b': 2, 'c': 3},
    {'a': 2, 'b': 2, 'c': 3},
    {'a': 1, 'b': 2, 'c': 4},
]
greedy_set_cover(data)  # returns {'a', 'c'}
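
To make the heuristic concrete, here is a minimal sketch of one greedy strategy: repeatedly add the field that most increases the number of distinguishable elements. It is an illustration under stated assumptions, not the bw_processing implementation. The names `greedy_unique_fields_sketch` and `NonUniqueSketch` are hypothetical, all field values are assumed to be hashable, and every element is assumed to provide the same fields.

```python
class NonUniqueSketch(Exception):
    """Stand-in for the NonUnique error named in this reference."""


def greedy_unique_fields_sketch(data, exclude=None, raise_error=True):
    """Greedy search for fields that distinguish every element.

    Hypothetical helper; it is not the bw_processing implementation.
    """
    rows = list(data)
    excluded = set(exclude or ()) | {"id"}  # "id" is never a candidate
    candidates = sorted({key for row in rows for key in row} - excluded)
    chosen = []

    def distinct(fields):
        # Number of distinct value tuples produced by the given fields.
        return len({tuple(row[f] for f in fields) for row in rows})

    while distinct(chosen) < len(rows):
        remaining = [f for f in candidates if f not in chosen]
        # Pick the field that separates the most elements when added.
        best = max(remaining, key=lambda f: distinct(chosen + [f]), default=None)
        if best is None or distinct(chosen + [best]) == distinct(chosen):
            # No remaining field improves separation.
            if raise_error:
                raise NonUniqueSketch("fields cannot distinguish every element")
            break
        chosen.append(best)

    return set(chosen)
```

With the example data above, this sketch first picks `a` (two distinct values), then adds `c`, because `a` and `c` together separate all three elements while `a` and `b` do not. Like any greedy heuristic, it is not guaranteed to find the smallest possible set of fields.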

Parameters:

data (iterable) – List of dictionaries with the same fields.
exclude (iterable) – Fields to exclude during search for uniqueness. id is always excluded.

Returns:

Set of attributes (strings)

Raises:

NonUnique – The given fields are not enough to ensure uniqueness.

Note that NonUnique is not raised if raise_error is False.