bw_processing.datapackage
=========================

.. py:module:: bw_processing.datapackage


Classes
-------

.. autoapisummary::

   bw_processing.datapackage.Datapackage
   bw_processing.datapackage.DatapackageBase
   bw_processing.datapackage.FilteredDatapackage


Functions
---------

.. autoapisummary::

   bw_processing.datapackage.create_datapackage
   bw_processing.datapackage.load_datapackage
   bw_processing.datapackage.simple_graph


Module Contents
---------------

.. py:class:: Datapackage

   Bases: :py:obj:`DatapackageBase`

   Interface for creating, loading, and using numerical datapackages for Brightway.

   Note that there are two entry points to using this class, both separate functions: ``create_datapackage()`` and ``load_datapackage()``. Do not create an instance of the class with ``Datapackage()``, unless you like playing with danger :)

   Data packages can be stored in memory, in a directory, or in a zip file. When creating data packages for use later, don't forget to call ``.finalize_serialization()``, or the metadata won't be written and the data package won't be usable.

   Potential gotchas:

   * There is currently no way to modify a zipped data package once it is finalized.
   * Resources that are interfaces to external data sources (either in Python or other) can't be saved, but must be recreated each time a data package is used.

   .. py:method:: _add_numpy_array_resource(*, array: numpy.ndarray, name: str, matrix: str, kind: str, keep_proxy: bool = False, matrix_serialize_format_type: Optional[bw_processing.constants.MatrixSerializeFormat] = None, meta_object: Optional[str] = None, meta_type: Optional[str] = None, **kwargs) -> None

   .. py:method:: _check_length_consistency() -> None

   .. py:method:: _create(fs: Optional[fsspec.AbstractFileSystem], name: Optional[str], id_: Optional[str], metadata: Optional[dict], combinatorial: bool = False, sequential: bool = False, seed: Optional[int] = None, sum_intra_duplicates: bool = True, sum_inter_duplicates: bool = False, matrix_serialize_format_type: bw_processing.constants.MatrixSerializeFormat = MatrixSerializeFormat.NUMPY) -> None

      Start a new data package.

      All metadata elements should follow the datapackage specification.

      Licenses are specified as a list in ``metadata``. The default license is the Open Data Commons Public Domain Dedication and License v1.0.

   .. py:method:: _load(fs: fsspec.AbstractFileSystem, mmap_mode: Optional[str] = None, proxy: bool = False) -> None

   .. py:method:: _load_all(mmap_mode: Optional[str] = None, proxy: bool = False) -> None

   .. py:method:: _prepare_modifications() -> None

   .. py:method:: _prepare_name(name: str) -> str

   .. py:method:: add_csv_metadata(*, dataframe: pandas.DataFrame, valid_for: list, name: str = None, **kwargs) -> None

      Add an iterable metadata object to be stored as a CSV file.

      The purpose of storing metadata is to enable data exchange; therefore, this method assumes that data is written to disk.

      The normal use case of this method is to link integer indices from either structured or presample arrays to a set of fields that uniquely identifies each object. This allows for matching based on object attributes from computer to computer, where database ids or other computer-generated codes might not be consistent.

      Uses pandas to store and load data; therefore, metadata must already be a pandas dataframe.

      In contrast with presamples arrays, ``iterable_data_source`` cannot be an infinite generator. We need a finite set of data to build a matrix.

      In contrast to ``self.create_structured_array``, this always stores the dataframe in ``self.data``; no proxies are used.

      :param dataframe: Dataframe to be persisted to disk.
      :param valid_for: List of resource names that this metadata is valid for; must be either structured or presample indices arrays. Each item in ``valid_for`` has the form ``("resource_name", "rows" or "cols")``. ``resource_name`` should be either a structured or a presamples indices array.
      :param name: The name of this resource. Names must be unique in a given data package.
      :type name: optional
      :param extra: Dict of extra metadata
      :type extra: optional

      :returns: Nothing, but appends objects to ``self.metadata['resources']`` and ``self.data``.

      :raises AssertionError: If inputs are not in correct form
      :raises AssertionError: If ``valid_for`` refers to unavailable resources

   .. py:method:: add_dynamic_array(*, matrix: str, interface: Any, indices_array: numpy.ndarray, name: Optional[str] = None, flip_array: Optional[numpy.ndarray] = None, keep_proxy: bool = False, matrix_serialize_format_type: Optional[bw_processing.constants.MatrixSerializeFormat] = None, **kwargs) -> None

      `interface` must support the presamples API.

   .. py:method:: add_dynamic_vector(*, matrix: str, interface: Any, indices_array: numpy.ndarray, name: Optional[str] = None, flip_array: Optional[numpy.ndarray] = None, keep_proxy: bool = False, matrix_serialize_format_type: Optional[bw_processing.constants.MatrixSerializeFormat] = None, **kwargs) -> None

   .. py:method:: add_entries(*, matrix: str, entries: list[bw_processing.matrix_entry.MatrixEntry], name: Optional[str] = None) -> None

      Add matrix data from a list of :class:`MatrixEntry` objects.

      High-level convenience method that does not require working directly with NumPy arrays.

      :param matrix: Name of the target matrix (e.g. ``"technosphere"``).
      :param entries: List of :class:`.MatrixEntry` instances.
      :param name: Optional resource group name; auto-generated if omitted.

   .. py:method:: add_json_metadata(*, data: Any, valid_for: str, name: str = None, **kwargs) -> None

      Add an iterable metadata object to be stored as a JSON file.
      The purpose of storing metadata is to enable data exchange; therefore, this method assumes that data is written to disk.

      The normal use case of this method is to provide names and other metadata for parameters whose values are stored as presamples arrays. The length of ``data`` should match the number of rows in the corresponding presamples array, and ``data`` is just a list of string labels for the parameters. However, this method can also be used to store other metadata, e.g. for external data resources.

      In contrast to ``self.create_structured_array``, this always stores the data in ``self.data``; no proxies are used.

      :param data: Data to be persisted to disk.
      :param valid_for: Name of structured data or presample array that this metadata is valid for.
      :param name: The name of this resource. Names must be unique in a given data package.
      :type name: optional
      :param extra: Dict of extra metadata
      :type extra: optional

      :returns: Nothing, but appends objects to ``self.metadata['resources']`` and ``self.data``.

      :raises AssertionError: If inputs are not in correct form
      :raises AssertionError: If ``valid_for`` refers to unavailable resources

   .. py:method:: add_persistent_array(*, matrix: str, data_array: numpy.ndarray, indices_array: numpy.ndarray, name: Optional[str] = None, flip_array: Optional[numpy.ndarray] = None, keep_proxy: bool = False, matrix_serialize_format_type: Optional[bw_processing.constants.MatrixSerializeFormat] = None, **kwargs) -> None

   .. py:method:: add_persistent_vector(*, matrix: str, indices_array: numpy.ndarray, name: Optional[str] = None, data_array: Optional[numpy.ndarray] = None, flip_array: Optional[numpy.ndarray] = None, distributions_array: Optional[numpy.ndarray] = None, keep_proxy: bool = False, matrix_serialize_format_type: Optional[bw_processing.constants.MatrixSerializeFormat] = None, **kwargs) -> None

   .. py:method:: add_persistent_vector_from_iterator(*, matrix: str = None, name: Optional[str] = None, dict_iterator: Any = None, nrows: Optional[int] = None, matrix_serialize_format_type: Optional[bw_processing.constants.MatrixSerializeFormat] = None, **kwargs) -> None

      Create a persistent vector from an iterator. Uses the utility function ``resolve_dict_iterator``.

      This is the **only array creation method which produces sorted arrays**.

   .. py:method:: finalize_serialization() -> None

   .. py:method:: write_modified()

      Write the data in modified files to the filesystem (if allowed).


.. py:class:: DatapackageBase

   Bases: :py:obj:`abc.ABC`

   Base class for datapackages. Not for normal use - you should use either `Datapackage` or `FilteredDatapackage`.

   .. py:method:: __get_resources() -> list

   .. py:method:: __set_resources(dct: dict) -> None

   .. py:method:: _dehydrate_interfaces() -> None

      Substitute an interface resource with ``UndefinedInterface``, in preparation for finalizing data on disk.

   .. py:method:: _get_index(name_or_index: Union[str, int]) -> int

      Get index of a resource by name or index.

      Returning the same number is a bit silly, but makes the other code simpler :)

      :raises IndexError: ``name_or_index`` was too big
      :raises ValueError: Name ``name_or_index`` not found
      :raises NonUnique: Name ``name_or_index`` not unique in given resources

   .. py:method:: dehydrated_interfaces() -> List[str]

      Return a list of the resource groups which have dehydrated interfaces.

   .. py:method:: del_resource(name_or_index: Union[str, int]) -> None

      Remove a resource, and delete its data file, if any.

   .. py:method:: del_resource_group(name: str) -> None

      Remove a resource group, and delete its data files, if any.

      Use ``exclude_resource_group`` if you want to keep the underlying resource in the filesystem.

   .. py:method:: exclude(filters: Dict[str, str]) -> FilteredDatapackage

      Filter a datapackage to exclude resources matching a filter.
      Usage cases:

      Filter out a given resource::

         exclude({"matrix": "some_label"})

      Filter out a resource group with a given kind::

         exclude({"group": "some_group", "kind": "some_kind"})

   .. py:method:: filter_by_attribute(key: str, value: Any) -> FilteredDatapackage

      Create a new ``FilteredDatapackage`` which satisfies the filter ``resource[key] == value``.

      All included objects are the same as in the original data package, i.e. no copies are made. No checks are made to ensure consistency with modifications to the original datapackage after the creation of this filtered datapackage.

      This method was introduced to allow for the efficient construction of matrices; each datapackage can have data for multiple matrices, and we can then create filtered datapackages which exclusively have data for the matrix of interest. As such, they should be considered read-only, though this is not enforced.

   .. py:method:: get_max_index_value() -> int

      Get maximum index value (max signed 32 or 64 bit integer) for this datapackage.

   .. py:method:: get_resource(name_or_index: Union[str, int]) -> (Any, dict)

      Return data and metadata for ``name_or_index``.

      :param name_or_index: Name (str) or index (int) of a resource in the existing metadata.

      :raises IndexError: Integer index out of range of given metadata
      :raises ValueError: String name not present in metadata
      :raises NonUnique: String name present in two resource metadata sections

      :returns: (data object, metadata dict)

   .. py:method:: rehydrate_interface(name_or_index: Union[str, int], resource: Any, initialize_with_config: bool = False) -> None

      Substitute the undefined interface in this datapackage with the actual interface resource ``resource``. Loading a datapackage with an interface loads an instance of ``UndefinedInterface``, which should be substituted (rehydrated) with an actual interface instance.

      If ``initialize_with_config`` is true, the ``resource`` is initialized (i.e. ``resource(**config_data)``) with the resource data under the key ``config``. If ``config`` is missing, a ``KeyError`` is raised.

      ``name_or_index`` should be the data source name. If this value is a string and doesn't end with ``.data``, ``.data`` is automatically added.

   .. py:attribute:: _finalized
      :value: False

   .. py:attribute:: _matrix_serialize_format_type

   .. py:attribute:: _modified

   .. py:property:: groups
      :type: dict

      Return a dictionary of ``{group label: filtered datapackage}`` in the same order as the group labels are first encountered in the datapackage metadata.

      Ignores resources which don't have group labels.

   .. py:attribute:: resources


.. py:class:: FilteredDatapackage

   Bases: :py:obj:`DatapackageBase`

   A subset of a datapackage. Used in matrix construction or other data manipulation operations.

   Should be treated as read-only.


.. py:function:: create_datapackage(fs: Optional[fsspec.AbstractFileSystem] = None, name: Optional[str] = None, id_: Optional[str] = None, metadata: Optional[dict] = None, combinatorial: bool = False, sequential: bool = False, seed: Optional[int] = None, sum_intra_duplicates: bool = True, sum_inter_duplicates: bool = False, matrix_serialize_format_type: bw_processing.constants.MatrixSerializeFormat = MatrixSerializeFormat.NUMPY) -> Datapackage

   Create a new data package.

   All arguments are optional; if an ``fsspec`` filesystem is not provided, an in-memory ``DictFS`` will be used.

   All metadata elements should follow the datapackage specification.

   Licenses are specified as a list in ``metadata``. The default license is the Open Data Commons Public Domain Dedication and License v1.0.

   :param fs: A ``Filesystem``, optional. A new ``DictFS`` is used if not provided.
   :param name: ``str``, optional. A new uuid is used if not provided.
   :param id_: ``str``, optional. A new uuid is used if not provided.
   :param metadata: ``dict``, optional. Metadata dictionary following datapackage specification; see above.
   :param combinatorial: ``bool``, default ``False``. Policy on how to sample columns across multiple data arrays; see readme.
   :param sequential: ``bool``, default ``False``. Policy on how to sample columns in data arrays; see readme.
   :param seed: ``int``, optional. Seed to use in random number generator.
   :param sum_intra_duplicates: ``bool``, default ``True``. Whether duplicate elements in a single data resource are summed together; if ``False``, the last value replaces previous values.
   :param sum_inter_duplicates: ``bool``, default ``False``. Whether duplicate elements across data resources are summed together; if ``False``, the last value replaces previous values. Order of data resources is given by the order they are added to the data package.
   :param matrix_serialize_format_type: ``MatrixSerializeFormat``, default ``MatrixSerializeFormat.NUMPY``. Matrix serialization format type.

   :returns: A `Datapackage` instance.


.. py:function:: load_datapackage(fs_or_obj: Union[DatapackageBase, fsspec.AbstractFileSystem], mmap_mode: Optional[str] = None, proxy: bool = False) -> Datapackage

   Load an existing datapackage.

   Can load proxies to data instead of the data itself, which can be useful when interacting with large arrays or large packages where only a subset of the data will be accessed.

   Proxies use something similar to `functools.partial` to create a callable class instead of returning the raw data (see https://github.com/brightway-lca/bw_processing/issues/9 for why we can't just use `partial`). Datapackage access methods (i.e. `.get_resource`) will automatically resolve proxies when needed.

   :param fs_or_obj: A `Filesystem` or an instance of `DatapackageBase`.
   :param mmap_mode: `str`, optional. Define memory mapping mode to use when loading Numpy arrays.
   :param proxy: `bool`, default `False`. Load proxies instead of complete Numpy arrays; see above.

   :returns: A `Datapackage` instance.


.. py:function:: simple_graph(data: dict, fs: Optional[fsspec.AbstractFileSystem] = None, **metadata) -> Datapackage

   Easy creation of simple datapackages with only persistent vectors.

   .. deprecated:: Use :func:`bw_processing.matrix_entry.create_datapackage_from_entries` with :class:`bw_processing.matrix_entry.MatrixEntry` objects instead.

   :param data: A dictionary mapping each matrix name to a list of tuples, where
      ``row_id`` and ``col_id`` are ``int`` values, ``value`` is a ``float``, and
      ``flip`` is a ``bool`` (``False`` by default). It has the form::

         {
             matrix_name: [
                 (row_id, col_id, value, flip)
             ]
         }

   :param fs: A filesystem.
   :param metadata: Passed as kwargs to ``create_datapackage()``.

   :returns: The datapackage.
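
The shape of the ``data`` dictionary described for ``simple_graph`` can be sketched in plain Python. This is only an illustration of the input format, not code from ``bw_processing``: ``normalize_entries`` is a hypothetical helper, and reading "``False`` by default" as meaning that the ``flip`` flag may be left out of a tuple is an assumption.

```python
from typing import Dict, List, Sequence, Tuple

# (row_id, col_id, value, flip), as described for the ``data`` argument
Entry = Tuple[int, int, float, bool]


def normalize_entries(data: Dict[str, List[Sequence]]) -> Dict[str, List[Entry]]:
    """Hypothetical helper, not part of bw_processing.

    Pads a missing ``flip`` flag with ``False`` (an assumed reading of
    "False by default") and coerces each field to the documented type.
    """
    normalized: Dict[str, List[Entry]] = {}
    for matrix_name, rows in data.items():
        cleaned: List[Entry] = []
        for row in rows:
            if len(row) == 3:
                row_id, col_id, value = row
                flip = False  # assumed default when the flag is omitted
            elif len(row) == 4:
                row_id, col_id, value, flip = row
            else:
                raise ValueError(f"Expected 3 or 4 items per entry, got {len(row)}")
            cleaned.append((int(row_id), int(col_id), float(value), bool(flip)))
        normalized[matrix_name] = cleaned
    return normalized


# Illustrative matrix names and values only
data = {
    "technosphere": [(1, 1, 1.0), (1, 2, 0.5, True)],
    "biosphere": [(3, 1, 2.5)],
}
normalized = normalize_entries(data)
```

Each matrix name maps to a list of complete four-item tuples after normalization, which is the form shown in the ``data`` parameter description.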