Importing#
Leaning from examples
Importing data is the worst part of the LCA world, and it is easy to get overwhelmed. This document discusses basic concepts, but also includes a set of examples that show the concepts in practice.
Prepared projects
Some data has already been prepared for Brightway, so you can skip doing the import yourself. Look at the initial project data and at bw2io.ecoinvent.import_ecoinvent_release
.
Introduction#
There are some standards for life cycle inventory data, but the sad truth is that there are no really good standards, and each implementation of the standards by different software providers has its own quirks.
The biggest difficulty we face is in constructing the supply chain graph from its serialization in the given data format. In the most basic case, a graph can be written:
nodes = ['A', 'B']
edges = [{'from': 'A', 'to': 'B', 'amount': 42}]
In this case, there is a a single unambiguous identifier for each node, and everything is easy. In the real world, edges are normally defined by the attributes of the referenced nodes, like this:
nodes = [{
'id': 1,
'name': 'foo'
}, {
'id': 2,
'name': 'bar'
}]
edges = [{'from': 'foo', 'to': 'bar', 'amount': 42}]
This type of data exchange format is convenient for data creators, as some formats simply lack a single unambiguous identifier, and in others this identifier is generated at run-time. However, to use this data we now need to figure out that the node with name “foo” is the one with “id” 1
- and real world matching is more complex than this simple example, and needs to be done hundreds or thousands of times. In general, we need to write our import logic assuming that we will have to do attribute matching.
Each main LCA data format comes with its own set of challenges:
Format |
Gotchas |
---|---|
ecospold 1 |
Unique identifier is |
ecospold 2 |
Separate UUIDs for activity and product, but they aren’t easy to find, and you need to combine them to get a single identifier for a process with a reference process located in time and space |
SimaPro CSV |
No guarantee of any attribute uniqueness, exports don’t always include internal ids, attributes changed from original data for common background databases, exported data can include errors[1] |
Excel |
Potential whitespace errors, no support for anything but primitive data types, custom templates require custom code |
OLCA JSON-LD |
Format can change in incompatible ways, lots of NIH |
ILCD |
Everything we love about XML, better for individual datasets than complete databases, in practice requires using EU elementary flow list and UUID mapping |
Exiobase |
Not a single format, elementary flow list changes over time |
The basic strategy for importing data is the following:
First, data is extracted from the given format. Extraction is done using a format-specific extractor. Currently, there are extractors for ecospold1
, ecospold2
, simapro CSV
, exiobase
, and the standard Brightway excel
template.
Next, we have to choose where and how to transform the data, either for direct import or for linking to other databases already present. Harmonization could mean transforming the data to fit a certain taxonomy, to fit a certain data model, or to meet other data modelling assumptions. There is a lot of variety in LCA data formats, and the most popular software allows for a lot of flexibility, so there are no general rules.
We then need to create concrete links in each edge - i.e. to turn 'from': 'foo'
to 'from': 1
. To do this, we define the fields we want to link with, and use the built-in Brightway
linking functionality. This is normally an iterative process, trying different linking approaches combined with harmonization and examining the linking status until the imported data is completely linked.
Finally, we can load the data into our relational data store, and use it in models or calculations.
Transforming Nodes or Edges#
In general, we want to avoid transformations, as it is cleaner and less future work to use the original data as it was provided. Transformations are a necessary evil used for two main purposes:
To align the data with a given data schema or set of modelling assumptions
To allow for linking, and especially for linking across databases.
Linking is always finding a suitable edge candidate, so transformations for linking are almost always on edges, not nodes, as we can normally assume that imported data is internally consistent. Transforming nodes can be done, though; here are some common scenarios:
You want all data to be harmonized to a certain taxonomy. Brightway prefers that units be written in full, for example, so you could normalize all unit strings to “kilogram”, etc.
You want to reverse changes made by a data provider. For example, some databases are built on top of other databases, and if we have those other databases already imported, we would like to remove the existing link and re-link to our versions of the background data. This normally means removing the copied nodes in the imported data.
You need to adapt the provided data to better fit a specific set of assumptions, a data quality standard, or a data schemas.
Static Transformations#
Static transformations are pre-computed, and come prepared in a fixed format. Brightway
has a number of static transformations available in the randonneur_data library.
Static transformations are great when they work. Please be careful to read their descriptions, match versions precisely, and check to make sure they are doing what you expect! Static transformations will also not perform linking - they will prepare the data to be compatible so that it can be linked afterwards.
The most common static transformations in normal practice are updating references from one background database version to a newer version before linking, and undoing modifications to allow for linking against original data. Here is an example of updating references:
import randonneur_data as rd
registry = rd.Registry()
registry.sample('ecoinvent-3.9.1-biosphere-ecoinvent-3.10-biosphere')
{
'replace': [{
'source': {
'name': 'Xylene',
'uuid': '0e6cf9f9-44ff-4395-ad3b-36109a32e6eb'
},
'target': {
'name': 'Xylenes, unspecified',
'formula': 'C8H10',
'uuid': '0e6cf9f9-44ff-4395-ad3b-36109a32e6eb'
},
'comment': 'Flow attribute change not listed in change report'
},
{
'source': {
'uuid': 'ce11d77b-c85a-4f03-816e-714dcda260ea',
'name': 'Americium-241'
},
'target': {
'uuid': 'a0e98cdc-79cd-4073-9713-c1a48238883a',
'name': 'Americium-241'
}
}]
}
Similarly, here is a transformation from one nomenclature system to another for the same data:
import randonneur_data as rd
registry = rd.Registry()
registry.sample('simapro-ecoinvent-3.9.1-cutoff', 1)
{
'replace': [{
'source': {
'identifier': 'EI3ARUNI000011519609842',
'name': 'Ventilation of dwellings, decentralized, 6 x 120 m3/h {CH}| ventilation of dwellings, decentralized, 6 x 120 m3/h, polyethylene ducts, with earth tube heat exchanger | Cut-off, U',
'platform_id': '35E3AE4E-798C-40F1-A9A2-626BE36D8367'
},
'target': {
'filename': '2578e5c8-8fdd-5a68-bb93-525f992fafd8_60c21de1-a8e8-4f50-8949-78759bf96c62.spold',
'name': 'ventilation of dwellings, decentralized, 6 x 120 m3/h, polyethylene ducts, with earth tube heat exchanger',
'location': 'CH',
'reference product': 'ventilation of dwellings, decentralized, 6 x 120 m3/h',
'unit': 'm2*year'
}
}
]}
The complete transformation data format includes useful metadata in addition to the raw changes.
Applying Static Transformations#
TBD
In order to use these transformations, we need to think about and specify the following:
Do I want to do destructive changes (delete and create), or just modifications (replace, update, and disaggregate)?
Are the labels used in the transformation data correct for my data schema?
What fields do we want to use for matching to determine if the transformation should be applied?
Creating New Static Transformations#
TBD
Static transformations should be added to a randonneur_data
regis
See bw2io.importers.base_lci.LCIImporter.randonneur
on how to customize these choices.
Prepared static transformations can be generated in two ways - programmatically, or manually. Programmatic generation is always preferred, as it should be tested, is reproducible, and can be adapted for new contexts or input lists. However, manual matching is sometimes necessary. Manual matches should be noted clearly in the transformation file comment.
The flowmapper tool allows for automatic matching of elementary flow lists. Technosphere flow matching is usually custom developed, as there are few generic patterns to follow. The ecoinvent_migrate library is an exception, and generates prepared static transformations for ecoinvent technosphere and biosphere upgrades.
Dynamic Transformations#
Dynamic transformations are done programmatically - i.e. a function takes in a dataset, does some changes, and returns the altered dataset. There are many such functions already available in the Brightway framework. Here are a few:
bw2io.strategies.generic.convert_uncertainty_types_to_integers
bw2io.strategies.ecospold2.add_cpc_classification_from_single_reference_product
bw2io.strategies.ecospold2.remove_uncertainty_from_negative_loss_exchanges
In general, even if you are just writing a quick script for a one-off import, it’s best practice to write transformation functions which have documentation and some simple unit tests. This development work forces you to think about what you are trying to accomplish, and what could go wrong.
Here are some examples of typical harmonization steps:
Next, each dataset is normalized or transformed to make it better conform to what Brightway2 expects. This could mean, for example, copying the only production exchange to the list of
products
, or normalizing the units or biosphere category names. This step could also include applying migrations, which are additional dataset that can be used to transform data to new forms. For example, SimaPro changes ecoinvent activity and product names, and thesimapro-ecoinvent-3
changes these names back to what ecoinvent provided. Migrations are explained in more detail below.The third step is to link exchanges to activities within the imported data. Brightway2 has a powerful generic linking function called
link_iterable_by_fields
that does the heavy lifting. This function will link an exchange if the fields match, i.e. it has the same name, location, unit, etc.link_iterable_by_fields
can also be told to only link certain types of exchanges, such as biosphere exchanges.Many imported datasets will link to other databases already installed on your computer. You can link these exchanges using the
.match_database()
function. You can customize this function by specifying the fields to use, as well as other options.You should then check on the quality of linking using the
statistics()
function, which will tell you how many exchanges are in the data, and how many unlinked exchanges are present, as well as the types of unlinked exchanges.You are finally ready to choose what to do with the imported data. If all exchanges are linked, you can write a new database with
.write_database()
. You can also save your work with.write_unlinked(name)
, which will save a new unlinked database for further processing at a later time. You can also write details on linking with.write_excel()
, which can write the entire data or just the unlinked exchanges. Of course, you can always continue with steps 2, 3, and 4, refining your linking until you are satisfied.
If this seems a bit overwhelming, that’s because it is - and a huge
pain. The current data formats and lack of well-defined strategies for
interchange between databases and even updating databases makes life
much more difficult than it should be. There are concrete examples of
importing databases in example-io-notebooks
.
Importing from ecospold 1
#
Importing from ecospold 1 is relatively simple. Multioutput products are
allocated to single output products using the given allocation factors
using the strategy es1_allocate_multioutput
. The reference product is
then assigned using the strategy assign_only_product_as_production
.
Next, some basic data cleanup is performed. Integer codes are removed,
as these are not used consistently by different LCA software
(clean_integer_codes
). Unspecified subcategories are removed (i.e.
('air', 'unspecified')
is changed to ('air',)
) using
drop_unspecified_subcategories
. Biosphere exchange names and
categories are normalized using normalize_biosphere_categories
and
normalize_biosphere_names
. Biosphere exchanges are removed, as
biosphere flows do not have locations (strip_biosphere_exc_locations
).
Next, a unique activity code is generated for each dataset, using a
combination of the name, categories, location, and unit
(set_code_by_activity_hash
).
Finally, biosphere flows are linked to the default biosphere database,
and internal technosphere flows are linked using
link_technosphere_by_activity_hash
.
Import#
Importing from ecospold 2
#
Importing from ecospold 2 is a bit complex, because although ecospold 2
gives unique IDs for many fields, which helps in linking, the current
implementation has some known issues
(http://www.ecoinvent.org/database/ecoinvent-version-3/ecoinvent-v30/known-data-issues/)
which have to be resolved or ignored by the importer.
Warning
Brightway2 cannot precisely reproduce the LCI and LCIA results given by the ecoinvent centre. The technosphere matrix used by ecoinvent cannot be reproduced from the provided unit process datasets. However, the differences for most products are quite small.
We start by removing some exchanges from most datasets. Specifically, we
remove exchanges with amounts of zero, both coproducts and technosphere
or biosphere inputs (remove_zero_amount_coproducts
and
remove_zero_amount_inputs_with_no_activity
).
We then assign reference products. Although each unit process should
have a single output, coproducts which have been allcoated away are
often still included, with amounts of zero. We use two strategies to
choose the reference product:
es2_assign_only_product_with_amount_as_reference_product
and
assign_only_product_as_production
.
Next, a composite code is generated, using the UUID of the activity and
the product (create_composite_code
).
Biosphere flow exchanges are now normalized
(drop_unspecified_subcategories
) and linked
(link_biosphere_by_flow_uuid
). Internal technosphere exchanges are
also linked, using the composite codes
(link_internal_technosphere_by_composite_code
).
Not all technosphere exchanges are linked, however. We need to drop two
different types of exchanges, as we have no way of linking them. First,
there are some exchanges with listed products but no listed activities -
and no activity in the database produces these products. Removal is done
with the strategy delete_exchanges_missing_activity
.
Additionally, there are some exchanges with listed products and
activities - but the given activity doesn’t produce the listed product.
These exchanges also have to be deleted, using the strategy
delete_ghost_exchanges
.
Note
Ecoinvent 3.1 includes some dummy biosphere flows (Fluoranthene_temp
, Chrysene_temp
, etc.). They can be safely deleted using using .drop_unlinked(True)
.
Importing from SimaPro#
Importing SimaPro CSV files is also a bit of a headache. Pré, the makers of SimaPro, have done a lot of work to make LCA software accessible and understandable. This work includes making changes to process names and other metadata, which makes linking these processes back to original ecoinvent data difficult. Fortunately, Pré has been very helpful is supplying correspondence files, which we can use to move (to the best of our ability) from the “SimaPro world” to “ecoinvent world”.
Note
Importing SimaPro XML export files is not recommended, as there are bugs with exporting ecoinvent 3 processes.
Importing from the Standard Excel Template#
Note
You can see these ideas in practive in basic database
(https://github.com/brightway-lca/brightway2-io/blob/master/tests/fixtures/excel/basic_example.xlsx?raw=true)
and parameterized database
(https://github.com/brightway-lca/brightway2-io/blob/master/tests/fixtures/excel/sample_activities_with_variables.xlsx?raw=true)
Excel templates.
You can define inventory datasets in many ways in Excel, but there is a
standard template which is supported by the ExcelImporter
. The
template should be pretty self-explanatory, and follows this general
format:
Project parameters (optional)
<column label, <column label, <column label
<value, <value, <value
Database, <name of database
<database field name, <database field value
Database Parameters (Optional)
<column label, <column label, <column label
<value, <value, <value
Activity, <name of activity
<database field name, <database field value
(Optional blank line(s))
Parameters (Optional), <name of parameter group (optional)
<column label, <column label, <column label
<value, <value, <value
(Optional blank line(s))
Exchanges
<column label, <column label, <column label
<value, <value, <value
<value, <value, <value
Data can be split over many worksheets. To mark a worksheet as one that should be skipped by the importer, put the string “skip” in cell 1A.
Sections are split by the presence of blank lines. There are two kinds
of sections: data and metadata. Metadata sections are used for the
descriptions of databases (Database
) and activities (Activity
, until
Exchanges
). In metadata sections, only the first two columns are
read - all other columns are ignored. In the data sections —
Project parameters
, Parameters
, and Exchanges
- all columns are
read, and should have column headings. You can limit the number of
columns read by inserting the string cutoff
in cell 1A, and the column
index after which the importer should start ignoring columns in cell
1B. For example, if you put 4
in cell 1B, the first four columns would
be included, and the rest ignored.
Keep in mind the following general restrictions and notes:
Provide useful codes! There is no reason to use e.g. UUIDs, unless you are feeling masochistic. You can also skip providing codes at all, they will be autogenerated.
As in the activity data format, there are no required fields, and you can also insert your own.
Each workbook can only describe data for one
Database
. MultipleDatabase
sections will raise an error!Input exchanges must be linked using the data provided in the spreadsheet. That means that you may need to provide an activity name and a reference product when trying to link against ecoinvent.
Linking is done using the column headings exactly as provided. You should check to make sure the capitalization, spelling, and spacing are consistent with the attributes in the background databases you want to link against.
Worksheet names don’t have any special meaning to the importer, and are only used to indicate where errors occurred.
Parameters are a bit different - as they are trickier to handle, their sections are stricter:
All parameter sections must have the column “name”. Names should be unique within a parameter section.
Stochastic variables should have the usual uncertainty columns, e.g. uncertainty type, loc, scale, minimum, maximum.
Variables which are defined by formulas should have the column
Parameters can have an “amount” column, but if there is also a “formula” column, the amount will be overwritten on import as the formula is evaluated.
Formulas should be for consumption by Python, not Excel; they will be interpreted by
asteval
(https://newville.github.io/asteval/).The
Database
,Project parameters
, andDatabase parameters
don’t have to be in any order or in any particular worksheet. There can be multipleProject parameters
andDatabase parameters
sections, they will be concatenated.
The following data transformations are applied by the ExcelImporter
:
Numbers are translated from text into actual numbers.
Tuples, separated in the cell by the
::
string, are reconstructed, i.e. “this::example” would become("this", "example")
.The strings “True” and “False” (by themselves, not as part of a larger string) are transformed to their respective boolean values.
Fields with the value
(Unknown)
are omitted on a per-row basis.
Importing LCIA Methods from the Standard Excel Template#
Proper implementation of life cycle impact assessment methods requires a complete set of metadata. However, for simple LCIA methods, Brightway
has a simplified importer for Excel and CSV files: bw2io.ExcelLCIAImporter
and bw2io.CSVLCIAImporter
. These function the same as other importer (i.e. you extract data, apply strategies until you are satisfied with the matching, and write the modified data). There are example Excel workbooks and CSV files.
Please note the following:
Excel workbooks should only have one worksheet.
Default matching to biosphere flows is based on ``name
and
categories`.
The convention forcategories
is the same as for inventory datasets: subcategories should be separated with ::
, e.g.natural resource::in ground
. You can also list the main category and subcategory in separate columns, and use a strategy function to combine them. * The call function to instantiate the LCIA importer objects requiresfilepath
,name
(method name as a tuple),description
, andunit
. What to do with unmatched exchanges? If there are unlinked exchanges, you have several options. If you aren’t sure what to do yet, you can save a temporary copy (that can be loaded later) using .write_unlinked("some name")
. Calling .statistics()
will show what kind of exchanges aren’t linked, e.g.:
> sp.statistics()
366 datasets 3991 exchanges 2639 unlinked exchanges
Type biosphere: 170 unique unlinked exchanges
Type technosphere: 330 unique unlinked exchanges
The options to examine or resolve the unlinked exchanges are:
You can write a spreadsheet of the characterization factors, including their linking status, with
.write_excel("some name")
.You can apply new linking strategies with
.apply_strategies(some_new_strategy)
. Note that this method requires a list of strategies.You can match technosphere or biosphere exchanges to other background databases using
.match_database("another database")
.
To resolve unlinked biosphere exchanges which simply don’t exist in your current biosphere database, you can:
Add them to the biosphere database with
add_unlinked_flows_to_biosphere_database()
Create a new biosphere database with
create_new_biosphere("new biosphere name")
Add the biosphere flows to the database you are currently working on (LCI databases can include both process and biosphere flows) with `add_unlinked_biosphere_flows_to_current_database()
Note
These methods have several options, and you should understand what they do and read their documentation before choosing between them.
Note
You can’t write an LCI database with unlinked exchanges.
Migrations#
Sometimes the only way to correctly link activities or biosphere flows is by applying a list of name (or other field) transforms. For example,SimaPro will export a process named “`sulfonyl`urea-compound {RoW}| production | Alloc Rec, S”, which corresponds to the ecoinvent process “`sulfonyl`urea-compound production”, with reference product “`sulfonyl`urea-compound” and location “RoW”. In another example, in ecoinvent 2, emissions of water to air were measured in kilograms, and in ecoinvent 3, emissions of water to air are measured in cubic meters. In this case, our migration would look like this:
{
'fields': `'name', 'categories', 'type', 'unit'`,
'data': `
(
## First element is input data in the order of `fields` above
('Water', ('air',), 'biosphere', 'kilogram'),
## Second element is new values to substitute
{
'unit': 'cubic meter',
'multiplier': 0.001
}
)
`
}
We call the application of transform lists “migrations”, and they are applied with the .migrate(migrations_name)
method.
If the numeric values in an exchange need to changed, the special key ‘multiplier’ is used, where new_amount = multiplier * old_amount. Uncertainty information and formulas are adjusted automatically, if possible (see utils.rescale_exchange
).
A few additional notes:
Migrations change the underlying data, but do not do any linking - you will also have to apply linking strategies after a migration.
Migrations can specify any number of fields, but of course the fields must be present in the importing database.
TODO: Migrations can be specified in an excel template. Template files must be processed using
convert_migration_file
.Subcategories are not expanded automatically, so a separate row in the migrations file would be needed for e.g.
water (air, non-urban air or from high stacks)
.
Importing an LCIA Method#
LCIA methods can be imported from ecospold 1 XML files
(EcoinventLCIAImporter
) and SimaPro CSV files
(SimaProLCIACSVImporter
).
When importing an LCIA method or set of LCIA methods, you should specify
the biosphere database to link against e.g.
EcoinventLCIAImporter("some file path", "some biosphere database name")
.
If no biosphere database name is provided, the default biosphere3
database is used.
Both importers will attempt to normalize biosphere flow names and categories to the ecospold2 standard, using the strategies:
normalize_simapro_lcia_biosphere_categories
normalize_simapro_biosphere_names
normalize_biosphere_names
normalize_biosphere_categories
Next, the characterization factors are examined to see if they are only
given for root categories, e.g. ('air',)
and not
('air', 'urban air close to ground')
. If only root categories are
characterized, then we assume that the characterization factors also
apply to all subcategories, using the strategy match_subcategories
.
Finally, linking to the given or default biosphere database is
attempted, using the strategy link_iterable_by_fields
and the standard
fields: name, categories, unit, location. Note that biosphere flows do
not actually have a location.
You can now check the linking statistics. If all biosphere flows are
linked, write the LCIA methods with .write_methods()
. Note that
attempting to write an existing method will raise a ValueError
unless
you use .write_methods(overwrite=True)
, and trying to write methods
which aren’t completely linked will also raise a ValueError
.
If there are unlinked characterization factors, you have several
options. If you aren’t sure what to do yet, you can save a temporary
copy (that can be loaded later) using .write_unlinked("some name")
.
The options to examine or resolve the unlinked characterization factors
are:
You can write a spreadsheet of the characterization factors, including their linking status, with
.write_excel("some name")
.You can apply new linking strategies with
.apply_strategies(
some_new_strategy)
. Note that this method requires a list of strategies.TODO: You can write all biosphere flows to a new biosphere database with
.create_new_biosphere("some name")
.If you are satisfied that you don’t care about the unlinked characterization factors, you can drop them with
.drop_unlinked()
.Alternatively, you can add the missing biosphere flows to the biosphere database using
.add_missing_cfs()
.
Custom Strategies#
A strategy
is just a Python callable (usually a function) that follows
a few simple conventions: it takes the input data as a single input
argument, and returns the modified data as the only returned value.
Inventory data is store in {ImporterClass}.data
, and has the following
form:
`
{
'name': "some name",
..., ## other metadata in key: value form
'exchanges': `
{
'amount': float,
..., ## other data in key: value form
}
`
}
`
Here is an example strategy (from
bw2io.strategies.csv.py
(https://github.com/brightway-lca/brightway2-io/blob/master/bw2io/strategies/csv.py)):
def csv_add_missing_exchanges_section(data):
for ds in data:
if "exchanges" not in ds:
ds`"exchanges"` = ``
return data
This is short function, and is not complicated - most strategies will do more.
Data Format for LCIA Methods#
In order to allow for the re-use of strategies, the data format for LCIA
methods is the same as for inventory databases. name
is the name (as a
tuple) of the impact category, and exchanges
is the list of
characterization factors. It is a bit awkward, but makes other things
easier.
Adding Other Inputs to a Strategy#
Strategy functions should only take one input, but sometimes you want to add other input arguments. For example, imagine the following silly strategy:
def add_silly_string(data, string):
for ds in data:
ds`'name'` += string
return data
If you tried to use this, it would raise an error, as string
wouldn’t
be defined. The right approach here is
currying
(https://en.wikipedia.org/wiki/Currying), which allows you to
give the values of some input arguments in advance:
from functools import partial
add_silly_string_fixed = partial(add_silly_string, string="something silly")
You would then apply the strategy add_silly_string_fixed
.
Applying Custom Strategies#
You apply strategies using the
{ImporterClass}.apply_strategy(name_of_callable)
method. You could
also append your custom strategy to {ImporterClass}.strategies
, so it
is used when you call {ImporterClass}.apply_strategies
.