Skip to content

PyPI version

excel2xml: Convert a data source to XML

dsp-tools assists you in converting a data source in CSV/XLS(X) format to an XML file. There are two use cases for a transformation from Excel/CSV to XML:

  • The CLI command dsp-tools excel2xml creates an XML file from an Excel/CSV file which is already structured according to the DSP specifications. This is mostly used for DaSCH-interal data migration. The CLI command is documented here.
  • The module excel2xml can be imported into a custom Python script that transforms any tabular data into an XML. This use case is more frequent, because data from research projects have a variety of formats/structures. This document only treats the excel2xml module.

How to use the module excel2xml

At the end of this document, you find a sample Python script. In the following, it is explained how to use it.

General preparation

Insert your ontology name, project shortcode, and the path to your data source. If necessary, activate one of the lines that are commented out.
Then, the root element is created, which represents the <knora> tag of the XML document. As first children of <knora>, some standard permissions are added. At the end, please carefully check the permissions of the finished XML file if they meet your requirements, and adapt them if necessary.
The standard permission of a resource is "res-default", and of a property "prop-default". If you don't specify it otherwise, all resources and properties get these permissions. With excel2xml, it is not possible to create resources/ properties that don't have permissions, because they would be invisible for all users except project admins and system admins. Read more about permissions here.

Create list mappings

Let's assume that your data source has a column containing list values named after the "label" of the JSON project list, instead of the "name" which is needed for the dsp-tools xmlupload. You need a way to get the names from the labels. If your data source uses the labels correctly, this is an easy task: The method create_json_list_mapping() creates a dictionary that maps the labels to the names.
If, however, your data source has spelling variants, you need the more sophisticated approach of create_json_excel_list_mapping(): This method creates a dictionary that maps the list values in your data source to their correct JSON project node name. This happens based on string similarity. Please carefully check the result if there are no false matches!

Create all resources

With the help of the Python pandas library, you can then iterate through the rows of your Excel/CSV, and create resources and properties. Some examples of useful helper methods are:

Create an ID for a resource

The method make_xsd_id_compatible(string) makes a string compatible with the constraints of xsd:ID, so that it can be used as ID of a resource.

Create a property

For every property, there is a helper function that explains itself when you hover over it. It also has a link to the dsp-tools documentation of this property. So you don't need to worry how to construct a certain XML value for a certain property.

For make_boolean_prop(cell), the following formats are supported:

  • true: True, "true", "True", "1", 1, "yes", "Yes"
  • false: False, "false", "False", "0", 0, "no", "No"

Check if a cell contains a usable value

The method check_notna(cell) checks a value if it is usable in the context of data archiving. A value is considered usable if it is - a number (integer or float, but not np.nan) - a boolean - a string with at least one Unicode letter, underscore, or number, but not "None", "", "N/A", or "-" - a PropertyElement whose "value" fulfills the above criteria

Calendar date parsing

The method find_date_in_string(string) tries to find a calendar date in a string. If successful, it returns the DSP-formatted date string.

Notes:

  • The date can be embedded in text.
  • Only the first date found is returned.
  • By default, dates are interpreted as CE (Christian era) in the Gregorian calendar.
  • The years 0000-2999 are supported, in 4-digit form.
  • Dates written with slashes are always interpreted in a European manner: 5/11/2021 is the 5th of November.

Currently supported date formats:

Input Output
0476_09_04 GREGORIAN:CE:0476-09-04:CE:0476-09-04
0476-09-04 GREGORIAN:CE:0476-09-04:CE:0476-09-04
30.4.2021 GREGORIAN:CE:2021-04-30:CE:2021-04-30
5/11/2021 GREGORIAN:CE:2021-11-05:CE:2021-11-05
Jan 26, 1993 GREGORIAN:CE:1993-01-26:CE:1993-01-26
February26,2051 GREGORIAN:CE:2051-02-26:CE:2051-02-26
28.2.-1.12.1515 GREGORIAN:CE:1515-02-28:CE:1515-12-01
25.-26.2.0800 GREGORIAN:CE:0800-02-25:CE:0800-02-26
1.9.2022-3.1.2024 GREGORIAN:CE:2022-09-01:CE:2024-01-03
1848 GREGORIAN:CE:1848:CE:1848
1849/1850 GREGORIAN:CE:1849:CE:1850
1849/50 GREGORIAN:CE:1849:CE:1850
1845-50 GREGORIAN:CE:1845:CE:1850

Complete example

Save the following files into a directory, and run the Python script. The features discussed in this document are contained therein.


Last update: 2022-06-10