[DFDL-WG] DFDL implementation support for element refs

Steve Hanson smhdfdl at gmail.com
Thu May 9 11:33:07 PDT 2024


Hi Mike

I get the rationale behind having a restricted subset that makes mapping to
non-namespaced targets easier. But that isn't a reason to error the use of
element references. I think you just need to error the occurrence of any
target namespace in the root schema, and any use of xsd:import for a
non-DFDL namespace. That also stops the use of namespaces for simple type
restrictions and groups, but I think that's fine. If you don't want
namespaces involved, outlaw them completely.
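
A minimal sketch of that check, assuming it runs over a single schema document at a time. The function name `namespace_violations` and the `urn:example:` URIs are made up for illustration; this is not how Daffodil or IBM DFDL does it.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"
DFDL_NS = "http://www.ogf.org/dfdl/dfdl-1.0/"


def namespace_violations(schema_text):
    """Return one message per namespace use found in a schema document.

    Hypothetical helper: flags a targetNamespace on the schema element and
    any xs:import whose namespace is something other than the DFDL one.
    """
    root = ET.fromstring(schema_text)
    problems = []
    target = root.get("targetNamespace")
    if target:
        problems.append("schema declares targetNamespace " + target)
    for imp in root.iter("{%s}import" % XSD_NS):
        ns = imp.get("namespace")
        # An import with no namespace attribute pulls in a no-namespace
        # schema, which this sketch treats as acceptable.
        if ns is not None and ns != DFDL_NS:
            problems.append("xsd:import of non-DFDL namespace " + ns)
    return problems


ok_schema = (
    '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
    '<xs:element name="root" type="xs:string"/>'
    '</xs:schema>'
)
bad_schema = (
    '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"'
    ' targetNamespace="urn:example:v1">'
    '<xs:import namespace="urn:example:types"/>'
    '</xs:schema>'
)

print(namespace_violations(ok_schema))
print(namespace_violations(bad_schema))
```

A real implementation would presumably also follow xsd:include and report each message as an SDE rather than returning strings.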

Regards
Steve


On Thu, May 9, 2024 at 2:53 PM Mike Beckerle <mbeckerle at apache.org> wrote:

> I agree element references are useful in the way you state, and we've
> built many DFDL schemas in this manner.  It's nice to build XML Schemas
> this way, it really has little to do with DFDL other than being a pattern
> that encourages unit testing of individual record types, which is good
> practice.
>
> The rest of this is probably TL;DR, but rationalizes why we don't need any
> change to DFDL, just a flag in Daffodil to escalate a particular warning to
> an SDE.
>
> The challenge is when you need to describe data in DFDL, and then pass it
> to something that does not accept XML, or treats XML as a giant string, so
> that you really want to instead map the DFDL infoset directly to the native
> data structures, not use an XML String.  And... that native data structure
> has simple names, not namespaced names.
>
> For example, define and populate Java POJOs from DFDL-described data.
> Types aka classes can be in packages with complex names that are like
> namespaces. But element names, aka members of classes, must have simple names
> that use only characters such as "A-Za-z0-9_". You can make big long "simple"
> names, but that's undesirable.
>
> If a DFDL (or XSD) Schema has two elements that are peer children within
> the same parent, and they differ in QName only by namespace, XSD has no
> issue with it.
> Daffodil will issue a warning (which can be suppressed) that this will be
> incompatible with data that has no namespaces. If you then convert such
> data to say, JSON, it will happily populate that same local name with
> different things, resulting in data that cannot be unparsed, can't be
> reliably queried, can't have a JSON-Schema, etc.
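
The conflict described above (peer children whose QNames differ only by namespace) can be sketched as a small check. This assumes the children of one parent are available as (namespace, local name) pairs; `local_name_conflicts` and the `urn:example:` URIs are hypothetical and not taken from Daffodil.

```python
from collections import defaultdict


def local_name_conflicts(children):
    """Find local names that occur under more than one namespace.

    children: iterable of (namespace, local_name) pairs for the peer
    children of a single parent. A namespace of None means "no namespace".
    Returns {local_name: sorted list of namespaces} for each conflict.
    """
    seen = defaultdict(set)
    for namespace, local_name in children:
        seen[local_name].add(namespace or "")
    return {name: sorted(nss) for name, nss in seen.items() if len(nss) > 1}


peers = [
    ("urn:example:v1", "header"),
    ("urn:example:v2", "header"),
    ("urn:example:v1", "body"),
]
print(local_name_conflicts(peers))
```

Here `header` is reported because it appears in two namespaces, which is exactly the case that collapses to a single key once the data is converted to JSON.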
>
> If the two elements have different namespace prefixes, one could append
> the prefix to the element name, and define POJO Java class based on the
> global element declaration - using the namespace and prefix to create a
> unique class name.
>
> But it is possible for the two elements to have the same prefix - as it
> can be redefined in any enclosing context. In that case one must generate a
> unique name given the namespace - probably by just adding a numeric suffix
> to the prefixes to make those prefixes unique.
>
> So it is possible (though a bit complex) to minimize this stuff and
> generate unique names to get out of the way of this problem. However, this
> makes the data harder to use.
> As an example, Apache Drill is a "use SQL on anything" tool. We've built
> an integration (mostly, not quite done) which allows it to query any data
> described by a DFDL schema in combination with any of the other databases
> and types it can query.
>
> But its data model does not have element namespaces. For now we just fail
> if you have two elements that differ only by namespace. I.e., your DFDL
> schema is considered not suitable for Drill querying, and we suggest you
> change the schema.
>
> To avoid this in advance I'm thinking of a Daffodil flag that escalates
> this name conflict warning (same name different namespaces) to an SDE, so
> that people will proactively get rid of it.
>
> Unfortunately, we have found this element name problem sneaks into
> schemas. It occurs naturally if you are trying to create schemas that
> simultaneously handle multiple versions of the same data format. You end up
> wanting to have the same element name in one branch of a choice in a
> namespace for version 1, and the same element name in another branch of a
> choice for version 2, where the choice is discriminated by the version
> information. There is no getting around the fact that, when querying such
> data, a query (such as XPath) can only be polymorphic over versions if you are able
> to ignore/bypass the namespace part and use only the local name of the
> element in the query language. This can be done in XQuery or XPath using
> 'predicates' that match on fn:local-name(). DFDL expressions cannot do this
> as we only allow indexing in predicates.
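
A small sketch of a query that is polymorphic over versions. In XPath the match would be something like `*[local-name()='record']`; Python's `xml.etree.ElementTree` does not implement `local-name()`, but its `{*}` wildcard (Python 3.8 and later) has the same effect of ignoring the namespace. The element names and `urn:example:` URIs are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Two versions of the same record, each in its own namespace.
doc = ET.fromstring(
    '<log>'
    '<v1:record xmlns:v1="urn:example:v1"><v1:id>7</v1:id></v1:record>'
    '<v2:record xmlns:v2="urn:example:v2"><v2:id>8</v2:id></v2:record>'
    '</log>'
)

# Namespace-qualified query: only sees the version 1 branch.
v1_only = [e.text for e in doc.findall("./{urn:example:v1}record/{urn:example:v1}id")]

# Local-name-only query: matches both versions.
all_ids = [e.text for e in doc.findall("./{*}record/{*}id")]

print(v1_only)
print(all_ids)
```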
>
> On Tue, May 7, 2024 at 3:51 PM Steve Hanson <smhdfdl at gmail.com> wrote:
>
>> Mike
>>
>> IBM DFDL as used by ACE has supported element refs since day one. They
>> are really useful, as shown in the DFDL schemas for EDIFACT.  Each EDIFACT
>> message is a global element, so can be parsed on its own. But there is also
>> the EDIFACT interchange global element, which is a collection of EDIFACT
>> messages, so the natural approach is to use element refs to pull in the
>> EDIFACT messages.
>>
>> I'll try and join on Thursday but I am away Wed and Thurs, it all depends
>> when I get home.
>>
>> Regards
>> Steve
>>
>> On Mon, May 6, 2024 at 11:20 PM Mike Beckerle <mbeckerle at apache.org>
>> wrote:
>>
>>> I'm interested in which DFDL implementations support element references.
>>>
>>> IBM ACE?
>>> IBM zTPF?
>>> DFDL4Space?
>>>
>>> Can you let me know whether these implementations support element refs?
>>>
>>> The reason I ask is below, which may be of interest or perhaps TL;DR.
>>>
>>> We support element references in Daffodil, but I'm coming around to the
>>> view that element refs are a bad idea in DFDL schemas.
>>>
>>> They add no expressive power for describing any data format. That
>>> suggests we should have left them out of DFDL, but for some reason we
>>> didn't.
>>>
>>> The problem is that most data languages have nothing like element
>>> references and the associated element namespace management complexity
>>> available.
>>>
>>> So as soon as you want to use a DFDL schema but not use it to
>>> interchange data as XML, element refs become a problem.
>>>
>>> I'm playing around with a best practice/subset/profile suggestion where:
>>>
>>> * The only global element declarations in the schema are for root
>>> elements.
>>> * Element references are disallowed
>>> * The root elements are declared in a root schema file that contains
>>> ONLY the root elements
>>> * Root elements should always be declared by one-liners like this:
>>> `<element name="rootElement" type="prefix:rootElementType"/>`
>>> * The root elements schema file has no target namespace.
>>> * All group, type, and DFDL format/escapeScheme/variable definitions
>>> must be declared in different schema files that may (and probably should)
>>> have a target namespace.
>>>
>>> The benefit of these restrictions is that the elements in the nest of a
>>> DFDL infoset never have any namespaces.
>>> This makes them compatible with non-namespaced data systems like JSON,
>>> Apache Drill, Apache NiFi, generated C code, etc.
>>> This makes integration with those things *massively* simpler.
>>>
>>> Such schemas are still easily reused by reusing the type of the root
>>> element, so there is no need to ever use an element reference, and a nice
>>> composition property occurs - you don't need element references to assemble
>>> schemas from component schemas, and the assembled component has the same
>>> characteristic.
>>>
>>> There are a few other things this discipline also simplifies. Reusing
>>> test data becomes simpler if namespace URIs aren't getting embedded in
>>> every test infoset XML file, for example.
>>>
>>> All comments are welcome.
>>>
>>> Mike Beckerle
>>> Apache Daffodil PMC | daffodil.apache.org
>>> OGF DFDL Workgroup Co-Chair |
>>> www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
>>> Owl Cyber Defense | www.owlcyberdefense.com
>>>
>>>
>>