From mbeckerle at apache.org Mon May 6 14:32:23 2024
From: mbeckerle at apache.org (Mike Beckerle)
Date: Mon, 6 May 2024 17:32:23 -0400
Subject: [DFDL-WG] Reminder: Agenda topics for Thursday May 8 DFDL call

I believe the requested agenda topic for the May 8 call was discussion of DFDL 2.0 and/or DFDL 1.1?

Please circulate emails about topics in that area you'd like to discuss on the call, or generally.

There is a problem called the "second system effect" I think which is when version 2.0 never happens because the team wants to fix every single thing wrong with version 1.0, and perfectionism sets in.

We want to avoid that.

There is also a role for what I would call "standard profiles" which are knobs to turn to tell a DFDL processor that you want to limit your DFDL usage so that your schemas will have certain standard behaviours.

An example is to disallow elements to be distinguished only by namespace, so that the DFDL schema can be used to convert data to/from JSON or other data frameworks where elements have only names, and there is no notion of namespaces for names. We've run into this in JSON and a major integration of Daffodil with Apache Drill. DFDL users would benefit from a way to say "keep me out of namespace troubles" like that.

Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com

From mbeckerle at apache.org Mon May 6 15:19:17 2024
From: mbeckerle at apache.org (Mike Beckerle)
Date: Mon, 6 May 2024 18:19:17 -0400
Subject: [DFDL-WG] DFDL implementation support for element refs

I'm interested in what DFDL implementations support element references?

IBM ACE?
IBM zTPF?
DFDL4Space?

Can you let me know whether these implementations support element refs?

The reason I ask is below, which may be of interest or perhaps TL;DR.

We support element references in Daffodil, but I'm coming around to the view that element refs are a bad idea in DFDL schemas.

They're not needed for any specific data format expressive power. That suggests we should have left them out of DFDL, but for some reason we didn't.

The problem is that most data languages have nothing like element references and the associated element namespace management complexity available.

So as soon as you want to use a DFDL schema but not use it to interchange data as XML, element refs become a problem.

I'm playing around with a best practice/subset/profile suggestion where:

* The only global element declarations in the schema are for root elements.
* Element references are disallowed
* The root elements are declared in a root schema file that contains ONLY the root elements
* Root elements should always be declared by one-liners like this: ``
* The root elements schema file has no target namespace.
* All group, type, and DFDL format/escapeScheme/variable definitions must be declared in different schema files that may (and probably should) have a target namespace.

The benefit of these restrictions is that the elements in the nest of a DFDL infoset never have any namespaces.
This makes them compatible with non-namespaced data systems like JSON, Apache Drill, Apache NiFi, Generated C code, etc.
This makes integration with those things *massively* simpler.

Such schemas are still easily reused by reusing the type of the root element, so there is no need to ever use an element reference, and a nice composition property occurs - you don't need element references to assemble schemas from component schemas, and the assembled component has the same characteristic.

There are a few other things this discipline also simplifies. Reusing test data becomes simpler if namespace URIs aren't getting embedded in every test infoset XML file, for example.

All comments are welcome.
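
As a rough illustration of how the restrictions listed above could be checked mechanically, here is a minimal Python sketch. It is not part of Daffodil or any other DFDL implementation: the function name check_root_schema, the sample schemas, and the urn:example namespaces are all hypothetical. In particular, the element declaration in GOOD is only a guess at the general shape of such a one-liner, since the original example did not survive in this archive. Only the rules about the root schema file and element references are covered.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"


def check_root_schema(xsd_text):
    """Return a list of violations of the proposed profile (hypothetical checker)."""
    root = ET.fromstring(xsd_text)
    problems = []

    # Rule: the root elements schema file has no target namespace.
    if root.get("targetNamespace") is not None:
        problems.append("root schema declares a targetNamespace")

    # Rule: the root schema file contains ONLY root element declarations.
    # Assumption: xs:import and xs:annotation are tolerated, because the
    # types live in other (namespaced) schema files.
    allowed = {f"{{{XS}}}element", f"{{{XS}}}import", f"{{{XS}}}annotation"}
    for child in root:
        if child.tag not in allowed:
            problems.append(f"unexpected top-level component: {child.tag}")

    # Rule: element references are disallowed anywhere in the file.
    for elem in root.iter(f"{{{XS}}}element"):
        if elem.get("ref") is not None:
            problems.append(f"element reference to {elem.get('ref')}")

    return problems


# Hypothetical root schema that follows the profile.
GOOD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
                     xmlns:f="urn:example:format">
  <xs:import namespace="urn:example:format" schemaLocation="format.dfdl.xsd"/>
  <xs:element name="message" type="f:MessageType"/>
</xs:schema>"""

# Hypothetical root schema that breaks two of the rules.
BAD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
                    targetNamespace="urn:example:root">
  <xs:element name="interchange">
    <xs:complexType><xs:sequence>
      <xs:element ref="message"/>
    </xs:sequence></xs:complexType>
  </xs:element>
</xs:schema>"""

if __name__ == "__main__":
    print(check_root_schema(GOOD))
    print(check_root_schema(BAD))
```

A real implementation would presumably do this inside the schema compiler, where included and imported files are already resolved; the sketch only looks at one file's text.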

From smhdfdl at gmail.com Tue May 7 12:51:10 2024
From: smhdfdl at gmail.com (Steve Hanson)
Date: Tue, 7 May 2024 20:51:10 +0100
Subject: [DFDL-WG] DFDL implementation support for element refs

Mike

IBM DFDL as used by ACE has supported element refs since day one. They are really useful, as shown in the DFDL schemas for EDIFACT. Each EDIFACT message is a global element, so can be parsed on its own. But there is also the EDIFACT interchange global element, which is a collection of EDIFACT messages, so the natural approach is to use element refs to pull in the EDIFACT messages.

I'll try and join on Thursday but I am away Wed and Thurs, it all depends when I get home.

Regards
Steve

From smhdfdl at gmail.com Wed May 8 01:07:51 2024
From: smhdfdl at gmail.com (Steve Hanson)
Date: Wed, 8 May 2024 09:07:51 +0100
Subject: [DFDL-WG] Reminder: Agenda topics for Thursday May 8 DFDL call

I assume you mean Thursday May 9th, not 8th.

From mbeckerle at apache.org Wed May 8 05:14:47 2024
From: mbeckerle at apache.org (Mike Beckerle)
Date: Wed, 8 May 2024 08:14:47 -0400
Subject: [DFDL-WG] Reminder: Agenda topics for Thursday May 8 DFDL call

Yes. 9th. Not 8th.

From mbeckerle at apache.org Thu May 9 06:53:05 2024
From: mbeckerle at apache.org (Mike Beckerle)
Date: Thu, 9 May 2024 09:53:05 -0400
Subject: [DFDL-WG] DFDL implementation support for element refs

I agree element references are useful in the way you state, and we've built many DFDL schemas in this manner. It's nice to build XML Schemas this way; it really has little to do with DFDL other than being a pattern that encourages unit testing of individual record types, which is good practice.

The rest of this is probably TL;DR, but rationalizes why we don't need any change to DFDL, just a flag in Daffodil to escalate a particular warning to an SDE (Schema Definition Error).

The challenge is when you need to describe data in DFDL, and then pass it to something that does not accept XML, or treats XML as a giant string, so that you really want to instead map the DFDL infoset directly to the native data structures, not use an XML String. And... that native data structure has simple names, not namespaced names.

For example, define and populate Java POJOs from DFDL-described data. Types aka classes can be in packages with complex names that are like namespaces. But element names aka members of classes must have simple names, with "A-Za-z0-9_" as the chars allowed. You can make big long "simple" names, but that's undesirable.

If a DFDL (or XSD) Schema has two elements that are peer children within the same parent, and they differ in QName only by namespace, XSD has no issue with it.
Daffodil will issue a warning (which can be suppressed) that this will be incompatible with data that has no namespaces. If you then convert such data to, say, JSON, the conversion will happily populate that same local name with different things, resulting in data that cannot be unparsed, can't be reliably queried, can't have a JSON-Schema, etc.

If the two elements have different namespace prefixes, one could append the prefix to the element name, and define a POJO Java class based on the global element declaration, using the namespace and prefix to create a unique class name.

But it is possible for the two elements to have the same prefix, as it can be redefined in any enclosing context. In that case one must generate a unique name given the namespace, probably by just adding a numeric suffix to the prefixes to make those prefixes unique.

So it is possible (though a bit complex) to minimize this stuff, and generate unique names to get out of the way of this problem; however, this makes the data harder to use.
As an example, Apache Drill is a "use SQL on anything" tool. We've built an integration (mostly, not quite done) which allows it to query any data described by a DFDL schema in combination with any of the other databases and types it can query.

But its data model does not have element namespaces. For now we just fail if you have two elements that differ only by namespace. I.e., your DFDL schema is considered not suitable for Drill querying, and we suggest you change the schema.

To avoid this in advance I'm thinking of a Daffodil flag that escalates this name conflict warning (same name, different namespaces) to an SDE, so that people will proactively get rid of it.

Unfortunately, we have found this element name problem sneaks into schemas. It occurs naturally if you are trying to create schemas that simultaneously handle multiple versions of the same data format.
You end up wanting to have the same element name in one branch of a choice in a namespace for version 1, and the same element name in another branch of a choice for version 2, where the choice is discriminated by the version information. There is no getting around the fact that, when querying such data, a query (such as XPath) can only be polymorphic over versions if you are able to ignore/bypass the namespace part and use only the local name of the element in the query language. This can be done in XQuery or XPath using 'predicates' that match on fn:local-name(). DFDL expressions cannot do this as we only allow indexing in predicates.
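
To make the name-conflict problem described in the message above concrete, here is a small Python sketch. It is only an illustration under stated assumptions: it inspects an XML infoset instance rather than a schema (Daffodil's warning is about the schema), and the names namespace_only_conflicts, naive_json, and the urn:example namespaces are hypothetical. It shows three things: detecting sibling elements that differ only by namespace, how a namespace-blind conversion to a JSON-like dict silently loses one of them, and a local-name match that spans both versions (the "{*}" wildcard needs Python 3.8 or later).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def split_qname(tag):
    """Split ElementTree's '{namespace}local' tag into (namespace, local)."""
    if tag.startswith("{"):
        ns, local = tag[1:].split("}", 1)
        return ns, local
    return "", tag


def namespace_only_conflicts(parent):
    """Local names that child elements of `parent` use under more than one namespace."""
    seen = defaultdict(set)
    for child in parent:
        ns, local = split_qname(child.tag)
        seen[local].add(ns)
    return {local: sorted(nss) for local, nss in seen.items() if len(nss) > 1}


def naive_json(elem):
    """Namespace-blind conversion: a later sibling overwrites an earlier one
    whenever both share a local name."""
    if len(elem) == 0:
        return elem.text
    return {split_qname(child.tag)[1]: naive_json(child) for child in elem}


# Hypothetical infoset with the same local name in two version namespaces.
INFOSET = """<record xmlns:v1="urn:example:v1" xmlns:v2="urn:example:v2">
  <v1:header>A</v1:header>
  <v2:header>B</v2:header>
  <body>payload</body>
</record>"""

root = ET.fromstring(INFOSET)
conflicts = namespace_only_conflicts(root)
as_json = naive_json(root)
# Match on local name only, whatever the namespace.
headers = [e.text for e in root.findall("{*}header")]

if __name__ == "__main__":
    print(conflicts)
    print(as_json)
    print(headers)
```

In this sample the conversion keeps only one "header" value, which is the kind of data loss that makes the result impossible to unparse.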

From mbeckerle at apache.org Thu May 9 06:53:49 2024
From: mbeckerle at apache.org (Mike Beckerle)
Date: Thu, 9 May 2024 09:53:49 -0400
Subject: [DFDL-WG] Agenda for DFDL WG Call 2024-05-09

https://github.com/OpenGridForum/DFDL/blob/master/calls/2024/2024-05-09_DFDL-WG-Call.md

From smhdfdl at gmail.com Thu May 9 11:33:07 2024
From: smhdfdl at gmail.com (Steve Hanson)
Date: Thu, 9 May 2024 19:33:07 +0100
Subject: [DFDL-WG] DFDL implementation support for element refs

Hi Mike

I get the rationale behind having a restricted subset that makes mapping to non-namespaced targets easier. But that isn't a reason to error the use of element references. I think you just need to error the occurrence of any target namespace in the root schema, and any use of xsd:import for a non-DFDL namespace. That also stops the use of namespaces for simple type restrictions and groups, but I think that's fine. If you don't want namespaces involved, outlaw them completely.

Regards
Steve
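
The alternative rule suggested in the last message of this thread (error on any target namespace in the root schema and on any xsd:import of a non-DFDL namespace, while still allowing element references) can be sketched in Python as follows. This is only one reading of that suggestion, not an existing feature of IBM DFDL or Daffodil: the function name namespace_free_violations and the sample schemas are hypothetical, and "non-DFDL namespace" is taken to mean any namespace other than the DFDL 1.0 annotation namespace.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
DFDL = "http://www.ogf.org/dfdl/dfdl-1.0/"  # DFDL 1.0 annotation namespace


def namespace_free_violations(xsd_text):
    """Return the namespace-related violations in one schema file (hypothetical rule)."""
    root = ET.fromstring(xsd_text)
    problems = []

    # Any target namespace is an error under this rule.
    target = root.get("targetNamespace")
    if target is not None:
        problems.append(f"targetNamespace {target!r}")

    # Any xs:import of a namespace other than the DFDL one is an error.
    for imp in root.iter(f"{{{XS}}}import"):
        ns = imp.get("namespace")
        if ns != DFDL:
            problems.append(f"import of non-DFDL namespace {ns!r}")

    # Element references are deliberately NOT checked: they stay legal.
    return problems


# Hypothetical interchange schema: no namespaces, but element refs are used.
ALLOWED = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:include schemaLocation="messages.dfdl.xsd"/>
  <xs:element name="interchange">
    <xs:complexType><xs:sequence>
      <xs:element ref="message" maxOccurs="unbounded"/>
    </xs:sequence></xs:complexType>
  </xs:element>
</xs:schema>"""

# Hypothetical schema that brings namespaces in twice.
REJECTED = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
                         targetNamespace="urn:example:root">
  <xs:import namespace="urn:example:types" schemaLocation="types.dfdl.xsd"/>
</xs:schema>"""

if __name__ == "__main__":
    print(namespace_free_violations(ALLOWED))
    print(namespace_free_violations(REJECTED))
```

The ALLOWED sample mirrors the EDIFACT-style pattern from earlier in the thread, where an interchange element pulls in message elements by reference, which this rule permits as long as no namespaces are involved.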