* INCOMPLETE
* XML Schema Inference Rules
** Requirements
XmlReader:
- It must not expose EntityReference nodes.
- It must not contain xsd:* elements.
XmlSchemaSet: only a schema set that was generated by this utility
class is accepted. See the particle inference section described later.
In practice the MS implementation does not check this input
sufficiently, so it accepts more than it is designed to handle.
*** Allowed schema components
Before inferring particles that are merged with the existing particles
in the XmlSchemaSet, we have to know what is expected and what is not:
- facets are not supported. [a014.xsd]
- xs:all is not supported. [a003.xsd]
- xs:group (ref) is not supported. [a004.xsd]
- xs:choice that does not contain xs:sequence is not
supported [a005.xsd].
- xs:any is not supported. Only xs:element are expected
to be contained in xs:sequence. [a011.xsd]
- same-name particles that are still not ambiguous
are computed into invalid particles. It looks
like an unintended MS bug. [a010.xsd]
- attributeGroup does not seem to be expected there (MS has a
bug around here). [a006.xsd]
- anyAttribute is not regarded as a valid particle, and
the output complexType definition just strips it out.
[a013.xsd]
- substitutionGroup, however, is not rejected and remains
in the output. [a001.xsd]
-> It must be rejected, because it breaks choice compatibility.
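The list above can be turned into a quick pre-check of an input schema.
The following is only a sketch in Python, not part of the inference
engine: the function name find_unexpected_components and the labels it
returns are made up for illustration, and it only looks at element
names and attributes, so it reports candidates rather than proving that
a schema is acceptable.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Components the notes describe as unsupported or not expected on input.
UNEXPECTED = {"all", "any", "attributeGroup", "anyAttribute"}
# The constraining facets of XML Schema 1.0.
FACETS = {
    "length", "minLength", "maxLength", "pattern", "enumeration",
    "whiteSpace", "maxInclusive", "maxExclusive", "minInclusive",
    "minExclusive", "totalDigits", "fractionDigits",
}


def find_unexpected_components(xsd_text):
    """Return labels of unexpected schema components, in document order."""
    found = []
    for node in ET.fromstring(xsd_text).iter():
        if not isinstance(node.tag, str) or not node.tag.startswith(XS):
            continue
        local = node.tag[len(XS):]
        if local in UNEXPECTED or local in FACETS:
            found.append(local)
        elif local == "group" and "ref" in node.attrib:
            found.append("group ref")
        elif local == "choice" and node.find(XS + "sequence") is None:
            # a choice is only expected to contain xs:sequence
            found.append("choice without sequence")
        elif local == "element" and "substitutionGroup" in node.attrib:
            found.append("substitutionGroup")
    return found
```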
** Processing model
First, the XmlSchemaSet parameter is compiled[*1] and interpreted into
an internal schema representation that is used to examine the
XmlReader input. The resulting XmlSchemaSet is the same
as the input XmlSchemaSet.
[*1] FIXME: this design might change.
The XmlSchemaSet is compiled because it might contain
XmlSchemaInclude items, so it is not possible to process inference
directly inside the input schema set. However, reusing the input
avoids some annoyance, such as having to preserve elementFormDefault
etc.
Second, XmlReader is moved to content (document element) and
"element inference" starts from here (described later).
The resulting XmlSchemaSet keeps the original XmlSchemas in itself.
For example, it keeps elementFormDefault and attributeFormDefault.
Basically it processes the XmlReader against the existing XmlSchemaSet
and does not "merge" two XmlSchemaSets, one of which would be newly
inferred from this XmlReader, because the XmlReader has to infer
sequential nodes (siblings) anyway.
Once the element definition is determined (or created), any other
branches in the schema are ignored.
** Attributes
*** attribute component definitions and references.
**** ignored attributes
xsi:type, xsi:schemaLocation and xsi:noNamespaceSchemaLocation
attributes are ignored.
**** special attributes
If xsi:nil exists, then the element's content is not handled, while its
attributes are handled.
The schema for xml:* attributes is predetermined; there is a fixed
schema for that namespace.
**** namespaced attributes
miscellaneous attributes that reside in a certain namespace are
referenced as
**** local attributes
miscellaneous attributes are represented as
*** attribute occurrence
When defining a complexType for a newly created element, the attribute
can be set as "required". Otherwise, it must be set as "optional".
For every occurrence of an element instance, all known attributes are
tested for existence, and any attribute that is missing must be set as
"optional".
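As an illustration of the rule above, here is a small Python sketch.
It is not the engine's code: the function name infer_attribute_use and
the data shape (one set of attribute names per occurrence of the same
element, in document order) are assumptions made for the example.

```python
def infer_attribute_use(occurrences):
    """Map each attribute name to "required" or "optional".

    occurrences: list of sets of attribute names, one set per occurrence
    of the same element (hypothetical input shape).
    """
    use = {}
    for index, attrs in enumerate(occurrences):
        for name in attrs:
            if name not in use:
                # Only the occurrence that creates the complexType may
                # introduce a required attribute.
                use[name] = "required" if index == 0 else "optional"
        for name in use:
            if name not in attrs:
                # A known attribute is missing on this occurrence.
                use[name] = "optional"
    return use
```

For example, the occurrences {id, lang}, {id} and {id, class} leave only
id as "required".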
*** attribute value types
FIXME: need to describe the relaxation of attribute value types.
** Content model inference
*** inference processing model
The content model consists of two parts:
- content type : empty | elementOnly | textOnly | mixed
- particle : sequence | choice | all | groupRef
On processing reader.Read(), the node is first "tested" against
current schema content model. If the current node on the XmlReader
is not acceptable, then "content model expansion" happens.
- If the current node is text content, then process the
text node according to "evaluating text content".
- If the current node is an element, then process it
in accordance with "evaluating particle".
*** evaluating element
When an element occurs, it must be accepted as a particle.
First, the content type must be examined:
- If the content type was simpleType, then it is changed
into complexType with complexContent and mixed='true'.
The inferred content particle must be optional.
- If the content type was empty, then it is changed into
complexType with complexContent (it is not mixed, unlike
above). The inferred content particle must be optional.
- If the content type was elementOnly or mixed, there is no need
to change it.
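The three cases above amount to a small state transition. A Python
sketch follows, under the assumption that content types are plain
strings named as in the "inference processing model" list; on_element
is a hypothetical name, and the second return value only says whether
the rule above forces the new particle to be optional.

```python
def on_element(content_type):
    """Return (new_content_type, particle_forced_optional)."""
    if content_type == "textOnly":
        # simpleType becomes complexType / complexContent, mixed='true'
        return "mixed", True
    if content_type == "empty":
        # complexContent, not mixed
        return "elementOnly", True
    if content_type in ("elementOnly", "mixed"):
        # no change; occurrence is left to particle evaluation
        return content_type, False
    raise ValueError("unknown content type: " + content_type)
```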
Next, the content particle must be evaluated.
According to the input XmlSchemaSet limitations, there will be
only these patterns listed here:
- empty content
- simple content
- sequence (of element particles)
- choice of sequences
**** Reader progress
Every element is tested against current element candidates.
- When the target element is a document element, then all
the global elements in XmlSchemaSet are the candidates.
- If there is a matching name, then that element
definition is used as the context element for
the node's content, and the current particle position is
in front of the first particle.
- If there isn't, then the inference engine creates
a new element definition, and its content is none
(none != empty).
- When the target element is inferred in a new element
definition, then
**** Particle inference
IMPORTANT: Here I tried to formalize the inference, but these are
incomplete notes.
Target {particle} to add:
    isNew -> ...
    !isNew -> ...
no definition
    // define complexType and add {particle} to .Particle
    toComplexType()
    processContent(ct.Particle, isNew)
simpleType
    makeComplexContent()
complexType
    empty definition (no content model, no particle)
        // -> add xs:element name={name} minOccurs="0" to .Particle
        -> processContent(ct.Particle, isNew)
    simple content
        -> makeComplexContent()
    complex content / extension
        -> processContent(cce.Particle, isNew)
    complex content / restriction
        -> processContent(ccr.Particle, isNew)
    .Particle
        -> processContent(ct.Particle, isNew)
makeComplexContent()
    change to complexType which has complex content mixed="true" and
    extension. Discard simple type information. Add {particle} to
    extension's .Particle.
processContent(Particle particle, isNew)
    if particle is either empty or sequence
        processSequential(particle, 0, false, isNew)
    else if particle is sequence of choices
        processLax(particle, 0)
    else
        error.
processSequential(Sequence particle, int index, bool consumed, bool isNew)
    particle.Count <= index
        -> appendSequential(particle, isNew)
    sequence
        if (particle[index] has the same name)
            -> if (consumed) then sequence[index].maxOccurs = inf.
               InferElement (sequence[index])
               processParticles(particle, index, true)
        else
            -> if (!consumed)
                   sequence[index].minOccurs = 0.
               processParticle(particle, index+1, false)
    else
        particle = toSequenceOfChoice(particle)
        processLax(particle, index)
processLax(choice, index)
    foreach (element el in choice.Items)
        if (el has the same name)
            InferElement (el)
            processLax(choice, index + 1)
            return;
    appendLax(particle)
appendSequential(particle)
    if (particle is empty)
        make particle as sequence
    sequence.Items.Add(InferElement(null))
appendLax(choice)
    choice.Items.Add(InferElement(null))
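To make the sequential and lax processing easier to follow, here is a
much simplified, runnable Python reading of the notes above. It is a
sketch under several assumptions and not the actual implementation:
the content model is a plain dict, child elements are reduced to their
names (InferElement is not modelled), the "choice" kind stands in for
toSequenceOfChoice, an element whose name was already passed in the
sequence triggers the switch to lax processing, and particles not
reached by the end of an instance are made optional.

```python
UNBOUNDED = float("inf")


def infer_children(model, names, is_new):
    """Fold the child element names of one instance into model.

    model: {"kind": "sequence" | "choice", "items": [particle, ...]}
    where a particle is {"name", "min", "max"} (hypothetical shape).
    """
    index, consumed = 0, False
    items = model["items"]
    for name in names:
        if model["kind"] == "sequence" and any(
                p["name"] == name for p in items[:index]):
            # Out-of-order repeat: give up on ordering (assumed rule).
            model["kind"] = "choice"
        if model["kind"] == "choice":
            # processLax / appendLax: add the name if it is unknown.
            if all(p["name"] != name for p in items):
                items.append({"name": name, "min": 1, "max": 1})
            continue
        # processSequential: step over particles that do not match.
        while index < len(items) and items[index]["name"] != name:
            if not consumed:
                items[index]["min"] = 0
            index, consumed = index + 1, False
        if index == len(items):
            # appendSequential: required only inside a brand-new type.
            items.append(
                {"name": name, "min": 1 if is_new else 0, "max": 1})
        elif consumed:
            items[index]["max"] = UNBOUNDED
        consumed = True
    if model["kind"] == "sequence":
        # Assumption: particles never reached become optional.
        for particle in items[index + 1 if consumed else index:]:
            particle["min"] = 0
    return model
```

Feeding the child lists (a, b), then (a, a, c), then (b, a) into the
same model first widens a to maxOccurs unbounded and makes b and c
optional, and finally turns the sequence into a choice.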
*** evaluating text content
When text content occurs, it must be accepted as simple content.
- If the content type was textOnly, then "type relaxation"
happens (described later).
- If the content type was already mixed, then it is skipped.
- If the content type was elementOnly, then the content type
becomes mixed and the text is then skipped.
- If the content type was empty, then its content type
becomes text and the text is then skipped. The type is xs:string
(no type promotion will happen since an empty value cannot be
accepted as any of the other types handled in this design).
(Actually inference is done from pre-compilation information.)
Note that type relaxation happens only when the content is inferred as
textOnly, and in that case it always occurs.
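The four cases for text nodes can be written as a lookup table. This
Python sketch assumes content types are plain strings named as in the
"inference processing model" list; on_text and the action labels are
hypothetical names chosen for the example.

```python
def on_text(content_type):
    """Return (new_content_type, action) when a text node appears."""
    transitions = {
        "textOnly": ("textOnly", "relax type"),
        "mixed": ("mixed", "skip"),
        "elementOnly": ("mixed", "skip"),
        # an empty type becomes text content typed as xs:string
        "empty": ("textOnly", "use xs:string"),
    }
    return transitions[content_type]
```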
** Type inference
All data types are inferred from a string value: either element content
or an attribute value.
*** primitive type inference
When a string is being evaluated as an xs:* typed value, it is
tried against several types.
- First, it is evaluated as xs:boolean: true, false, 1 or 0.
- Next, its integer value is computed. If that is
successful, then its value range is examined to see whether it
matches unsignedByte, byte, unsignedShort, short,
unsignedInt, int, unsignedLong, long, or integer.
- If it was not an integer, then it is evaluated as a float
number, as a double number, and then as a decimal number
as well.
- Next, it is examined as xs:dateTime, xs:duration and
related schema types.
- If it did not match any of the predefined types, then
xs:string is inferred. No other string-based types (such
as xs:token) are inferred.
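The trial order above can be sketched in Python as a chain of checks
where the first match wins. This is an illustration only: the function
name infer_type and the regular expressions are assumptions, the
"related schema types" (xs:date, xs:time and so on) are left out, and
because the notes do not say how float, double and decimal are told
apart, the sketch simply reports xs:float for any floating-point
lexical form.

```python
import re
from datetime import datetime

# Integer types in the order the notes list them; first fit wins.
INTEGER_RANGES = [
    ("xs:unsignedByte", 0, 2**8 - 1),
    ("xs:byte", -2**7, 2**7 - 1),
    ("xs:unsignedShort", 0, 2**16 - 1),
    ("xs:short", -2**15, 2**15 - 1),
    ("xs:unsignedInt", 0, 2**32 - 1),
    ("xs:int", -2**31, 2**31 - 1),
    ("xs:unsignedLong", 0, 2**64 - 1),
    ("xs:long", -2**63, 2**63 - 1),
]
INTEGER_RE = re.compile(r"[+-]?[0-9]+")
FLOAT_RE = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?")
DATETIME_RE = re.compile(
    r"[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}"
    r"(\.[0-9]+)?(Z|[+-][0-9]{2}:[0-9]{2})?")
DURATION_RE = re.compile(
    r"-?P(?=.)([0-9]+Y)?([0-9]+M)?([0-9]+D)?"
    r"(T(?=.)([0-9]+H)?([0-9]+M)?([0-9]+(\.[0-9]+)?S)?)?")


def is_datetime(text):
    if not DATETIME_RE.fullmatch(text):
        return False
    try:
        # Validate field ranges on the date and time part only.
        datetime.strptime(text[:19], "%Y-%m-%dT%H:%M:%S")
    except ValueError:
        return False
    return True


def infer_type(value):
    text = value.strip()
    if text in ("true", "false", "1", "0"):
        return "xs:boolean"
    if INTEGER_RE.fullmatch(text):
        number = int(text)
        for name, low, high in INTEGER_RANGES:
            if low <= number <= high:
                return name
        return "xs:integer"
    if FLOAT_RE.fullmatch(text):
        return "xs:float"  # assumption: float/double/decimal not split
    if is_datetime(text):
        return "xs:dateTime"
    if DURATION_RE.fullmatch(text):
        return "xs:duration"
    return "xs:string"
```

Note that with boolean tried first, "1" and "0" are reported as
xs:boolean rather than as an integer type.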
*** type relaxation
When a string value has to be accepted by an existing type, the type
might have to change to accept it.
For example:
- xs:int cannot accept "abc"
- string with maxLength="3" cannot accept "abcd"
(facets are never created anyway and thus are not supported
by this inference engine)
- 12345 is not acceptable for xs:unsignedByte, but is acceptable
for unsignedShort
Here, the new string value is inferred into a simpleType, and then
the processor computes the most specific common type between
the existing type and the newly inferred type.
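One way to compute such a common type is to walk up a widening chain
from both types until they meet. The Python sketch below assumes a
chain modelled on the XML Schema built-in derivations, restricted to
the integer types named in these notes (so xs:unsignedLong is placed
directly under xs:integer), with xs:string as the fallback for
everything else. The notes do not define this lattice, so the PARENT
table and the name common_type are assumptions.

```python
# Assumed widening steps; any type not listed widens to xs:string.
PARENT = {
    "xs:unsignedByte": "xs:unsignedShort",
    "xs:unsignedShort": "xs:unsignedInt",
    "xs:unsignedInt": "xs:unsignedLong",
    "xs:unsignedLong": "xs:integer",
    "xs:byte": "xs:short",
    "xs:short": "xs:int",
    "xs:int": "xs:long",
    "xs:long": "xs:integer",
    "xs:integer": "xs:decimal",
}


def ancestors(type_name):
    """Return the type followed by each wider type, ending at xs:string."""
    chain = [type_name]
    while chain[-1] != "xs:string":
        chain.append(PARENT.get(chain[-1], "xs:string"))
    return chain


def common_type(existing, inferred):
    """Return the most specific type that accepts both arguments."""
    wider = ancestors(existing)
    for candidate in ancestors(inferred):
        if candidate in wider:
            return candidate
    return "xs:string"  # not reached: both chains end at xs:string
```

With this table, an existing xs:unsignedByte that meets the value 12345
(inferred as xs:unsignedShort) relaxes to xs:unsignedShort, while
xs:int meeting "abc" relaxes to xs:string.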