XPath was designed for exactly this. Java provides support for it in the javax.xml.xpath package.
To do what you want, the code will look something like this:
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

List<String> findRelations(String word,
                           Path xmlFile)
throws XPathException {

    String xmlLocation = xmlFile.toUri().toASCIIString();
    XPath xpath = XPathFactory.newInstance().newXPath();

    // Step 1: find the synset id of the first entry whose form matches the word.
    xpath.setXPathVariableResolver(
        name -> (name.getLocalPart().equals("word") ? word : null));
    String id = xpath.evaluate(
        "//LexicalEntry[WordForm/@writtenForm=$word or Lemma/@writtenForm=$word]/Sense/@synset",
        new InputSource(xmlLocation));

    // Step 2: find every SynsetRelation belonging to that synset.
    xpath.setXPathVariableResolver(
        name -> (name.getLocalPart().equals("id") ? id : null));
    NodeList matches = (NodeList) xpath.evaluate(
        "//Synset[@id=$id]/SynsetRelations/SynsetRelation",
        new InputSource(xmlLocation),
        XPathConstants.NODESET);

    List<String> relations = new ArrayList<>();
    int matchCount = matches.getLength();
    for (int i = 0; i < matchCount; i++) {
        Element match = (Element) matches.item(i);
        String relType = match.getAttribute("relType");
        String synset = match.getAttribute("targets");

        // Step 3: collect the written forms of the entries in the target synset.
        xpath.setXPathVariableResolver(
            name -> (name.getLocalPart().equals("synset") ? synset : null));
        NodeList formNodes = (NodeList) xpath.evaluate(
            "//LexicalEntry[Sense/@synset=$synset]/WordForm/@writtenForm",
            new InputSource(xmlLocation),
            XPathConstants.NODESET);

        int formCount = formNodes.getLength();
        StringJoiner forms = new StringJoiner(",");
        for (int j = 0; j < formCount; j++) {
            forms.add(
                formNodes.item(j).getNodeValue());
        }

        relations.add(
            String.format("%s %s %s", word, relType, forms));
    }

    return relations;
}
Some basic XPath information:
- XPath uses a single file-path-like string to match parts of an XML document. The parts can be any structural part of the document: text, elements, attributes, even things like comments.
- A Java XPath expression can be evaluated to return exactly one matched part, or every matched part, or a String. When a String is requested and several parts match, the result is the string value of the first match in document order.
- In an XPath expression, a name by itself represents an element. For example, WordForm in XPath means any <WordForm …> element in the XML document.
- A name starting with @ represents an attribute. For example, @writtenForm refers to any writtenForm=… attribute in the XML document.
- A slash indicates a parent and child in an XML document. LexicalEntry/Lemma means any <Lemma> element which is a direct child of a <LexicalEntry> element. Synset/@id means the id=… attribute of any <Synset> element.
- Just as a path starting with / indicates an absolute (root-relative) path in Unix, an XPath starting with a slash indicates an expression relative to the root of an XML document.
- Two slashes means a descendant which may be a direct child, a grandchild, a great-grandchild, etc. Thus, //LexicalEntry means any LexicalEntry in the document; /LexicalEntry only matches a LexicalEntry element which is the root element.
- Square brackets indicate match qualifiers. Synset[@baseConcept='3'] matches any <Synset> element with a baseConcept attribute whose value is the string "3".
- XPath can refer to variables, which are defined externally, using Unix-shell-like $ substitutions, like $word. How those variables are passed to an XPath expression depends on the engine. Java uses the setXPathVariableResolver method. Variable names are in a completely separate namespace from node names, so it is of no consequence if a variable name is the same as an element name or attribute name in the XML document.
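The rules above can be tried out with a short, self-contained sketch. The XML string below is a hypothetical sample invented for this illustration (only the element and attribute names come from the question), and the class name XPathBasics and the helper count are likewise made up here; this is just one way to experiment, not the only one.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathBasics {
    // Hypothetical sample document; the root element name is an assumption.
    static final String XML =
        "<LexicalResource>"
        + "<LexicalEntry><Lemma writtenForm='dog'/></LexicalEntry>"
        + "<LexicalEntry><Lemma writtenForm='cat'/></LexicalEntry>"
        + "<Synset id='s1' baseConcept='3'/>"
        + "<Synset id='s2' baseConcept='1'/>"
        + "</LexicalResource>";

    // Returns how many nodes the expression matches; $word is bound to 'word'.
    static int count(String expression, String word) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setXPathVariableResolver(
            name -> name.getLocalPart().equals("word") ? word : null);
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                expression,
                new InputSource(new StringReader(XML)),
                XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            // Wrapped only to keep the sketch short.
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("//LexicalEntry", ""));                    // 2: any depth
        System.out.println(count("/LexicalEntry", ""));                     // 0: not the root element
        System.out.println(count("//LexicalEntry/Lemma/@writtenForm", "")); // 2: attributes of children
        System.out.println(count("//Synset[@baseConcept='3']", ""));        // 1: qualifier in brackets
        System.out.println(count("//LexicalEntry[Lemma/@writtenForm=$word]", "cat")); // 1: variable
    }
}
```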
So, the XPath expressions in the code mean:
//LexicalEntry[WordForm/@writtenForm=$word or Lemma/@writtenForm=$word]/Sense/@synset
Match any <LexicalEntry> element anywhere in the XML document which has either
- a WordForm child with a writtenForm attribute whose value is equal to the word variable
- a Lemma child with a writtenForm attribute whose value is equal to the word variable
and for every such <LexicalEntry> element, return the value of the synset attribute of any <Sense> element which is a direct child of the <LexicalEntry> element.
The word variable is defined externally, by a call to xpath.setXPathVariableResolver, right before the XPath expression is evaluated.
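One detail worth checking by hand is what this expression yields when a word has more than one sense. The sketch below runs the same expression against a hypothetical one-entry sample (invented for this illustration), once with the two-argument evaluate used in the code above and once with XPathConstants.NODESET. The class and method names are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SynsetLookup {
    // Hypothetical sample: a single entry that has two senses.
    static final String XML =
        "<LexicalResource>"
        + "<LexicalEntry><Lemma writtenForm='bank'/>"
        + "<Sense synset='s1'/><Sense synset='s2'/></LexicalEntry>"
        + "</LexicalResource>";

    static final String EXPRESSION =
        "//LexicalEntry[WordForm/@writtenForm=$word or Lemma/@writtenForm=$word]/Sense/@synset";

    private static XPath xpathFor(String word) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setXPathVariableResolver(
            name -> name.getLocalPart().equals("word") ? word : null);
        return xpath;
    }

    // Two-argument evaluate: the string value of the first match, or "" if none.
    static String firstSynset(String word) {
        try {
            return xpathFor(word).evaluate(
                EXPRESSION, new InputSource(new StringReader(XML)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    // NODESET evaluate: every matching attribute, in document order.
    static List<String> allSynsets(String word) {
        try {
            NodeList nodes = (NodeList) xpathFor(word).evaluate(
                EXPRESSION,
                new InputSource(new StringReader(XML)),
                XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstSynset("bank"));            // s1
        System.out.println(allSynsets("bank"));             // [s1, s2]
        System.out.println(firstSynset("river").isEmpty()); // true
    }
}
```

If every sense of a word should be followed, the NODESET variant is probably the one to use; the String variant silently keeps only the first.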
//Synset[@id=$id]/SynsetRelations/SynsetRelation
Match any <Synset> element anywhere in the XML document whose id attribute is equal to the id variable. For each such <Synset> element, look for any direct SynsetRelations child element, and return each of its direct SynsetRelation children.
The id variable is defined externally, by a call to xpath.setXPathVariableResolver, right before the XPath expression is evaluated.
//LexicalEntry[Sense/@synset=$synset]/WordForm/@writtenForm
Match any <LexicalEntry> element anywhere in the XML document which has a <Sense> child element which has a synset attribute whose value is identical to the synset variable. For each matched element, find any <WordForm> child element and return that element’s writtenForm attribute.
The synset variable is defined externally, by a call to xpath.setXPathVariableResolver, right before the XPath expression is evaluated.
Logically, the above amounts to:
- Locate the synset value for the requested word.
- Use the synset value to locate SynsetRelation elements.
- Locate writtenForm values corresponding to the targets value of each matched SynsetRelation.
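The three steps can also be sketched against a document that is parsed only once, which avoids re-reading the XML for every expression. This is a variation under stated assumptions, not a replacement for the code above: the sample XML is hypothetical, the names RelationLookup and lookup are invented for the example, and the variables are kept in a Map so a single resolver serves all three expressions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RelationLookup {
    // Hypothetical sample data shaped like the elements named in this answer.
    static final String XML =
        "<LexicalResource>"
        + "<LexicalEntry><Lemma writtenForm='dog'/><WordForm writtenForm='dogs'/>"
        + "<Sense synset='s1'/></LexicalEntry>"
        + "<LexicalEntry><Lemma writtenForm='animal'/><WordForm writtenForm='animals'/>"
        + "<Sense synset='s2'/></LexicalEntry>"
        + "<Synset id='s1'><SynsetRelations>"
        + "<SynsetRelation relType='hypernym' targets='s2'/>"
        + "</SynsetRelations></Synset>"
        + "<Synset id='s2'><SynsetRelations/></Synset>"
        + "</LexicalResource>";

    static List<String> findRelations(String word, Document doc)
            throws XPathExpressionException {
        // One resolver for all expressions; variables are looked up by name.
        Map<String, Object> vars = new HashMap<>();
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setXPathVariableResolver(name -> vars.get(name.getLocalPart()));

        // Step 1: synset id of the first entry matching the word ("" if none).
        vars.put("word", word);
        String id = xpath.evaluate(
            "//LexicalEntry[WordForm/@writtenForm=$word or Lemma/@writtenForm=$word]/Sense/@synset",
            doc);

        // Step 2: the relations of that synset.
        vars.put("id", id);
        NodeList matches = (NodeList) xpath.evaluate(
            "//Synset[@id=$id]/SynsetRelations/SynsetRelation",
            doc, XPathConstants.NODESET);

        List<String> relations = new ArrayList<>();
        for (int i = 0; i < matches.getLength(); i++) {
            Element match = (Element) matches.item(i);

            // Step 3: written forms of the entries in the target synset.
            vars.put("synset", match.getAttribute("targets"));
            NodeList formNodes = (NodeList) xpath.evaluate(
                "//LexicalEntry[Sense/@synset=$synset]/WordForm/@writtenForm",
                doc, XPathConstants.NODESET);
            List<String> forms = new ArrayList<>();
            for (int j = 0; j < formNodes.getLength(); j++) {
                forms.add(formNodes.item(j).getNodeValue());
            }
            relations.add(String.format("%s %s %s",
                word, match.getAttribute("relType"), String.join(",", forms)));
        }
        return relations;
    }

    // Parses the sample once, then runs all three queries on the same DOM.
    static List<String> lookup(String word) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            return findRelations(word, doc);
        } catch (Exception e) {
            // Wrapped only to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(lookup("dog")); // [dog hypernym animals]
    }
}
```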