Manipulate XMLs using XPath

From Media Design: Networked & Lens-Based wiki

Manipulate xml

XML is a common format for organizing data: vector graphics (.svg), Scribus files (.sla), OpenStreetMap maps (.osm), Ardour multitrack audio projects (.ardour), and Calibre book metadata are all stored as XML.

We can use XPath syntax to query and change XML files.

http://www.w3schools.com/xpath/xpath_syntax.asp provides a reasonable summary of the XPath syntax.
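
To get a feel for the syntax before touching real files, here is a minimal sketch that runs a few path expressions against a small inline document. The sample data is made up for illustration, and the sketch uses Python's standard library xml.etree.ElementTree rather than lxml, so it only covers the subset of XPath that module supports (paths start with `.//` instead of `//`).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, loosely modelled on an .osm file
SAMPLE = """<osm>
  <node id="1" lat="52.51" lon="13.39"/>
  <way id="10">
    <nd ref="1"/>
    <tag k="name" v="Französische Straße"/>
    <tag k="highway" v="residential"/>
  </way>
  <way id="11">
    <tag k="name" v="Oranienstrasse"/>
  </way>
</osm>"""

root = ET.fromstring(SAMPLE)

# .//tag  selects every tag element below the root, at any depth
all_tags = root.findall(".//tag")

# [@k='name']  keeps only the elements whose k attribute equals 'name'
name_tags = root.findall(".//tag[@k='name']")

# way[nd]  selects way elements that have at least one nd child
ways_with_nd = root.findall("way[nd]")

names = [t.get("v") for t in name_tags]
way_ids = [w.get("id") for w in ways_with_nd]
print(len(all_tags), names, way_ids)
```

The same expressions, written with a leading `//`, should also work with lxml's xpath() method, which accepts the full XPath 1.0 syntax.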


I will be giving two examples:


1) manipulates the names of the streets found in an OpenStreetMap (.osm) file. The name information is found within the tag element, under the attribute v

<tag k="name" v="Französische Straße"/>

I will use a regex to remove "Straße" from all the tag elements within my .osm file

 
#!/usr/bin/env python3
# encoding: utf-8

import re
import lxml.etree

f = "data-xberg.osm"  # my file
doc = lxml.etree.parse(f)  # parse the data
# select every tag element whose k attribute is 'name'
ways = doc.xpath("//tag[@k='name']")

for t in ways:
    strasse_v = t.get('v')
    # substitute "strasse"/"straße" (any case) with ""
    strasse_re = re.sub("(strasse)|(straße)", "", strasse_v, flags=re.I)
    t.set('v', strasse_re)  # write the new value back into the element
    print(strasse_re)

# serialize the lxml tree back to bytes
text = lxml.etree.tostring(doc, encoding="utf-8", xml_declaration=True)
print(text.decode("utf-8"))

# tostring() returns bytes when an encoding is given, so write in binary mode
with open("/home/andre/osmarender/data-manip.osm", "wb") as n:
    n.write(text)
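
The substitution used in the script above can be tried on plain strings, without any XML involved. This is only a sketch: strip_street_suffix is a hypothetical helper name and the street names are sample values. It also adds a strip() call, which the script does not have, because removing "Straße" from a two-word name leaves a trailing space behind.

```python
import re

# Same pattern as in the script; re.I makes the match case-insensitive
PATTERN = "(strasse)|(straße)"


def strip_street_suffix(name):
    """Remove 'strasse'/'straße' and any whitespace left at the ends."""
    return re.sub(PATTERN, "", name, flags=re.I).strip()


# Hypothetical sample names
samples = ["Französische Straße", "Oranienstrasse", "Kottbusser Damm"]
cleaned = [strip_street_suffix(s) for s in samples]
print(cleaned)
```

Names that do not contain the pattern, such as "Kottbusser Damm", pass through unchanged.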


2) removes all the area and circle elements present in the osm-map-features.xml file.

 
# parse osm-map-features.xml and remove the area and circle elements
# save to file osm-map-features_chg1.xml

import lxml.etree

f = "/home/andre/osmarender/osm-map-features.xml"
doc = lxml.etree.parse(f)

# elements to delete: all area and all circle
# syntax: ('//area|//circle')

for i in doc.xpath('//area|//circle'):
    i.getparent().remove(i)  # an element can only be removed through its parent

# serialize the lxml tree back to bytes
text = lxml.etree.tostring(doc, encoding="utf-8", xml_declaration=True)

# tostring() returns bytes when an encoding is given, so write in binary mode
with open("/home/andre/osmarender/osm-map-features_chg1.xml", "wb") as g:
    g.write(text)
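
If lxml is not available, a similar deletion can be sketched with the standard library's xml.etree.ElementTree. That module has no getparent() and does not support the `|` union in paths, so this version swaps in a different technique: it walks every element and removes the unwanted children from it. The sample document is made up for illustration and is not the real osm-map-features.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, loosely modelled on a rendering rules file
SAMPLE = """<rules>
  <rule e="way">
    <area class="park"/>
    <line class="road"/>
    <rule e="node">
      <circle r="2"/>
      <text k="name"/>
    </rule>
  </rule>
</rules>"""

root = ET.fromstring(SAMPLE)
unwanted = {"area", "circle"}

# Snapshot the elements first, so removing children does not disturb the walk
for parent in list(root.iter()):
    for child in list(parent):
        if child.tag in unwanted:
            parent.remove(child)

remaining = [el.tag for el in root.iter()]
print(remaining)
```

Everything nested inside a removed element goes away with it, so only the line, text and rule elements are left in the sample.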