XML Parser Using Python

Hello!! Learners,

We are going to parse a given XML file and extract some useful data from it in a structured way. Before that, two terms need explaining: XML and RSS. After that we will parse the file with Python's built-in xml module. The main focus of the article is the ElementTree XML API of this module; we will then do the same kind of parsing with the DOM (Document Object Model) API, which is pretty simple to use, arguably simpler than the ElementTree XML API. If you want a quick implementation with the DOM API, skip to the end of this article. Let's start with the ElementTree XML API.

XML: XML stands for eXtensible Markup Language. It was designed to store and transport data in a form that is both human- and machine-readable. Its design goals emphasize generality, usability and simplicity across the Internet. Here we are going to work with an RSS feed.

RSS: RSS (Rich Site Summary, often called Really Simple Syndication) uses a family of standard web feed formats to publish frequently updated information such as blog entries, news headlines, audio and video. An RSS document is XML-formatted plain text.
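To make the "RSS is XML" point concrete, here is a minimal sketch. The feed text below is made up for illustration (it is not taken from any real feed), and SAMPLE_RSS is a hypothetical name; the only library call used is ET.fromstring from the standard library.

```python
import xml.etree.ElementTree as ET

# A made-up, minimal RSS 2.0 document (assumption: for illustration only).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>http://example.com/</link>
    <description>A hypothetical feed</description>
    <item>
      <title>First headline</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""

# Parse the text into an Element (the root node of the document).
root = ET.fromstring(SAMPLE_RSS)

# Collect the title of every item element anywhere below the root.
titles = [item.findtext("title") for item in root.iter("item")]

print(root.tag, root.attrib["version"])
print(titles)
```

The root element of an RSS document is rss, it holds one channel, and each news story is an item inside that channel. That is the structure the rest of the article relies on.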

Steps we are going to follow –

a) Loading and saving RSS feed

import requests

def loadRSS():
    # url of rss feed
    url = 'http://www.hindustantimes.com/rss/topnews/rssfeed.xml'
    # creating HTTP response object from given url
    resp = requests.get(url)
    # saving the xml file
    with open('topnewsfeed.xml', 'wb') as f:
        f.write(resp.content)

We get an HTTP response that contains the XML data, which we save as topnewsfeed.xml in our local directory.
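If the requests package is not installed, the same download-and-save step can be sketched with the standard library. This is a swapped-in alternative, not the method used above: load_rss, its parameters and the timeout value are hypothetical names chosen for this example.

```python
import urllib.request


def load_rss(url, filename, timeout=10):
    """Download the document at `url` and save its raw bytes to `filename`.

    Hypothetical helper: the name and parameters are assumptions.
    """
    # urlopen raises urllib.error.URLError (or HTTPError) on failure,
    # so the output file is only opened after the download succeeded.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()
    # Write bytes as-is so the XML encoding is left untouched.
    with open(filename, "wb") as f:
        f.write(data)
    return len(data)
```

A call would look like load_rss(url, 'topnewsfeed.xml'), where url is the feed address. Returning the byte count is just a convenience for checking that something was actually saved.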

b) Parsing XML

The parseXML() function is created to parse the XML file. XML is an inherently hierarchical format, and the most natural way to represent it is with a tree. [image: tree representation of the XML document]

As discussed earlier, we are going to use the xml.etree.ElementTree module (imported as ET). It has two classes for this purpose: ElementTree represents the whole XML document as a tree, and Element represents a single node in this tree. Interactions with the entire document (reading and writing to/from files) are usually done on the ElementTree level. Interactions with a single XML element and its sub-elements are done on the Element level. Let's look at the parseXML() function:

import xml.etree.ElementTree as ET

def parseXML(xmlfile):
    # create element tree object
    tree = ET.parse(xmlfile)
    # get root element
    root = tree.getroot()
    # create empty list for news items
    newsitems = []
    # iterate news items
    for item in root.findall('./channel/item'):
        # empty news dictionary
        news = {}
        # iterate child elements of item
        for child in item:
            # special checking for namespace object content:media
            if child.tag == '{http://search.yahoo.com/mrss/}content':
                news['media'] = child.attrib['url']
            else:
                if child.text:
                    news[child.tag] = child.text.encode('utf8')
        # append news dictionary to news items list
        newsitems.append(news)
    # return news items list
    return newsitems

We will go through it step by step:

tree = ET.parse(xmlfile)

Here, an ElementTree object is created by parsing the passed xmlfile.

root = tree.getroot()

getroot() returns the root of the tree as an Element object.
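The difference between the two classes can be checked with a small sketch. The document below is a made-up stand-in for a file on disk, and xml_source is a hypothetical name; ET.parse accepts a file object as well as a filename.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical in-memory document standing in for a file on disk.
xml_source = io.StringIO("<rss><channel><title>Example</title></channel></rss>")

tree = ET.parse(xml_source)  # ElementTree: wraps the whole document
root = tree.getroot()        # Element: the top-level node of that document

print(type(tree).__name__)
print(type(root).__name__)
print(root.tag)
# Iterating an Element yields its direct children.
print([child.tag for child in root])
```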

for item in root.findall('./channel/item'):

Now, once you have looked at the structure of the XML file, you will notice that we are interested only in the item elements. ./channel/item is XPath syntax (XPath is a language for addressing parts of an XML document; ElementTree supports a limited subset of it). Here, we want to find all item elements that are children of the channel children of the root (denoted by '.'), in other words the item grandchildren of the root. We are now iterating through item elements, where each item element contains one news story. So, we create an empty news dictionary in which we will store all the data available about the item. To iterate through each child element of an element, we simply iterate over it, like this:

for child in item:
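The path expression and the child iteration can be tried on a small in-memory feed. This is only a sketch: FEED is a hypothetical name and its contents are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up feed text (assumption: for illustration only).
FEED = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>One</title><link>http://example.com/1</link></item>
    <item><title>Two</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""

root = ET.fromstring(FEED)

# './channel/item' selects item children of channel children of the root.
items = root.findall('./channel/item')
print(len(items))

# 'item' alone only looks at direct children of the root, so it finds nothing.
print(len(root.findall('item')))

# Iterating an Element yields its direct children in document order.
for item in items:
    print([(child.tag, child.text) for child in item])
```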

Now, look at a sample item element: [image: sample item element]

Namespaced tags need to be handled separately, because they are expanded to their full namespace URI when parsed. So, we do something like this:

if child.tag == '{http://search.yahoo.com/mrss/}content':
    news['media'] = child.attrib['url']

child.attrib is a dictionary of all the attributes of an element. Here, we are interested in the url attribute of the media:content namespaced tag. Now, for all other children, we simply do:

news[child.tag] = child.text.encode('utf8')

child.tag holds the name of the child element. child.text holds the text inside that child element. (In Python 3, encode('utf8') returns a bytes object; drop the call if you want plain str values.) So a sample item element is converted to a dictionary that looks like this:

{
    'description': 'Ignis features a hard competition already, from Hyun....',
    'guid': 'http://www.hindustantimes.com/autos/maruti-ignis-launch....',
    'link': 'http://www.hindustantimes.com/autos/maruti-ignis-launch....',
    'media': 'http://www.hindustantimes.com/rf/image_size_630x354/HT/...',
    'pubDate': 'Thu, 12 Jan 2017 12:33:04 GMT ',
    'title': 'Maruti Ignis launches on Jan 13: Five cars that threa.....'
}

Then, we append this dictionary to the list newsitems. Finally, this list is returned. [image: part of the data in newsitems]
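The namespace expansion and the resulting list of dictionaries can be reproduced without downloading anything. The sketch below follows the same loop as parseXML() but runs on a made-up item, and it stores child.text as a plain str instead of calling encode('utf8'). ITEM_XML and MEDIA_CONTENT are hypothetical names introduced for this example.

```python
import xml.etree.ElementTree as ET

# Made-up feed with one item; the media prefix is bound to the
# namespace URI that the parser substitutes into the tag name.
ITEM_XML = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>Example headline</title>
      <link>http://example.com/story</link>
      <media:content url="http://example.com/image.jpg" />
    </item>
  </channel>
</rss>"""

# After parsing, media:content shows up as '{namespace-uri}content'.
MEDIA_CONTENT = '{http://search.yahoo.com/mrss/}content'

root = ET.fromstring(ITEM_XML)
newsitems = []
for item in root.findall('./channel/item'):
    news = {}
    for child in item:
        if child.tag == MEDIA_CONTENT:
            # The image URL lives in an attribute, not in the text.
            news['media'] = child.attrib['url']
        elif child.text:
            news[child.tag] = child.text
    newsitems.append(news)

print(newsitems)
```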

c) Saving data to a CSV file

import csv

def savetoCSV(newsitems, filename):
    # specifying the fields for csv file
    fields = ['guid', 'title', 'pubDate', 'description', 'link', 'media']
    # writing to csv file
    with open(filename, 'w') as csvfile:
        # creating a csv dict writer object
        writer = csv.DictWriter(csvfile, fieldnames=fields)
        # writing headers (field names)
        writer.writeheader()
        # writing data rows
        writer.writerows(newsitems)
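One thing to watch with csv.DictWriter: by default it raises ValueError when a row contains a key that is not in fieldnames, and it fills missing keys with an empty string. The sketch below shows one way to tolerate extra keys, assuming you are happy to silently drop them. The rows are made up, and FIELDS, rows and buffer are hypothetical names; an in-memory buffer is used instead of a file so the result is easy to inspect.

```python
import csv
import io

FIELDS = ['guid', 'title', 'pubDate', 'description', 'link', 'media']

# Hypothetical rows: the first has an extra key, the second lacks most fields.
rows = [
    {'guid': 'g1', 'title': 'One', 'link': 'http://example.com/1', 'category': 'autos'},
    {'guid': 'g2', 'title': 'Two'},
]

# newline='' leaves line endings to the csv module, as its docs recommend.
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction='ignore')
writer.writeheader()
writer.writerows(rows)

print(buffer.getvalue())
```

When writing to a real file in Python 3, the same newline='' argument can be passed to open().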

The savetoCSV() function saves the data to a CSV file. [image: the resulting CSV file]

Now it is the turn of the DOM API. The Document Object Model (DOM) is a programming interface for HTML and XML (Extensible Markup Language) documents. It defines the logical structure of documents and the way a document is accessed and manipulated. Parsing XML with the DOM API in Python is pretty simple.

Example XML file (sample.xml):

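<!-- root element name "company" is an assumption; the other tag names are those used by the code below -->
<company>
    <name>LearnTurtle</name>
    <staff id="1"><name>Rishab Gupta</name><age>18</age></staff>
    <staff id="1"><name>Pravin mahajan</name><age>28</age></staff>
    <staff id="1"><name>Prajikta koli</name><age>26</age></staff>
</company>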

Implementation –

from xml.dom import minidom

doc = minidom.parse("sample.xml")

# doc.getElementsByTagName returns the NodeList
name = doc.getElementsByTagName("name")[0]
print(name.firstChild.data)

staffs = doc.getElementsByTagName("staff")
for staff in staffs:
    staff_id = staff.getAttribute("id")
    name = staff.getElementsByTagName("name")[0]
    age = staff.getElementsByTagName("age")[0]
    print("id:%s, name:%s, age:%s" %
          (staff_id, name.firstChild.data, age.firstChild.data))

Output:

LearnTurtle
id:1, name:Rishab Gupta, age:18
id:1, name:Pravin mahajan, age:28
id:1, name:Prajikta koli, age:26
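For comparison, the same kind of lookup can be sketched with the ElementTree API used in the first half of the article. The XML below is made up: the root element name company, the names and the ages are all assumptions for illustration, and SAMPLE and lines are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Made-up document with the same shape the DOM example expects:
# one top-level name element plus staff elements carrying an id attribute.
SAMPLE = """<company>
  <name>Example Co</name>
  <staff id="1"><name>Alice</name><age>30</age></staff>
  <staff id="2"><name>Bob</name><age>25</age></staff>
</company>"""

root = ET.fromstring(SAMPLE)

# findtext returns the text of the first matching direct child.
print(root.findtext('name'))

lines = []
for staff in root.findall('staff'):
    # get() reads an attribute; findtext() reads a child's text.
    lines.append("id:%s, name:%s, age:%s" % (
        staff.get('id'), staff.findtext('name'), staff.findtext('age')))

print("\n".join(lines))
```

Here findall('staff') only matches direct children of the root, whereas getElementsByTagName in the DOM version searches the whole document, which is why the DOM code has to pick out the first name element by index.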

Raj Kothari
Jul 5, 2020
ME(R/A)N | Machine Learning | Student Mentor | Mobile | Tech Writer | Learner