
LXML: Fast and flexible XML and HTML processing in Python


You can also process XML and HTML with BeautifulSoup, but it uses a lot of memory and can be too slow for large files. The lxml library opens and processes large XML or HTML files very quickly, so that is what we'll look at in this article.

You can install it from here. Let's explore the library through examples.

Let's start with XML. First, we import the lxml modules and define a string containing XML markup.


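~~~ {.python}
from lxml import html, etree

example_xml = """
<notes>
    <note>
        <to>Timmy</to>
        <from>Rich</from>
        <heading>Reminder</heading>
        <body>Remeber the concert tickets.</body>
    </note>
    <note>
        <to>Eric</to>
        <from>Josh</from>
        <heading>Ride Plans</heading>
        <body>Meet at the gas station on the corner of Diffley and 13 at 6:00 pm.</body>
    </note>
</notes>
"""
~~~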

The tags have the following meanings:

`notes`: the collection of notes

`to`: the recipient

`from`: the sender

`heading`: the heading

`body`: the note text

The XML structure defines two notes, each with a message and some additional information. Let's get some data out of this structure.

Using the `fromstring` function, we parse the XML string into an element object:

    notes = etree.fromstring(example_xml)
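
`fromstring` raises an exception when the markup is not well-formed, so it can help to guard the call when the input comes from elsewhere. The following is a minimal sketch, not a recommended pattern: `parse_notes` is a hypothetical helper name, and the two input strings are made up for illustration. `etree.XMLSyntaxError` is the exception lxml raises for malformed XML.

```python
from lxml import etree


def parse_notes(xml_text):
    """Hypothetical helper: return the root element, or None if the XML is malformed."""
    try:
        return etree.fromstring(xml_text)
    except etree.XMLSyntaxError as error:
        print("Could not parse XML: %s" % error)
        return None


good = parse_notes("<notes><note><to>Timmy</to></note></notes>")
bad = parse_notes("<notes><note></notes>")  # <note> is never closed

print(good.tag)     # notes
print(bad is None)  # True
```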

And this is how you can walk through all the direct children of the `notes` element:

~~~ {.python}
# Iterating over an element yields its direct children
for note in notes:
    print(note.tag)

# note
# note
~~~

Select the first `note` element for a closer look:

    note = notes[0]

Print the tag and text of each child of that `note` element:

~~~ {.python}
for field in note.iterchildren():
    print('%s: %s' % (field.tag, field.text))

# to: Timmy
# from: Rich
# heading: Reminder
# body: Remeber the concert tickets.
~~~
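
The same child-by-child walk can be used to turn each note into a plain dictionary, which is often easier to pass around than element objects. This is a small sketch under the assumption that every child of a note is a simple tag with text; the sample data and the name `records` are illustrative only.

```python
from lxml import etree

# Sample data in the same shape as the notes example (values are illustrative).
xml_text = """
<notes>
  <note><to>Timmy</to><from>Rich</from><heading>Reminder</heading></note>
  <note><to>Eric</to><from>Josh</from><heading>Ride Plans</heading></note>
</notes>
"""

notes = etree.fromstring(xml_text)

# One dict per <note>, mapping each child tag to its text.
records = [{field.tag: field.text for field in note} for note in notes]

for record in records:
    print(record)
```

Each dictionary holds the `to`, `from` and `heading` values of one note, in document order.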

Find all descendants with the tag `to`:

    for field in notes.findall('.//to'):
        print('Note to: %s' % field.text)

    # Note to: Timmy
    # Note to: Eric
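
Besides `findall`, elements also offer `findtext` for the text of the first match and, in lxml specifically, `xpath` for full XPath 1.0 expressions. The sketch below uses a shortened, made-up copy of the notes data to show both; treat it as an illustration rather than the only way to query.

```python
from lxml import etree

notes = etree.fromstring(
    "<notes>"
    "<note><to>Timmy</to><heading>Reminder</heading></note>"
    "<note><to>Eric</to><heading>Ride Plans</heading></note>"
    "</notes>"
)

# findtext() returns the text of the first matching element (or None).
first_to = notes.findtext(".//to")

# xpath() is lxml-specific and supports predicates such as [to="Eric"].
eric_headings = notes.xpath('//note[to="Eric"]/heading/text()')

print(first_to)       # Timmy
print(eric_headings)  # ['Ride Plans']
```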

Now that we've extracted data from our XML, let's try processing HTML. First, we define an example string to work with.

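~~~ {.python}
# A page whose title and .item elements match the queries used below
example_html = """
<html>
    <head>
        <title>Example HTML Title</title>
    </head>
    <body>
        <div class="all_items">
            <p class="item">This is the first paragrah.</p>
            <p class="item">This is the second paragraph.</p>
        </div>
    </body>
</html>
"""
~~~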

Now we create the document object with `document_fromstring`:

    doc = html.document_fromstring(example_html)
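
`document_fromstring` always returns a complete document, even when the input is only a fragment or has unclosed tags. A short sketch with a made-up fragment shows this; the variable names are illustrative.

```python
from lxml import html

# A fragment with no <html> or <body>, and an unclosed <p>.
doc = html.document_fromstring("<p class='item'>Hello, <b>world</b>")

# The parser wraps the fragment in a full document, so the root is <html>.
print(doc.tag)  # html

# text_content() joins the text of an element and all of its descendants.
paragraph = doc.find(".//p")
print(paragraph.text_content())  # Hello, world
```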

Now we can start selecting data. We can use CSS selectors to find elements; this is my favorite way to find items within an HTML document.

Find the title using CSS selectors:

~~~ {.python}
title = doc.cssselect('head title')[0]
print(title.text)

# Example HTML Title
~~~

Get all the elements with the class `item` inside the `div` with the class `all_items`:

    items = doc.cssselect('div.all_items .item')
    for item in items:
        print(item.text.strip())

    # This is the first paragrah.
    # This is the second paragraph.
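
In recent lxml versions, `cssselect` relies on the separate `cssselect` package. If it is not installed, a similar query can be written with the built-in `xpath` method. The sketch below is a rough counterpart of the selector `div.all_items .item`, using made-up markup with the same class names. It assumes each element carries exactly one class.

```python
from lxml import html

# Illustrative markup using the same class names as the selector above.
doc = html.document_fromstring("""
<html><body>
  <div class="all_items">
    <p class="item">First</p>
    <p class="item">Second</p>
  </div>
</body></html>
""")

# @class="..." compares the whole attribute value, so this only matches
# elements whose class attribute is exactly "all_items" or "item".
items = doc.xpath('//div[@class="all_items"]//*[@class="item"]')

texts = [item.text.strip() for item in items]
print(texts)  # ['First', 'Second']
```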

This is just a basic demonstration of lxml, but even these simple examples show how easy the library is to use. There are many more features that we have not covered in this article. If you want to learn more, see the lxml documentation.