Friday, August 29, 2008

Parse HTML using CSS selectors

lxml is a nice library for parsing XML and HTML with Python. Its elements have a cssselect() method, so you can find nodes with CSS selectors.

Here's an example that shows how smooth it is to use.
>>> from lxml.html import parse
>>> google = open("google_se.html", "rb") # saved Google result page for "example"
>>> root = parse(google).getroot()

This one fetches all the anchor texts (truncated to 20 characters so they don't break the page).
>>> [link.text_content()[:20] for link in root.cssselect(".g h3.r a")]
['Image results for ex', 'Example (rapper) - W', 'Example - Wikipedia,', 'MySpace.com - EXAMPL', 'Dynamic Programming ', 'example - definition', 'example - Definition', "Example - I don't wa", 'XML by Example - Goo', 'Example', 'example']

This one fetches all the link destinations (also truncated to 20 characters).
>>> [link.get("href")[:20] for link in root.cssselect(".g h3.r a")]
['http://images.google', 'http://en.wikipedia.', 'http://en.wikipedia.', 'http://www.myspace.c', 'http://www.avatar.se', 'http://www.thefreedi', 'http://www.merriam-w', 'http://www.youtube.c', 'http://books.google.', 'http://www.example.o', 'http://www.docbook.o']