Monday, September 1, 2008

Hack attempts by country

I use denyhosts to block addresses that run dictionary attacks against my SSH server.

GeoIP and Python can be used to look up the country of origin of these addresses, and a few simple shell commands then generate a list of the most common countries.

$ cat geoip.py
import GeoIP, sys
gi = GeoIP.new(GeoIP.GEOIP_MEMORY_CACHE)
# Print one country name per address read from stdin ("None" if no match).
for addr in sys.stdin:
    print(gi.country_name_by_addr(addr.strip()))


$ grep ssh hosts.deny |cut -d " " -f 2 |python geoip.py |sort |uniq -c |sort -nr |head
34 China
17 None
17 Korea, Republic of
11 United States
5 United Kingdom
5 Italy
5 Brazil
4 Thailand
4 Japan
4 Germany


China wins. But please note that 17 addresses couldn't be resolved, so the margin of error is pretty large.
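
The shell part of the pipeline could also be done in Python. Here's a minimal sketch, assuming the hosts.deny lines look like "sshd: 1.2.3.4" (which is what the cut call above implies). The names extract_addresses and top_countries are made up for this example, and the lookup argument stands in for gi.country_name_by_addr. The sample addresses and country labels are placeholders, not real data.

```python
from collections import Counter


def extract_addresses(lines, service="ssh"):
    """Roughly mimic `grep ssh | cut -d " " -f 2`: yield the second field."""
    for line in lines:
        if service not in line:
            continue
        fields = line.rstrip("\n").split(" ")
        if len(fields) > 1:
            yield fields[1]


def top_countries(addresses, lookup, n=10):
    """Roughly mimic `sort | uniq -c | sort -nr | head`: most common first."""
    # str() turns a failed lookup (None) into "None", like print does.
    counts = Counter(str(lookup(addr)) for addr in addresses)
    return counts.most_common(n)


# Placeholder input and a fake lookup table instead of a real GeoIP database.
sample = ["sshd: 192.0.2.1\n", "# comment\n", "sshd: 192.0.2.2\n", "sshd: 198.51.100.7\n"]
fake_db = {"192.0.2.1": "Country A", "192.0.2.2": "Country A"}
print(top_countries(extract_addresses(sample), fake_db.get))
# [('Country A', 2), ('None', 1)]
```

Passing the lookup in as a function keeps the counting code independent of GeoIP, so it can be tried without the database installed.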

Friday, August 29, 2008

Parse HTML using CSS selectors

lxml is a nice library for parsing XML and HTML with Python. It can use CSS selectors to find nodes.

Here's an example that shows how smooth it is to use.
>>> from lxml.html import parse
>>> google = open("google_se.html") # saved google result page for "example"
>>> root = parse(google).getroot()

This one fetches all the anchor texts (truncated so they don't break the page).
>>> [link.text_content()[:20] for link in root.cssselect(".g h3.r a")]
['Image results for ex', 'Example (rapper) - W', 'Example - Wikipedia,', 'MySpace.com - EXAMPL', 'Dynamic Programming ', 'example - definition', 'example - Definition', "Example - I don't wa", 'XML by Example - Goo', 'Example', 'example']

This one fetches all the link destinations (also truncated).
>>> [link.get("href")[:20] for link in root.cssselect(".g h3.r a")]
['http://images.google', 'http://en.wikipedia.', 'http://en.wikipedia.', 'http://www.myspace.c', 'http://www.avatar.se', 'http://www.thefreedi', 'http://www.merriam-w', 'http://www.youtube.c', 'http://books.google.', 'http://www.example.o', 'http://www.docbook.o']