I use denyhosts to block addresses that run dictionary attacks against my SSH server.
GeoIP and Python can be used to look up the country of origin of these addresses, and a few simple shell commands turn the result into a list of the most common countries.
$ cat geoip.py
import GeoIP, sys
gi = GeoIP.new(GeoIP.GEOIP_MEMORY_CACHE)
for addr in sys.stdin:
    print(gi.country_name_by_addr(addr.strip()))
$ grep ssh hosts.deny |cut -d " " -f 2 |python geoip.py |sort |uniq -c |sort -nr |head
34 China
17 None
17 Korea, Republic of
11 United States
5 United Kingdom
5 Italy
5 Brazil
4 Thailand
4 Japan
4 Germany
China wins. But please note that 17 addresses couldn't be resolved, so the margin of error is pretty large.
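The counting half of that pipeline can also be done in Python. This is only a rough sketch: `top_countries` is a name I made up, the lookup function is passed in so the example runs without a GeoIP database, and the sample addresses come from the reserved documentation ranges rather than from my real hosts.deny.

```python
from collections import Counter

def top_countries(deny_lines, lookup, limit=10):
    """Count countries for the ssh entries in hosts.deny-style lines."""
    counts = Counter()
    for line in deny_lines:
        if "ssh" not in line:
            continue  # same filter as `grep ssh`
        fields = line.strip().split(" ")
        if len(fields) < 2:
            continue  # no address field on this line
        # str() turns an unresolved lookup into the string "None"
        counts[str(lookup(fields[1]))] += 1
    return counts.most_common(limit)

# Hypothetical stand-in for gi.country_name_by_addr (not real GeoIP data).
fake_db = {"192.0.2.1": "China", "192.0.2.2": "China", "198.51.100.7": "Italy"}
sample = [
    "sshd: 192.0.2.1",
    "sshd: 192.0.2.2",
    "sshd: 198.51.100.7",
    "sshd: 203.0.113.9",   # not in fake_db, counted as "None"
    "# some comment",
]
result = top_countries(sample, fake_db.get)
print(result)
```

With the real database you would pass `gi.country_name_by_addr` as `lookup` instead of `fake_db.get`.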
Monday, September 1, 2008
Friday, August 29, 2008
Parse HTML using CSS selectors
lxml is a nice library for parsing XML and HTML with Python. It can use CSS selectors to find nodes.
Here's an example that shows how smooth it is to use.
>>> from lxml.html import parse
>>> google = open("google_se.html") # saved Google result page for "example"
>>> root = parse(google).getroot()
This one fetches all the anchor texts (truncated to not break the page.)
>>> [link.text_content()[:20] for link in root.cssselect(".g h3.r a")]
['Image results for ex', 'Example (rapper) - W', 'Example - Wikipedia,', 'MySpace.com - EXAMPL', 'Dynamic Programming ', 'example - definition', 'example - Definition', "Example - I don't wa", 'XML by Example - Goo', 'Example', 'example']
This one fetches all the link destinations (also truncated.)
>>> [link.get("href")[:20] for link in root.cssselect(".g h3.r a")]
['http://images.google', 'http://en.wikipedia.', 'http://en.wikipedia.', 'http://www.myspace.c', 'http://www.avatar.se', 'http://www.thefreedi', 'http://www.merriam-w', 'http://www.youtube.c', 'http://books.google.', 'http://www.example.o', 'http://www.docbook.o']