Aggregating and searching for developers' projects in Python

Check out the new site at https://rkblog.dev.

9 October 2009

bitbucket.org, code.google.com, and github.com host a lot of open source projects made by developers around the world. If you run a website or write an article about a given technology or language, it would be nice to gather projects matching your topic from those code hosting providers. In this article I'll show Python scripts that can do that easily.

GitHub

You can use the GitHub API, which returns data in YAML format, but its search doesn't return all the results, so we have to parse the HTML search results instead:

import urllib2
from re import findall

def github_getall(term, page=1):
	opener = urllib2.build_opener()
	opener.addheaders = [('user-agent', 'Opera/9.64 (X11; Linux x86_64; U; en) Presto/2.1.1')]
	o = opener.open('http://github.com/search?type=Repositories&language=&q=%s&repo=&langOverride=&x=5&y=22&start_value=%s' % (term, str(page)))
	data = o.read()
	repos = findall(r'(?xs)<h2\s*class="title">(.*?)<a\s*href="(.*?)">(.*?)</a>(.*?)<div\s*class="description">(.*?)</div>', data)
	if len(repos) > 0:
		for i in repos:
			name = i[2].split(' / ')
			author = name[0].strip()
			title = name[1].strip()
			
			url = 'http://github.com/%s/%s/' % (author, title)
			desc = i[4].strip()
			print title
			print url
			print desc
			print


# prints the title, URL and description of each match
github_getall('django', 1)

This function fetches the search results page, parses it, and prints the title, URL, and description of each repository. If you want to get all results, use paging:

for i in range(1,10):
	try:
		github_getall('google+wave', i)
	except Exception:
		# no more pages in pagination, etc.
		print 'EXCEPTION'
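
Because github_getall prints its results, the loop above can only detect the last page through an exception. A minimal sketch of an alternative, assuming the same regular expression: split parsing from fetching so that the parser returns a list, then stop paging at the first empty page. The names parse_github_results and collect_pages and the sample markup are hypothetical and not part of the original script; the snippet avoids print so it runs on both Python 2 and Python 3.

```python
import re

# Same pattern as in github_getall; (?x) ignores whitespace in the pattern
# and (?s) lets "." match newlines.
REPO_RE = re.compile(
    r'(?xs)<h2\s*class="title">(.*?)<a\s*href="(.*?)">(.*?)</a>'
    r'(.*?)<div\s*class="description">(.*?)</div>')


def parse_github_results(html):
    """Return a list of dicts instead of printing each repository."""
    results = []
    for match in REPO_RE.findall(html):
        # The link text is expected to look like "author / title".
        author, title = [part.strip() for part in match[2].split(' / ', 1)]
        results.append({
            'title': title,
            'url': 'http://github.com/%s/%s/' % (author, title),
            'desc': match[4].strip(),
        })
    return results


def collect_pages(fetch_page, max_pages=10):
    """Call fetch_page(1), fetch_page(2), ... until a page has no results."""
    collected = []
    for page in range(1, max_pages + 1):
        results = parse_github_results(fetch_page(page))
        if not results:
            break
        collected.extend(results)
    return collected


# Hypothetical markup for offline experiments; the real page may differ.
SAMPLE = ('<h2 class="title"><a href="/django/django">django / django</a></h2>'
          '<div class="description"> The Web framework </div>')
```

With a real fetcher, fetch_page would wrap the urllib2 request from github_getall and return the HTML of the given page as a string.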

Google Code

For code.google.com we can use similar code:

import urllib2
from re import findall

def googlec(term, page=1):
	openurl = 'http://code.google.com/hosting/search?q=%s&btn=Search+projects' % term
	if page > 1:
		openurl = '%s&start=%s' % (openurl, str(page*10))
	
	opener = urllib2.build_opener()
	opener.addheaders = [('user-agent', 'Opera/9.64 (X11; Linux x86_64; U; en) Presto/2.1.1')]
	o = opener.open(openurl)
	data = o.read()
	repos = findall(r'(?xs)clk\(this,\s*([0-9]*)\)"(.*?)href="(.*?)">(.*?)\s*-\s*(.*?)</a>', data)
	if len(repos) > 0:
		for i in repos:
			url = 'http://code.google.com%s' % i[2]
			title = i[3].strip()
			desc = i[4].strip()
			
			print title
			print desc
			print url
			print
			

# prints the title, description and URL of each match
googlec('django', 1)
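
Both search URLs so far insert the term with plain string formatting, so a term containing spaces has to be escaped by hand (as in 'google+wave'). A small sketch of building the query string with urlencode instead; the helper name build_search_url is hypothetical, and the import fallback is only there so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib import urlencode          # Python 2
except ImportError:
    from urllib.parse import urlencode    # Python 3


def build_search_url(base, params):
    """Append URL-encoded query parameters to base.

    params is a list of (name, value) pairs so the order stays stable.
    """
    return '%s?%s' % (base, urlencode(params))


# Spaces become "+" and reserved characters are percent-encoded.
url = build_search_url('http://code.google.com/hosting/search',
                       [('q', 'google wave'), ('btn', 'Search projects')])
```

Passing the raw term ('google wave') and letting urlencode escape it avoids double-encoding mistakes when the same term is reused for several providers.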

Bitbucket

For Bitbucket, this will work nicely:

import urllib2
from re import findall

def bitbucket(term, page=1):
	opener = urllib2.build_opener()
	opener.addheaders = [('user-agent', 'Opera/9.64 (X11; Linux x86_64; U; en) Presto/2.1.1')]
	o = opener.open('http://bitbucket.org/repo/all/popular/%s/?name=%s' % (str(page), term))
	data = o.read()
	repos = findall(r'(?xs)<span><a\s*href="(.*?)">(.*?)</a>\s*/\s*<a\s*href="(.*?)">(.*?)</a></span><br\s*/>(.*?)(.*?).<br\s*/>', data)
	if len(repos) > 0:
		for i in repos:
			author = i[1]
			url = 'http://bitbucket.org%s' % i[2]
			title = i[3]
			desc = i[5].strip()
			
			print title
			print author
			print url
			print desc
			print
			

# prints the title, author, URL and description of each match
bitbucket('django', 1)
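
All three functions repeat the same opener setup. A sketch of factoring it into one helper, assuming the same user-agent string; make_opener and fetch are hypothetical names, and the import fallback only exists so the snippet runs on both Python 2 and Python 3.

```python
try:
    import urllib2 as request             # Python 2
except ImportError:
    import urllib.request as request      # Python 3

USER_AGENT = 'Opera/9.64 (X11; Linux x86_64; U; en) Presto/2.1.1'


def make_opener():
    """Build an opener that sends the browser-like user-agent header."""
    opener = request.build_opener()
    opener.addheaders = [('user-agent', USER_AGENT)]
    return opener


def fetch(url, opener=None):
    """Download url and return the raw response body.

    On Python 3 the body is bytes and has to be decoded before it can be
    matched against str patterns.
    """
    opener = opener or make_opener()
    response = opener.open(url)
    try:
        return response.read()
    finally:
        response.close()
```

Each scraper could then call fetch(url) and keep only its own URL template and regular expression.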

I used the code presented in this article to build a small Django app, Projects, that aggregates projects for given tags.
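
As an illustration only (this is not the code of the Projects app), aggregation for a set of tags could be sketched as merging the records found by several search functions and dropping duplicate URLs. The sketch assumes each search function returns a list of dicts with a 'url' key, unlike the printing functions in this article; aggregate and the fake_* providers are hypothetical.

```python
def aggregate(tags, providers):
    """Map each tag to the de-duplicated projects found by all providers.

    providers is a list of callables that take a search term and return
    a list of dicts, each containing at least a 'url' key.
    """
    by_tag = {}
    for tag in tags:
        seen = set()
        projects = []
        for search in providers:
            for project in search(tag):
                # Keep only the first record for any given URL.
                if project['url'] not in seen:
                    seen.add(project['url'])
                    projects.append(project)
        by_tag[tag] = projects
    return by_tag


# Stand-in providers so the sketch runs without network access.
def fake_github(term):
    return [{'title': term, 'url': 'http://github.com/a/%s/' % term}]


def fake_bitbucket(term):
    return [{'title': term, 'url': 'http://bitbucket.org/b/%s/' % term},
            {'title': term, 'url': 'http://github.com/a/%s/' % term}]


result = aggregate(['django'], [fake_github, fake_bitbucket])
```

Here result['django'] holds two projects, because the duplicate GitHub URL returned by the second provider is skipped.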

RkBlog

Python programming, 9 October 2009
