Aggregating and searching for developers' projects in Python
Check out the new site at https://rkblog.dev.
9 October 2009
bitbucket.org, code.google.com, and github.com host a lot of open source projects made by developers around the world. If you are making a website or writing an article about a given technology or language, it would be nice to gather projects matching your topic from those code hosting providers. In this article I'll show Python scripts that can do that easily.
GitHub
You can use the GitHub API, which returns data in YAML format, but for search you can't get all the results through it, so we have to parse the HTML search pages like this:

import urllib2
from re import findall


def github_getall(term, page=1):
    opener = urllib2.build_opener()
    opener.addheaders = [('user-agent', 'Opera/9.64 (X11; Linux x86_64; U; en) Presto/2.1.1')]
    o = opener.open('http://github.com/search?type=Repositories&language=&q=%s&repo=&langOverride=&x=5&y=22&start_value=%s' % (term, str(page)))
    data = o.read()
    repos = findall(r'(?xs)<h2\s*class="title">(.*?)<a\s*href="(.*?)">(.*?)</a>(.*?)<div\s*class="description">(.*?)</div>', data)
    if len(repos) > 0:
        for i in repos:
            # the link text looks like "author / repository"
            name = i[2].split(' / ')
            author = name[0].strip()
            title = name[1].strip()
            url = 'http://github.com/%s/%s/' % (author, title)
            desc = i[4].strip()
            print title
            print url
            print desc
            print
github_getall('django', 1)

for i in range(1, 10):
    try:
        github_getall('google+wave', i)
    except:
        # no more pages in pagination, etc.
        print 'EXCEPTION'
        break
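Printing inside the function makes it hard to reuse the results. As a rough sketch in modern Python 3, the same kind of regex extraction can return structured tuples instead; the HTML snippet and the parse_repos helper below are my own illustration of the idea, not real GitHub markup:

```python
import re

# Hypothetical sample of the kind of markup the scraper above expects:
sample = '''
<h2 class="title"><a href="/django/django">django / django</a></h2>
<div class="description">The Web framework for perfectionists.</div>
'''

def parse_repos(html):
    """Return (author, title, description) tuples from search-result HTML."""
    pattern = (r'(?s)<h2\s*class="title">.*?<a\s*href=".*?">(.*?)</a>'
               r'.*?<div\s*class="description">(.*?)</div>')
    results = []
    for link_text, desc in re.findall(pattern, html):
        # the link text looks like "author / repository"
        author, title = [part.strip() for part in link_text.split(' / ')]
        results.append((author, title, desc.strip()))
    return results

print(parse_repos(sample))
```

With a list of tuples you can sort, filter, or store the projects instead of only printing them.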
Google Code
For code.google.com we can use similar code:

import urllib2
from re import findall


def googlec(term, page=1):
    openurl = 'http://code.google.com/hosting/search?q=%s&btn=Search+projects' % term
    if page > 1:
        openurl = '%s&start=%s' % (openurl, str(page * 10))
    opener = urllib2.build_opener()
    opener.addheaders = [('user-agent', 'Opera/9.64 (X11; Linux x86_64; U; en) Presto/2.1.1')]
    o = opener.open(openurl)
    data = o.read()
    repos = findall(r'(?xs)clk\(this,\s*([0-9]*)\)"(.*?)href="(.*?)">(.*?)\s*-\s*(.*?)</a>', data)
    if len(repos) > 0:
        for i in repos:
            url = 'http://code.google.com%s' % i[2]
            title = i[3].strip()
            desc = i[4].strip()
            print title
            print desc
            print url
            print

googlec('django', 1)
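The script above interpolates the search term straight into the URL string; for terms containing spaces or special characters it is safer to URL-encode the query. A minimal Python 3 sketch of building the same search URL (the google_code_url helper name is my own; it mirrors the article's page*10 offset):

```python
from urllib.parse import urlencode

def google_code_url(term, page=1):
    # Build the search URL, URL-encoding the query term.
    params = {'q': term, 'btn': 'Search projects'}
    if page > 1:
        # mirrors the article's offset scheme: page * 10
        params['start'] = page * 10
    return 'http://code.google.com/hosting/search?' + urlencode(params)

print(google_code_url('google wave', 2))
```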
Bitbucket
For Bitbucket this will work nicely:

import urllib2
from re import findall


def bitbucket(term, page=1):
    opener = urllib2.build_opener()
    opener.addheaders = [('user-agent', 'Opera/9.64 (X11; Linux x86_64; U; en) Presto/2.1.1')]
    o = opener.open('http://bitbucket.org/repo/all/popular/%s/?name=%s' % (str(page), term))
    data = o.read()
    repos = findall(r'(?xs)<span><a\s*href="(.*?)">(.*?)</a>\s*/\s*<a\s*href="(.*?)">(.*?)</a></span><br\s*/>(.*?) (.*?).<br\s*/>', data)
    if len(repos) > 0:
        for i in repos:
            author = i[1]
            url = 'http://bitbucket.org%s' % i[2]
            title = i[3]
            desc = i[5].strip()
            print title
            print author
            print url
            print desc
            print

bitbucket('django', 1)
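If each scraper is changed to return a list of dicts instead of printing, the per-host results can be merged into one de-duplicated feed. A minimal sketch; the dict shape with 'title' and 'url' keys is my own assumption for illustration:

```python
def aggregate(*result_lists):
    """Merge project lists from several hosts, de-duplicating by URL."""
    seen = set()
    merged = []
    for results in result_lists:
        for project in results:
            if project['url'] not in seen:
                seen.add(project['url'])
                merged.append(project)
    return merged

# Hypothetical per-host results with one duplicate entry:
github_results = [{'title': 'django', 'url': 'http://github.com/django/django/'}]
bitbucket_results = [{'title': 'django', 'url': 'http://bitbucket.org/django/django/'},
                     {'title': 'django', 'url': 'http://github.com/django/django/'}]
print(aggregate(github_results, bitbucket_results))
```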
I've used the code presented in this article to make a small Django app - Projects - that aggregates projects for given tags.