Harvesting data from websites using WebKit and PyQt4 - part 2

Check out the new site at https://rkblog.dev.

10 November 2009 Comments

In this tutorial we will create parsers that will extract ad urls from the parsed page source (DOM tree). For flash ads it's common to use a "clickTag" variable passed to the flash ad, which holds the URL on which the flash add will redirect. In case of advertising companies that use text or images as ads in their JS widgets we have to analyze the parsed HTML code to find a pattern allowing regex extraction of URLs. For example AdTaily.com widget puts HTML code like this:

<a style="position: relative; font-weight: normal; text-align: left; background-image: none; background-repeat: initial; background-attachment: initial; -webkit-background-clip: initial; -webkit-background-origin: initial; background-color: initial; padding-left: 0px; padding-right: 0px; padding-top: 0px; padding-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; margin-bottom: 10px; display: block; width: 125px; height: 125px; background-position: initial initial; " href="http://www.megiteam.pl/" title="Hosting nowych technologii" rel="nofollow" target="_blank">

Very unique set of styles means we can easily extract the url (urls only from adtaily widget)

AdTaily Parser

In the folder where my app is located I've created parsers.py for the parsers.

Below is the AdTaily parser - it get the full HTML of a webpage, and regex data out of it:

# -*- coding: utf-8 -*-
from re import findall

def get_adtaily(html):
	"""
	Extract data from adtaily widgets
	"""
	links = findall(r'background-position: initial initial; " href="(.*?)"', unicode(html))
	ret = []
	for i in links:
		if not i.startswith('http://www.adtaily'):
			ret.append(i)
	return ret

Import parsers in run.py and modify loadFinished method:

def loadFinished(self):
		"""
		A page was loaded - get the data and load next page
		"""
		page = self.ui.webView.page()
		frame = page.currentFrame()
		content = frame.toHtml()
		print u'Page content, got %s bytes' % len(content)
		
		links = get_adtaily(content)
		
		print links
		if self.currentRefresh < self.refreshSite:
			print 'Refresh +1'
			self.currentRefresh += 1
		else:
			print 'Index +1'
			self.currentRefresh = 0
			self.currentIndex += 1
		
		nexturl = self.__getNextUrl()
		if nexturl:
			self.ui.webView.load(nexturl)

We pass the rendered HTML code of a loaded page to get_adtaily and print the list of URLs it found. If you run the app on a webpage that cointains AdTaily widget (like www.python.rk.edu.pl) then you should see urls of ads from this widget.

Flash ads - clickTag

For flash ads "clickTag" (and "click", "gaadlink" etc.) is used to pass the redirect URL (through click counting script that redirects to the final site). The embed tag may look like this:

<embed height="200" width="750" name="td_flash" id="td_flash" wmode="opaque" swliveconnect="true" pluginspage="http://www.macromedia.com/shockwave/download/index.cgi?P1_Prod_Version=ShockwaveFlash" type="application/x-shockwave-flash" quality="best" menu="false" flashvars="clickTAG=http%3A%2F%2Fclk.tradedoubler.com%2Fclick%3Fp%3D74000%26a%3D1545102%26g%3D17957364%26pools%3D305907%2C282487&amp;CLICKTAG=http%3A%2F%2Fclk.tradedoubler.com%2Fclick%3Fp%3D74000%26a%3D1545102%26g%3D17957364%26pools%3D305907%2C282487&amp;clicktag=http%3A%2F%2Fclk.tradedoubler.com%2Fclick%3Fp%3D74000%26a%3D1545102%26g%3D17957364%26pools%3D305907%2C282487&amp;clickTag=http%3A%2F%2Fclk.tradedoubler.com%2Fclick%3Fp%3D74000%26a%3D1545102%26g%3D17957364%26pools%3D305907%2C282487&amp;ClickTag=http%3A%2F%2Fclk.tradedoubler.com%2Fclick%3Fp%3D74000%26a%3D1545102%26g%3D17957364%26pools%3D305907%2C282487" src="http://ads.open.pl/kreacje/2009/05_09/doplata/google_750x200_z_doplata.swf"/>

Parser for "clickTag" will look like this:

def get_clickTag(html):
	"""
	Extract data from flash ads
	"""
	links = findall(r'clicktag=(.*?)"', unicode(html).lower())
	ret = []
	for i in links:
		ret.append(unquote(i))
	return ret

You have to check how URLs are passed to flash ads on sites which you want to harvest. (unquote is from urllib)

Saving data to a database

In Qt databases are handled with QtSql component. We can use it in PyQt4, but the API won't be compatible with the standard DB API for Python modules.

Import the module:
from PyQt4.QtSql import *

Connect to the database in __init__:

self.db = QSqlDatabase.addDatabase("QSQLITE")
		self.db.setDatabaseName("ads")
		self.dbstatus = self.db.open()
		if self.dbstatus:
			print 'DB ok'
		else:
			print 'DB error'

We select the driver (SQLite), and next we specify database name and connect.

We also need a table for the data. This one will be ok:

CREATE TABLE "ads_data" (
    "id" integer NOT NULL PRIMARY KEY,
    "link" varchar(255) NOT NULL,
    "dest_title" varchar(255) NULL,
    "dest_url" varchar(255) NULL,
    "date" datetime NOT NULL,
    "source" varchar(255) NOT NULL,
    "is_parsed" bool NULL
);

You can create this table using command line tools for SQLite:

sqlite3 ./ads

We have the database, so start inserting data:

def loadFinished(self):
		"""
		A page was loaded - get the data and load next page
		"""
		page = self.ui.webView.page()
		frame = page.currentFrame()
		content = frame.toHtml()
		print u'Page content, got %s bytes' % len(content)
		
		links = get_adtaily(content)
		links2 = get_clickTag(content)
		
		for i in links2:
			links.append(i)
		
		query = QSqlQuery(self.db)
		try:
			source = self.sites[self.currentIndex]['site']
		except:
			return
		
		# insert every link
		for link in links:
			qry = "INSERT INTO ads_data ('link', 'date', 'source', 'is_parsed') VALUES ('%s', '%s', '%s', 0);" % (link, date.today(), source)
			if query.exec_(qry):
				print u'INSERT'
			else:
				print 'Insert Error'
				print qry
				print query.lastError().text()
				print
		
		if self.currentRefresh < self.refreshSite:
			print 'Refresh +1'
			self.currentRefresh += 1
		else:
			print 'Index +1'
			self.currentRefresh = 0
			self.currentIndex += 1
		
		nexturl = self.__getNextUrl()
		if nexturl:
			self.ui.webView.load(nexturl)

We create object query = QSqlQuery(self.db) and execute a query using exec_ method. The URL will be saved in the DB, but if an error occur - we will see the error message.

Our example can be improved - by blocking duplicate links for current day:

def loadFinished(self):
		"""
		A page was loaded - get the data and load next page
		"""
		page = self.ui.webView.page()
		frame = page.currentFrame()
		content = frame.toHtml()
		print u'Page content, got %s bytes' % len(content)
		
		links = get_adtaily(content)
		links2 = get_clickTag(content)
		
		for i in links2:
			links.append(i)
		
		query = QSqlQuery(self.db)
		try:
			source = self.sites[self.currentIndex]['site']
		except:
			# no more sites
			return
		
		for link in links:
			cnt = 0
			if query.exec_("SELECT COUNT(*) FROM ads_data WHERE date = '%s' AND source = '%s' AND link = '%s'" % (date.today(), source, link)):
				query.next()
				cnt = query.value(0).toInt()[0]
				print cnt
			
			if cnt < 1:
				qry = "INSERT INTO ads_data ('link', 'date', 'source', 'is_parsed') VALUES ('%s', '%s', '%s', 0);" % (link, date.today(), source)
				if query.exec_(qry):
					print u'INSERT'
				else:
					print 'Insert Error'
					print qry
					print query.lastError().text()
					print
			else:
				print 'pass'
		
		if self.currentRefresh < self.refreshSite:
			print 'Refresh +1'
			self.currentRefresh += 1
		else:
			print 'Index +1'
			self.currentRefresh = 0
			self.currentIndex += 1
		
		nexturl = self.__getNextUrl()
		if nexturl:
			self.ui.webView.load(nexturl)

In this snippet we selected some data. To iterate over data use query.next(), which will return True if it got another row (while query.next() for many rows). In our case we select one row. Method query.value(index) will return result data as QVariant object, so we have to make it a string or int.

In third part we will create second app, that will load saved urls and update their entries with final site title and url.

Source code

Download ZIP

RkBlog

PyQt and GUI, 10 November 2009

Check out the new site at https://rkblog.dev.