Harvesting data from websites using WebKit and PyQt4 - part 2
Check out the new site at https://rkblog.dev.
10 November 2009
In this tutorial we will create parsers that extract ad URLs from the rendered page source (the DOM tree). Flash ads commonly receive a "clickTag" variable that holds the URL the ad redirects to. Advertising companies that show text or image ads in their JS widgets need a different approach: we have to analyze the rendered HTML and find a pattern that allows regex extraction of the URLs. For example, the AdTaily.com widget produces HTML like this:
<a style="position: relative; font-weight: normal; text-align: left; background-image: none; background-repeat: initial; background-attachment: initial; -webkit-background-clip: initial; -webkit-background-origin: initial; background-color: initial; padding-left: 0px; padding-right: 0px; padding-top: 0px; padding-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px; margin-bottom: 10px; display: block; width: 125px; height: 125px; background-position: initial initial; " href="http://www.megiteam.pl/" title="Hosting nowych technologii" rel="nofollow" target="_blank">
AdTaily Parser
- In the folder where my app is located I've created parsers.py for the parsers.
- Below is the AdTaily parser. It takes the full HTML of a webpage and extracts the ad links with a regular expression:
# -*- coding: utf-8 -*-
from re import findall

def get_adtaily(html):
    """
    Extract data from adtaily widgets
    """
    links = findall(r'background-position: initial initial; " href="(.*?)"', unicode(html))
    ret = []
    for i in links:
        if not i.startswith('http://www.adtaily'):
            ret.append(i)
    return ret
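To see what the pattern actually captures, here is a minimal standalone sketch. It is written for Python 3 (so no unicode() call), and SAMPLE_HTML is a shortened copy of the AdTaily anchor shown above. The names extract_adtaily_links and ADTAILY_HREF are made up for this example and are not part of the app.

```python
import re

# Assumed sample: a shortened version of the AdTaily anchor shown earlier.
SAMPLE_HTML = (
    '<a style="display: block; width: 125px; height: 125px; '
    'background-position: initial initial; " '
    'href="http://www.megiteam.pl/" title="Hosting nowych technologii" '
    'rel="nofollow" target="_blank">'
)

# Same pattern as get_adtaily: the href that directly follows the end of the style.
ADTAILY_HREF = re.compile(r'background-position: initial initial; " href="(.*?)"')


def extract_adtaily_links(html):
    """Return ad URLs, skipping links that point back to AdTaily itself."""
    return [
        url for url in ADTAILY_HREF.findall(html)
        if not url.startswith('http://www.adtaily')
    ]


print(extract_adtaily_links(SAMPLE_HTML))  # prints ['http://www.megiteam.pl/']
```

The pattern depends on the exact inline style that the widget generates, so it will probably need updating whenever the widget markup changes.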
- Import the parsers in run.py and modify the loadFinished method:
def loadFinished(self):
    """
    A page was loaded - get the data and load next page
    """
    page = self.ui.webView.page()
    frame = page.currentFrame()
    content = frame.toHtml()
    print u'Page content, got %s bytes' % len(content)

    links = get_adtaily(content)
    print links

    if self.currentRefresh < self.refreshSite:
        print 'Refresh +1'
        self.currentRefresh += 1
    else:
        print 'Index +1'
        self.currentRefresh = 0
        self.currentIndex += 1

    nexturl = self.__getNextUrl()
    if nexturl:
        self.ui.webView.load(nexturl)
- We pass the rendered HTML of the loaded page to get_adtaily and print the list of URLs it finds. If you run the app on a webpage that contains an AdTaily widget (like www.python.rk.edu.pl), you should see the URLs of the ads from this widget.
Flash ads - clickTag
- Flash ads receive the redirect URL in a "clickTag" variable (or "click", "gaadlink", etc.). The URL usually points to a click-counting script that then redirects to the final site. The embed tag may look like this:
<embed height="200" width="750" name="td_flash" id="td_flash" wmode="opaque" swliveconnect="true" pluginspage="http://www.macromedia.com/shockwave/download/index.cgi?P1_Prod_Version=ShockwaveFlash" type="application/x-shockwave-flash" quality="best" menu="false" flashvars="clickTAG=http%3A%2F%2Fclk.tradedoubler.com%2Fclick%3Fp%3D74000%26a%3D1545102%26g%3D17957364%26pools%3D305907%2C282487&CLICKTAG=http%3A%2F%2Fclk.tradedoubler.com%2Fclick%3Fp%3D74000%26a%3D1545102%26g%3D17957364%26pools%3D305907%2C282487&clicktag=http%3A%2F%2Fclk.tradedoubler.com%2Fclick%3Fp%3D74000%26a%3D1545102%26g%3D17957364%26pools%3D305907%2C282487&clickTag=http%3A%2F%2Fclk.tradedoubler.com%2Fclick%3Fp%3D74000%26a%3D1545102%26g%3D17957364%26pools%3D305907%2C282487&ClickTag=http%3A%2F%2Fclk.tradedoubler.com%2Fclick%3Fp%3D74000%26a%3D1545102%26g%3D17957364%26pools%3D305907%2C282487" src="http://ads.open.pl/kreacje/2009/05_09/doplata/google_750x200_z_doplata.swf"/>
- The parser for "clickTag" looks like this:
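from re import findall, IGNORECASE
from urllib import unquote

def get_clickTag(html):
    """
    Extract data from flash ads
    """
    # match any spelling of clickTag; a value ends at the next & or at the closing quote
    links = findall(r'clicktag=(.*?)[&"]', unicode(html), IGNORECASE)
    ret = []
    for i in links:
        url = unquote(i)
        # the same URL is often passed under several spellings of the name
        if url not in ret:
            ret.append(url)
    return ret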
- Check how URLs are passed to the flash ads on the sites you want to harvest, since the variable name differs between ad networks. The unquote function comes from urllib.
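If you already have the flashvars value on its own, it can also be treated as a query string instead of being matched with a regex. The sketch below uses Python 3's urllib.parse and assumes the value has been extracted from the embed tag and HTML-unescaped beforehand. FLASHVARS is a shortened sample in the style of the tag above, and clicktag_urls is a name invented for this example.

```python
from urllib.parse import parse_qs

# Assumed sample: a shortened flashvars value in the style of the embed tag above.
FLASHVARS = (
    'clickTAG=http%3A%2F%2Fclk.tradedoubler.com%2Fclick%3Fp%3D74000%26a%3D1545102'
    '&clickTag=http%3A%2F%2Fclk.tradedoubler.com%2Fclick%3Fp%3D74000%26a%3D1545102'
)


def clicktag_urls(flashvars):
    """Return the distinct decoded URLs stored under any spelling of clickTag."""
    urls = []
    # parse_qs splits on & and decodes the percent-encoding of each value
    for name, values in parse_qs(flashvars).items():
        if name.lower() == 'clicktag':
            for value in values:
                if value not in urls:
                    urls.append(value)
    return urls


print(clicktag_urls(FLASHVARS))
# prints ['http://clk.tradedoubler.com/click?p=74000&a=1545102']
```

Only the variable name is compared case-insensitively here, so the URL itself keeps its original letter case.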
Saving data to a database
In Qt, databases are handled by the QtSql component. We can use it in PyQt4, but its API is not compatible with the standard Python DB API.
- Import the module:
from PyQt4.QtSql import *
- Connect to the database in __init__:
We select the driver (SQLite), then set the database name and open the connection.
self.db = QSqlDatabase.addDatabase("QSQLITE")
self.db.setDatabaseName("ads")
self.dbstatus = self.db.open()
if self.dbstatus:
    print 'DB ok'
else:
    print 'DB error'
- We also need a table for the data. This one will do:
CREATE TABLE "ads_data" (
    "id" integer NOT NULL PRIMARY KEY,
    "link" varchar(255) NOT NULL,
    "dest_title" varchar(255) NULL,
    "dest_url" varchar(255) NULL,
    "date" datetime NOT NULL,
    "source" varchar(255) NOT NULL,
    "is_parsed" bool NULL
);
You can create this table with the SQLite command line tool:
sqlite3 ./ads
- We have the database, so we can start inserting data:
We create a query object with query = QSqlQuery(self.db) and execute a statement with its exec_ method. Each URL is saved in the database; if an error occurs, the error message is printed.
# needs "from datetime import date" at the top of run.py
def loadFinished(self):
    """
    A page was loaded - get the data and load next page
    """
    page = self.ui.webView.page()
    frame = page.currentFrame()
    content = frame.toHtml()
    print u'Page content, got %s bytes' % len(content)

    links = get_adtaily(content)
    links2 = get_clickTag(content)
    for i in links2:
        links.append(i)

    query = QSqlQuery(self.db)
    try:
        source = self.sites[self.currentIndex]['site']
    except (IndexError, KeyError):
        # no more sites
        return

    # insert every link
    for link in links:
        qry = "INSERT INTO ads_data ('link', 'date', 'source', 'is_parsed') VALUES ('%s', '%s', '%s', 0);" % (link, date.today(), source)
        if query.exec_(qry):
            print u'INSERT'
        else:
            print 'Insert Error'
            print qry
            print query.lastError().text()
            print

    if self.currentRefresh < self.refreshSite:
        print 'Refresh +1'
        self.currentRefresh += 1
    else:
        print 'Index +1'
        self.currentRefresh = 0
        self.currentIndex += 1

    nexturl = self.__getNextUrl()
    if nexturl:
        self.ui.webView.load(nexturl)
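One thing to watch in the INSERT above: the values are pasted into the SQL text with %, so a URL that contains an apostrophe will break the statement. QSqlQuery offers prepare() and addBindValue() for bound parameters. To keep the example runnable without Qt, the sketch below shows the same idea with Python 3's standard sqlite3 module, which is a swapped-in library and not what the app uses. The save_links name, the in-memory database and the example.com URL are assumptions made for the demo.

```python
import sqlite3
from datetime import date

# Throwaway in-memory database with the same table as in the article.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE ads_data ('
    ' id integer NOT NULL PRIMARY KEY,'
    ' link varchar(255) NOT NULL,'
    ' dest_title varchar(255) NULL,'
    ' dest_url varchar(255) NULL,'
    ' date datetime NOT NULL,'
    ' source varchar(255) NOT NULL,'
    ' is_parsed bool NULL)'
)


def save_links(conn, links, source, day):
    """Insert one row per link; the ? placeholders keep quotes in URLs harmless."""
    conn.executemany(
        'INSERT INTO ads_data (link, date, source, is_parsed) VALUES (?, ?, ?, 0)',
        [(link, day.isoformat(), source) for link in links],
    )
    conn.commit()


# Hypothetical URL containing an apostrophe, which would break a %-formatted query.
save_links(conn, ["http://example.com/?q=it's"], 'www.python.rk.edu.pl', date(2009, 11, 10))
rows = conn.execute('SELECT link, date, source, is_parsed FROM ads_data').fetchall()
print(rows)
```

The database driver passes the values separately from the SQL text, so no manual escaping is needed.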
- Our example can be improved by skipping links that were already saved for the current day:
The snippet below runs a SELECT before each INSERT. To iterate over the result use query.next(), which returns True as long as it has moved to another row (use while query.next() when there are many rows). In our case we select a single row. The query.value(index) method returns the value as a QVariant object, so we have to convert it to a string or an int. toInt() returns a (value, ok) tuple, hence the [0].
# needs "from datetime import date" at the top of run.py
def loadFinished(self):
    """
    A page was loaded - get the data and load next page
    """
    page = self.ui.webView.page()
    frame = page.currentFrame()
    content = frame.toHtml()
    print u'Page content, got %s bytes' % len(content)

    links = get_adtaily(content)
    links2 = get_clickTag(content)
    for i in links2:
        links.append(i)

    query = QSqlQuery(self.db)
    try:
        source = self.sites[self.currentIndex]['site']
    except (IndexError, KeyError):
        # no more sites
        return

    for link in links:
        # count rows already saved today for this source and link
        cnt = 0
        if query.exec_("SELECT COUNT(*) FROM ads_data WHERE date = '%s' AND source = '%s' AND link = '%s'" % (date.today(), source, link)):
            query.next()
            cnt = query.value(0).toInt()[0]
        print cnt
        if cnt < 1:
            qry = "INSERT INTO ads_data ('link', 'date', 'source', 'is_parsed') VALUES ('%s', '%s', '%s', 0);" % (link, date.today(), source)
            if query.exec_(qry):
                print u'INSERT'
            else:
                print 'Insert Error'
                print qry
                print query.lastError().text()
                print
        else:
            print 'pass'

    if self.currentRefresh < self.refreshSite:
        print 'Refresh +1'
        self.currentRefresh += 1
    else:
        print 'Index +1'
        self.currentRefresh = 0
        self.currentIndex += 1

    nexturl = self.__getNextUrl()
    if nexturl:
        self.ui.webView.load(nexturl)
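The count-then-insert logic is easy to try outside the Qt app. Here is a small sketch of the same check, again using Python 3's standard sqlite3 module instead of QtSql. The table is a simplified version of ads_data, and insert_once_per_day is a name made up for this example.

```python
import sqlite3

# Throwaway in-memory database; simplified version of the ads_data table.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE ads_data ('
    ' id integer PRIMARY KEY,'
    ' link text NOT NULL,'
    ' date text NOT NULL,'
    ' source text NOT NULL,'
    ' is_parsed bool)'
)


def insert_once_per_day(conn, link, source, day):
    """Insert the link unless this source already has it for the given day."""
    (count,) = conn.execute(
        'SELECT COUNT(*) FROM ads_data WHERE date = ? AND source = ? AND link = ?',
        (day, source, link),
    ).fetchone()
    if count:
        return False  # duplicate for this day, nothing inserted
    conn.execute(
        'INSERT INTO ads_data (link, date, source, is_parsed) VALUES (?, ?, ?, 0)',
        (link, day, source),
    )
    conn.commit()
    return True


results = [
    insert_once_per_day(conn, 'http://www.megiteam.pl/', 'www.python.rk.edu.pl', day)
    for day in ('2009-11-10', '2009-11-10', '2009-11-11')
]
print(results)  # prints [True, False, True]
```

The second call is skipped because the same link was already stored for that day, while the third goes through because the date differs.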
- In the third part we will create a second app that loads the saved URLs and updates their entries with the final site title and URL.