Harvesting data from websites using WebKit and PyQt4 - part 1

Check out the new site at https://rkblog.dev.

Qt's WebKit integration lets developers build applications and tools that harvest data from complex web pages, or manipulate the data on them. For example, if we look at the plain source code of a website, we will see that all ads, or most of them (Flash, Google, others), are inserted by JavaScript code. We can't tell what the ad contains or where it links to. But if you inspect the page in, for example, Firebug for Firefox, you will see that the JavaScript has been replaced with the HTML code of the ad. Using WebKit in PyQt4 we can write an app that collects data on all ads on a web page and parses it for the marketing guys.
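To make the difference between the plain source and the rendered page concrete, here is a minimal sketch in plain Python (no Qt). Both HTML fragments and the ads.example.com addresses are made up for illustration, and the LinkCollector class and collect_links helper are hypothetical names, not part of the app built in this article:

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag found in the HTML fed to it."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)


def collect_links(html):
    """Hypothetical helper: return all link targets found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up fragments: what the server sends vs. what the browser ends up with
raw_source = ('<div id="ad"><script src="http://ads.example.com/show.js">'
              '</script></div>')
rendered = ('<div id="ad"><a href="http://ads.example.com/click?id=1">'
            '<img src="http://ads.example.com/banner.png"></a></div>')

raw_links = collect_links(raw_source)        # the script tag hides the link
rendered_links = collect_links(rendered)     # the rendered DOM exposes it
```

The raw source yields no links at all, while the rendered fragment exposes the ad's target. That is why the app needs a real browser engine and not just an HTTP download.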

Plan

We start with an app that will collect links from ads and save them to a SQLite database:
  • Load every page from a list N times
  • After each load, extract the links from the ads and save them to the database
  • Various ad formats handled by a simple plugin system
The next stage will be an application that opens the ad URLs saved in the DB and updates their entries with the final page title and URL.
PyQt4 for Windows ships with only the SQLite driver. If you wish to use another DB on Windows, you will have to use a Python driver for it or compile PyQt4 on Windows with that driver.
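As a rough sketch of the storage side of this plan, the snippet below uses Python's standard sqlite3 module instead of the Qt SQL driver. The table name, the column names and the example URLs are assumptions made for illustration; the article does not define a schema here:

```python
import sqlite3


def open_db(path=':memory:'):
    """Open the database and create the (hypothetical) ads table if missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS ads ('
        'id INTEGER PRIMARY KEY AUTOINCREMENT, '
        'site TEXT NOT NULL, '   # site "name" the ad was found on
        'url TEXT NOT NULL, '    # link extracted from the ad
        'final_url TEXT, '       # to be filled in by the second stage
        'final_title TEXT)'      # to be filled in by the second stage
    )
    return conn


def save_ad_links(conn, site, urls):
    """Store every ad URL found on one page load together with its site."""
    conn.executemany('INSERT INTO ads (site, url) VALUES (?, ?)',
                     [(site, url) for url in urls])
    conn.commit()


conn = open_db()
save_ad_links(conn, 'gazeta.pl', ['http://ads.example.com/click?id=1',
                                  'http://ads.example.com/click?id=2'])
rows = conn.execute('SELECT site, url FROM ads ORDER BY id').fetchall()
```

Passing a file name instead of ':memory:' would keep the collected links between runs, which the second stage would need.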

Gathering data

The run of the gathering app will look like this:
  • We click "start" and the app loads the first web page from the list
  • It reloads the page N times
  • On every reload it saves to the DB all ad links and the site on which each ad was discovered
  • Then it continues with the next webpage on the list
So we launch Qt Designer and design a simple UI, like this:
[Screenshot: the application UI designed in Qt Designer]
We have a "Start" button (QPushButton) and two QProgressBars: one tracking how many web pages have been loaded, and the second one acting as a refresh "counter". We also have the main element of the app, a QWebView that will load the pages. The labels are shown on the screenshot. Once we have the UI file, we generate a Python class from it:
pyuic4 gather.ui > gather.py
The next step is a run.py file with skeleton code that runs the application:
# -*- coding: utf-8 -*-
import sys

from PyQt4 import QtCore, QtGui
from gather import Ui_gatherer

class GatherAds(QtGui.QMainWindow):
	def __init__(self, parent=None):
		QtGui.QMainWindow.__init__(self, parent)
		self.ui = Ui_gatherer()
		self.ui.setupUi(self)
		# refresh counter
		self.refreshSite = 5
		
		# list of pages to load, Polish :)
		self.sites = [{'url': 'http://auth.gazeta.pl/googleAuth/login.do', 'site': 'gazeta.pl'},
		{'url': 'http://kobieta.gazeta.pl/kobieta/0,0.html', 'site': 'gazeta.pl'},
		{'url': 'http://kobieta.wp.pl', 'site': 'wp.pl'},
		]
	
if __name__ == "__main__":
	app = QtGui.QApplication(sys.argv)
	myapp = GatherAds()
	myapp.show()
	sys.exit(app.exec_())

This code shows the app window, but the app doesn't do anything yet. It also holds the list of web pages to load: self.sites is a list of dictionaries, each containing url (the address to load) and site (the site "name", which will be saved along with the ad URL).

We have to implement chained page loading: the next page is loaded once the current page has finished loading, like this:
# -*- coding: utf-8 -*-
import sys

from PyQt4 import QtCore, QtGui
from gather import Ui_gatherer

class GatherAds(QtGui.QMainWindow):
	def __init__(self, parent=None):
		QtGui.QMainWindow.__init__(self, parent)
		self.ui = Ui_gatherer()
		self.ui.setupUi(self)
		self.refreshSite = 5
		
		self.currentIndex = 0
		self.sites = [{'url': 'http://auth.gazeta.pl/googleAuth/login.do', 'site': 'gazeta.pl'},
		{'url': 'http://kobieta.gazeta.pl/kobieta/0,0.html', 'site': 'gazeta.pl'},
		{'url': 'http://kobieta.wp.pl', 'site': 'wp.pl'},
		]
		
		QtCore.QObject.connect(self.ui.startButton,QtCore.SIGNAL("clicked()"), self.start)
		QtCore.QObject.connect(self.ui.webView,QtCore.SIGNAL("loadFinished (bool)"), self.loadFinished)
		QtCore.QObject.connect(self.ui.webView,QtCore.SIGNAL("loadProgress (int)"), self.loadProgress)
		
	def start(self):
		"""
		Start loading the web pages
		"""
		self.ui.startButton.setEnabled(False)
		nexturl = self.__getNextUrl()
		if nexturl:
			self.ui.webView.load(nexturl)
	
	def loadFinished(self):
		"""
		A page was loaded - get the data and load next page
		"""
		page = self.ui.webView.page()
		frame = page.currentFrame()
		content = frame.toHtml()
		print u'Page content, got %s bytes' % len(content)
		
		# process the data here
		
		self.currentIndex += 1
		nexturl = self.__getNextUrl()
		if nexturl:
			self.ui.webView.load(nexturl)
	
	def loadProgress(self, progress):
		"""
		Print the progress of page load
		"""
		print progress
	
	def __getNextUrl(self):
		"""
		Return next URL in list
		"""
		if len(self.sites) - 1 >= self.currentIndex:
			newurl = QtCore.QUrl(self.sites[self.currentIndex]['url'])
		else:
			print 'No next url'
			newurl = False
		
		return newurl
		

if __name__ == "__main__":
	app = QtGui.QApplication(sys.argv)
	myapp = GatherAds()
	myapp.show()
	sys.exit(app.exec_())

The most important part is the __getNextUrl method, which returns the URL of the web page to load. self.currentIndex starts at 0 (the first element in the list); when a page finishes loading, loadFinished increments the value and calls __getNextUrl again, until we hit the end of the list. The start method, attached to the "Start" button, loads the first page and so starts the chain. As a helper I used loadProgress to print the page load progress to the console.
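The chain itself does not depend on Qt, so it can be tried out in isolation. The sketch below is a simplified stand-in, not the app's code: the UrlChain class, its method names and the example.com URLs are made up, and a plain loop plays the role of the loadFinished signal:

```python
class UrlChain(object):
    """Qt-free model of the currentIndex / __getNextUrl logic."""

    def __init__(self, sites):
        self.sites = sites
        self.currentIndex = 0

    def getNextUrl(self):
        # same bounds check as in the app, written the other way round
        if self.currentIndex < len(self.sites):
            return self.sites[self.currentIndex]['url']
        return None

    def pageLoaded(self):
        # what loadFinished does: advance, then ask for the next URL
        self.currentIndex += 1
        return self.getNextUrl()


sites = [{'url': 'http://example.com/a', 'site': 'example.com'},
         {'url': 'http://example.com/b', 'site': 'example.com'}]
chain = UrlChain(sites)
loaded = []
url = chain.getNextUrl()      # the "Start" click
while url:
    loaded.append(url)        # stands in for webView.load(url)
    url = chain.pageLoaded()  # stands in for the loadFinished signal
```

In the real app there is no loop: each loadFinished signal triggers the next load, so the GUI stays responsive while pages are fetched.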

The pages are being loaded, but we can improve the process. For example, we can use QtWebKit.QWebSettings to turn off image loading (this requires importing QtWebKit from PyQt4):
s = self.ui.webView.settings()
s.setAttribute(QtWebKit.QWebSettings.AutoLoadImages, False)
s.setAttribute(QtWebKit.QWebSettings.JavascriptCanOpenWindows, False)
s.setAttribute(QtWebKit.QWebSettings.PluginsEnabled, False)
Each page loads only once, and we want to refresh it a few times to catch most of the ads that can show up on it:
# -*- coding: utf-8 -*-
import sys

from PyQt4 import QtCore, QtGui, QtWebKit
from gather import Ui_gatherer

class GatherAds(QtGui.QMainWindow):
	def __init__(self, parent=None):
		QtGui.QMainWindow.__init__(self, parent)
		self.ui = Ui_gatherer()
		self.ui.setupUi(self)
		self.refreshSite = 3
		
		self.currentIndex = 0
		self.currentRefresh = 0
		self.sites = [{'url': 'http://auth.gazeta.pl/googleAuth/login.do', 'site': 'gazeta.pl'},
		{'url': 'http://kobieta.gazeta.pl/kobieta/0,0.html', 'site': 'gazeta.pl'},
		{'url': 'http://kobieta.wp.pl', 'site': 'wp.pl'},
		]
		
		s = self.ui.webView.settings()
		s.setAttribute(QtWebKit.QWebSettings.AutoLoadImages, False)
		s.setAttribute(QtWebKit.QWebSettings.JavascriptCanOpenWindows, False)
		s.setAttribute(QtWebKit.QWebSettings.PluginsEnabled, False)
		
		QtCore.QObject.connect(self.ui.startButton,QtCore.SIGNAL("clicked()"), self.start)
		QtCore.QObject.connect(self.ui.webView,QtCore.SIGNAL("loadFinished (bool)"), self.loadFinished)
		QtCore.QObject.connect(self.ui.webView,QtCore.SIGNAL("loadProgress (int)"), self.loadProgress)
		
	def start(self):
		"""
		Start loading the web pages
		"""
		self.ui.startButton.setEnabled(False)
		nexturl = self.__getNextUrl()
		if nexturl:
			self.ui.webView.load(nexturl)
	
	def loadFinished(self):
		"""
		A page was loaded - get the data and load next page
		"""
		page = self.ui.webView.page()
		frame = page.currentFrame()
		content = frame.toHtml()
		print u'Page content, got %s bytes' % len(content)
		
		# process the data here
		
		if self.currentRefresh < self.refreshSite:
			print 'Refresh +1'
			self.currentRefresh += 1
		else:
			print 'Index +1'
			self.currentRefresh = 0
			self.currentIndex += 1
		
		nexturl = self.__getNextUrl()
		if nexturl:
			self.ui.webView.load(nexturl)
	
	def loadProgress(self, progress):
		"""
		Print the progress of page load
		"""
		print progress
	
	def __getNextUrl(self):
		"""
		Return next URL in list
		"""
		# set the progress bar of pages loaded (setValue expects an int)
		progress_value = int((float(self.currentIndex)/float(len(self.sites)))*100)
		self.ui.sitesBar.setValue(progress_value)
		
		# set the progress bar of refreshes
		progress_value = int((float(self.currentRefresh)/float(self.refreshSite))*100)
		self.ui.iterationBar.setValue(progress_value)
		
		if len(self.sites) - 1 >= self.currentIndex:
			newurl = QtCore.QUrl(self.sites[self.currentIndex]['url'])
		else:
			print 'No next url'
			newurl = False
		
		return newurl
		

if __name__ == "__main__":
	app = QtGui.QApplication(sys.argv)
	myapp = GatherAds()
	myapp.show()
	sys.exit(app.exec_())
The solution is similar. We use self.currentRefresh to hold the current refresh count. In loadFinished we check whether it is less than the maximum refresh count (self.refreshSite); if so, we increment only self.currentRefresh and leave self.currentIndex alone, so __getNextUrl returns the same site again. Once the refresh count reaches the limit, we reset the counter and increment self.currentIndex to move on to the next page. With this check every page is loaded self.refreshSite + 1 times: the initial load plus self.refreshSite refreshes. In __getNextUrl I've also added the code that sets the sitesBar and iterationBar progress values.
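The counter logic can be checked without Qt by replaying it in a plain loop. This is only a sketch: simulate_loads is a hypothetical helper and the one-letter page names are placeholders, but the if/else mirrors the check in loadFinished:

```python
def simulate_loads(sites, refresh_site):
    """Return the pages in the order the app would load them."""
    current_index = 0
    current_refresh = 0
    loads = []
    while current_index < len(sites):
        loads.append(sites[current_index])   # stands in for webView.load(...)
        # the same decision loadFinished makes after every load
        if current_refresh < refresh_site:
            current_refresh += 1
        else:
            current_refresh = 0
            current_index += 1
    return loads


loads = simulate_loads(['a', 'b'], 3)
per_page = dict((name, loads.count(name)) for name in ['a', 'b'])
```

With a limit of 3 each page ends up loaded four times. If exactly self.refreshSite loads per page are wanted, one option would be to compare self.currentRefresh + 1 against self.refreshSite instead.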

The app now has a fully working chain of page loads. In the next article we will implement plugins that extract data from the loaded pages.

Source Code

RkBlog

PyQt and GUI, 10 November 2009

