From charlesreid1



[[Pywikibot/Scraping]] - how to write a MediaWiki scraper
==Using Custom Scripts from Outside Pywikibot Directory==
You can install pywikibot and import it from any directory (e.g., /tmp), but pywikibot must be able to find a user config file (user-config.py). If you try to import pywikibot without one,
<pre>
import pywikibot
</pre>
you'll see this RuntimeError:
<pre>
RuntimeError: No user-config.py found in directory '/tmp'.
  Please check that user-config.py is stored in the correct location.
  Directory where user-config.py is searched is determined as follows:
    Return the directory in which user-specific information is stored.
    This is determined in the following order:
    1.  If the script was called with a -dir: argument, use the directory
        provided in this argument.
    2.  If the user has a PYWIKIBOT2_DIR environment variable, use the value
        of it.
    3.  If user-config is present in current directory, use the current
        directory.
    4.  If user-config is present in pwb.py directory, use that directory
    5.  Use (and if necessary create) a 'pywikibot' folder under
        'Application Data' or 'AppData\Roaming' (Windows) or
        '.pywikibot' directory (Unix and similar) under the user's home
        directory.
    Set PYWIKIBOT2_NO_USER_CONFIG=1 to disable loading user-config.py
    @param test_directory: Assume that a user config file exists in this
        directory. Used to test whether placing a user config file in this
        directory will cause it to be selected as the base directory.
    @type test_directory: str or None
    @rtype: unicode
</pre>
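The search order in that error message can be sketched in plain Python. This is only an illustration of the five documented steps, not pywikibot's actual code: the function name <code>find_base_dir</code> and its parameters are made up for this example, and the Windows 'Application Data' branch of step 5 is left out.

```python
from pathlib import Path


def find_base_dir(argv, environ, cwd, pwb_dir, home):
    """Pick a base directory following the order listed in the error message.

    Hypothetical helper for illustration; pywikibot's real lookup differs
    in detail (for example, it also handles Windows paths).
    """
    # 1. An explicit -dir:PATH argument wins.
    for arg in argv:
        if arg.startswith("-dir:"):
            return Path(arg[len("-dir:"):])
    # 2. Next, the PYWIKIBOT2_DIR environment variable.
    if environ.get("PYWIKIBOT2_DIR"):
        return Path(environ["PYWIKIBOT2_DIR"])
    # 3. and 4. The current directory, then the pwb.py directory,
    #    but only if user-config.py is actually present there.
    for candidate in (Path(cwd), Path(pwb_dir)):
        if (candidate / "user-config.py").is_file():
            return candidate
    # 5. Fall back to ~/.pywikibot (Unix-like systems only in this sketch).
    return Path(home) / ".pywikibot"


print(find_base_dir(["-dir:/opt/bot"], {}, "/tmp", "/tmp", "/home/me"))
```

Read this way, the error above is what you get when none of the first four steps finds a config file: running from /tmp with no -dir: argument and no environment variable leaves only the fallback.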
When you are in the pywikibot directory, and you run the login script,
<pre>
python pwb.py login
</pre>
this creates a user-config.py configuration file containing (almost entirely superfluous) information about the wiki you log into.
Once that file is in place, pywikibot works pretty much like it did before:
<pre>
In [1]: import pywikibot
In [2]: s = pywikibot.Site()
In [3]: s
Out[3]: APISite("en", "charlesreid1")
In [5]: pywikibot.Page(s,'Linux/Wireless')
Out[5]: Page('Linux/Wireless')
In [8]: hist = list(p.fullVersionHistory())
WARNING: /Users/charles/Library/Python/3.6/bin/ipython:1: DeprecationWarning: pywikibot.page.BasePage.fullVersionHistory is deprecated; use Page.revisions(content=True) instead.
  #!/usr/local/opt/python3/bin/python3.6
In [10]: hist = list(p.revisions(content=False))
In [11]: hist[0:3]
Out[11]:
[{'revid': 16262, 'text': None, 'timestamp': Timestamp(2017, 4, 15, 6, 35, 29), 'user': 'Admin', 'anon': False, 'comment': '/* Flags */', 'minor': False, 'rollbacktoken': None, '_parent_id': 16261, '_content_model': 'wikitext', '_sha1': 'df790e36c30e7895fea4e114d40d3515c0345b23'},
{'revid': 16261, 'text': None, 'timestamp': Timestamp(2017, 4, 15, 6, 32, 26), 'user': 'Admin', 'anon': False, 'comment': '/* Joining network with WPA encryption */', 'minor': False, 'rollbacktoken': None, '_parent_id': 16260, '_content_model': 'wikitext', '_sha1': '430d06d81ecca002199b895b4ab8de76615f86a2'},
{'revid': 16260, 'text': None, 'timestamp': Timestamp(2017, 4, 15, 6, 31, 10), 'user': 'Admin', 'anon': False, 'comment': '/* WPA Supplicant Method */', 'minor': False, 'rollbacktoken': None, '_parent_id': 16259, '_content_model': 'wikitext', '_sha1': 'ca1ee6ab00d2305992c71c14127e0fc474476a84'}]
In [12]: print(type(hist[0]))
<class 'pywikibot.page.Revision'>
In [19]: revision_dictionary = dict(hist[0].__dict__)
In [20]: print(revision_dictionary.keys())
dict_keys(['revid', 'text', 'timestamp', 'user', 'anon', 'comment', 'minor', 'rollbacktoken', '_parent_id', '_content_model', '_sha1'])
..............................................
In [49]: all_pages = list(site.allpages())
In [50]: print(type(all_pages))
<class 'list'>
In [51]: print(len(all_pages))
2292
In [52]: print(type(all_pages[0]))
<class 'pywikibot.page.Page'>
</pre>
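Once you have a list of revisions, it is often handier to keep only a few fields per revision. Here is a minimal sketch of that idea. It assumes revision objects expose <code>revid</code>, <code>user</code>, <code>comment</code>, and so on as attributes, as the session above suggests; the helper name <code>revision_rows</code> is made up, and the stand-in objects replace a live wiki so the snippet runs on its own.

```python
from types import SimpleNamespace

# Fields to keep by default; the names match the keys printed in the session above.
FIELDS = ("revid", "timestamp", "user", "comment", "minor")


def revision_rows(revisions, fields=FIELDS):
    """Turn revision-like objects into plain dicts holding only `fields`.

    Missing attributes become None instead of raising AttributeError.
    """
    return [{name: getattr(rev, name, None) for name in fields}
            for rev in revisions]


# Stand-in objects (hypothetical), used here instead of p.revisions().
fake_hist = [
    SimpleNamespace(revid=16262, user="Admin", comment="/* Flags */", minor=False),
    SimpleNamespace(revid=16261, user="Admin",
                    comment="/* Joining network with WPA encryption */", minor=False),
]

rows = revision_rows(fake_hist, fields=("revid", "user", "comment"))
print(rows[0])  # {'revid': 16262, 'user': 'Admin', 'comment': '/* Flags */'}
```

With a real page you would pass <code>p.revisions(content=False)</code> in place of <code>fake_hist</code>, assuming the attribute names are unchanged in your pywikibot version.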


==Page Objects==

Revision as of 21:56, 31 January 2018

Setting this up is confusing as hell, mainly because the documentation is lacking.

Pywikibot is a single standalone Python script that works a little bit like a framework. To use it, you assemble various "actions", and run each action through the pywikibot Python script. The first action you'll usually run is the login action, which stores credentials for a wiki. Then, you can run any of the other built-in actions, or define your own actions.

Getting, Configuring, Installing

I have the pywikibot software set up with two remotes: one official (Wikimedia gerrit), and one unofficial (my own git repo).

Link to pywikibot on Wikimedia Foundation's gerrit: https://gerrit.wikimedia.org/r/pywikibot/core.git

Link to pywikibot on git.charlesreid1.com: https://charlesreid1.com:3000/wiki/pywikibot

Wikimedia gerrit

Note that the official pywikibot repo is also mirrored on GitHub: https://github.com/wikimedia/pywikibot-core/

Start by checking it out:

$ git clone https://gerrit.wikimedia.org/r/pywikibot/core.git pywikibot
$ cd pywikibot

Install all the pip stuff that you may need:

$ pip install -r requirements.txt

Update git submodules:

$ git submodule update --init

Add a custom family file to the big directory of family files:

$ ls pywikibot/families
...
wikivoyage_family.py
wiktionary_family.py
wowwiki_family.py

This is where you will put your custom family file. Here's what the custom family file looks like:

from pywikibot import family

class Family(family.Family):
    def __init__(self):
        family.Family.__init__(self)
        self.name = 'charlesreid1'
        self.langs = {
            'en': 'charlesreid1.com',
        }

Copy and paste this into pwb/pywikibot/families/charlesreid1_family.py (where pwb is the name of the directory where you checked out the git repository; the clone command above names it pywikibot).

Now you should be able to log into the wiki as your bot:

$ python pwb.py login
Password for user Bleep bloop on charlesreid1:en (no characters will be shown):

Logging in to charlesreid1:en as Bleep bloop
WARNING: /Users/charles/codes/pywikibot/pywikibot/tools/__init__.py:1717: UserWarning: File /Users/charles/codes/pywikibot/pywikibot.lwp had 644 mode; converted to 600 mode.
Logged in on charlesreid1:en as Bleep bloop.

git.charlesreid1.com

Link to pywikibot on git.charlesreid1.com: https://charlesreid1.com:3000/wiki/pywikibot

To push changes to the pywikibot on git.charlesreid1.com I set up the repo with another remote:

$ git remote add cmr https://charlesreid1.com:3000/wiki/pywikibot
$ git push cmr master

Running Simple Scripts

There are two ways to use pywikibot:

  • Write your own custom actions
  • Use a bundle of scripts that come packaged with pywikibot

Using Provided Scripts

Here is a list of all the pre-written scripts for MediaWiki wikis: https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:Pywikibot/Scripts

These are also located in the scripts/ folder of the repository.

To run a given script, you actually run it THROUGH the pwb.py script. See example below.

Redirect.py Script

Suppose we wanted to run the script redirect.py to programmatically deal with redirects on our wiki. We can start by looking at the documentation for this file, which shows us there are many options for this script:

Script to resolve double redirects, and to delete broken redirects.

Requires access to MediaWiki's maintenance pages or to a XML dump file.
Delete function requires adminship.

Syntax:

    python pwb.py redirect action [-arguments ...]

where action can be one of these:

double         Fix redirects which point to other redirects.
do             Shortcut action command is "do".

broken         Tries to fix redirect which point to nowhere by using the last
br             moved target of the destination page. If this fails and the
               -delete option is set, it either deletes the page or marks it
               for deletion depending on whether the account has admin rights.
               It will mark the redirect not for deletion if there is no speedy
               deletion template available. Shortcut action command is "br".

both           Both of the above. Retrieves redirect pages from live wiki,
               not from a special page.

and arguments can be:

-xml           Retrieve information from a local XML dump
               (https://download.wikimedia.org). Argument can also be given as
               "-xml:filename.xml". Cannot be used with -fullscan or -moves.

-fullscan      Retrieve redirect pages from live wiki, not from a special page
               Cannot be used with -xml.

-moves         Use the page move log to find double-redirect candidates. Only
               works with action "double", does not work with -xml.

               NOTE: You may use only one of these options above.
               If neither of -xml -fullscan -moves is given, info will be
               loaded from a special page of the live wiki.

-page:title    Work on a single page

-namespace:n   Namespace to process. Can be given multiple times, for several
               namespaces. If omitted, only the main (article) namespace is
               treated.

-offset:n      With -moves, the number of hours ago to start scanning moved
               pages. With -xml, the number of the redirect to restart with
               (see progress). Otherwise, ignored.

-start:title   The starting page title in each namespace. Page need not exist.

-until:title   The possible last page title in each namespace. Page needs not
               exist.

-total:n       The maximum count of redirects to work upon. If omitted, there
               is no limit.

-delete        Prompt the user whether broken redirects should be deleted (or
               marked for deletion if the account has no admin rights) instead
               of just skipping them.

-sdtemplate:x  Add the speedy deletion template string including brackets.
               This enables overriding the default template via i18n or
               to enable speedy deletion for projects other than wikipedias.

-always        Don't prompt you for each replacement.

Suppose we want to eliminate double-redirects. To do this, we run the redirect script through pwb.py, and pass it the double argument like so:

$ python pwb.py redirect double

Customizing Pywikibot Behavior

Pywikibot/Setup - how to set up pywikibot

Pywikibot/Sites - how to perform actions on entire sites

Pywikibot/Pages - how to perform actions on individual pages and their revision histories

Pywikibot/Scraping - how to write a MediaWiki scraper

Page Objects

Page objects have a large number of available methods:

In [42]: dir(page)
Out[42]:
['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_applicable_protections',
 '_cache_attrs',
 '_cmpkey',
 '_contentmodel',
 '_cosmetic_changes_hook',
 '_getInternals',
 '_get_parsed_page',
 '_isredir',
 '_latest_cached_revision',
 '_link',
 '_namespace_obj',
 '_pageid',
 '_protection',
 '_revid',
 '_revisions',
 '_save',
 '_timestamp',
 'applicable_protections',
 'aslink',
 'autoFormat',
 'backlinks',
 'botMayEdit',
 'canBeEdited',
 'categories',
 'change_category',
 'clear_cache',
 'content_model',
 'contributingUsers',
 'contributors',
 'coordinates',
 'data_item',
 'data_repository',
 'defaultsort',
 'delete',
 'depth',
 'editTime',
 'embeddedin',
 'encoding',
 'exists',
 'expand_text',
 'extlinks',
 'fullVersionHistory',
 'full_url',
 'get',
 'getCategoryRedirectTarget',
 'getCreator',
 'getDeletedRevision',
 'getLatestEditors',
 'getMovedTarget',
 'getOldVersion',
 'getRedirectTarget',
 'getReferences',
 'getRestrictions',
 'getTemplates',
 'getVersionHistory',
 'getVersionHistoryTable',
 'image_repository',
 'imagelinks',
 'interwiki',
 'isAutoTitle',
 'isCategory',
 'isCategoryRedirect',
 'isDisambig',
 'isEmpty',
 'isFlowPage',
 'isImage',
 'isIpEdit',
 'isRedirectPage',
 'isStaticRedirect',
 'isTalkPage',
 'is_categorypage',
 'is_filepage',
 'is_flow_page',
 'iterlanglinks',
 'itertemplates',
 'langlinks',
 'lastNonBotUser',
 'latestRevision',
 'latest_revision',
 'latest_revision_id',
 'linkedPages',
 'loadDeletedRevisions',
 'markDeletedRevision',
 'merge_history',
 'move',
 'moved_target',
 'namespace',
 'oldest_revision',
 'pageAPInfo',
 'page_image',
 'pageid',
 'permalink',
 'preloadText',
 'previousRevision',
 'previous_revision_id',
 'properties',
 'protect',
 'protection',
 'purge',
 'put',
 'put_async',
 'raw_extracted_templates',
 'removeImage',
 'replaceImage',
 'revision_count',
 'revisions',
 'save',
 'section',
 'sectionFreeTitle',
 'set_redirect_target',
 'site',
 'templates',
 'templatesWithParams',
 'text',
 'title',
 'titleForFilename',
 'titleWithoutNamespace',
 'toggleTalkPage',
 'touch',
 'undelete',
 'urlname',
 'userName',
 'version',
 'watch']

To build a graph of links on a given wiki, there are a couple of methods that are useful:

  • page.backlinks - lists pages that link to the given page
  • page.linkedPages - lists pages on this wiki that this page links to
  • page.extlinks - lists external targets that this page links to
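  • page.getReferences - lists pages that refer to or embed the given page (per its help text, this can include template inclusions)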
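
As a rough sketch of the graph-building idea, the snippet below maps each page title to the titles it links to, using only title() and linkedPages(). The function name build_link_graph and the FakePage stand-in class are made up for illustration; with a live wiki you would pass real Page objects (for example from the site's allpages()) instead.

```python
def build_link_graph(pages):
    """Map each page title to the sorted titles of the pages it links to."""
    graph = {}
    for page in pages:
        graph[page.title()] = sorted(
            target.title() for target in page.linkedPages()
        )
    return graph


class FakePage:
    """Hypothetical stand-in offering the two methods used above."""

    def __init__(self, title, links=()):
        self._title = title
        self._links = list(links)

    def title(self):
        return self._title

    def linkedPages(self):
        # Real pages return a generator, so return an iterator here too.
        return iter(self._links)


cards = FakePage("Cards")
flags = FakePage("Flags")
binomial = FakePage("AOCP/Binomial Coefficients", [flags, cards])

graph = build_link_graph([binomial, cards])
print(graph)
# {'AOCP/Binomial Coefficients': ['Cards', 'Flags'], 'Cards': []}
```

This only records outgoing links; swapping linkedPages() for backlinks() would give the reverse edges.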

backlinks

In [43]: page.backlinks
Out[43]: <bound method deprecated_args.<locals>.decorator.<locals>.wrapper of Page('AOCP/Binomial Coefficients')>

In [44]: page.backlinks()
Out[44]: <itertools.chain at 0x10838c828>

In [45]: list(page.backlinks())
Out[45]:
[Page('Flags'),
 Page('Algorithms/Combinatorics and Heuristics'),
 Page('Algorithms/Combinatorics'),
 Page('AOCP/Multisets'),
 Page('AOCP/Permutations'),
 Page('Template:AOCPFlag'),
 Page('AOCP/Multinomial Coefficients'),
 Page('AOCP/Harmonic Numbers'),
 Page('AOCP/Fibonacci Numbers'),
 Page('ACOP/Generating Functions'),
 Page('AOCP'),
 Page('AOCP/Generating Functions'),
 Page('Generating Functions'),
 Page('AOCP/Combinatorics'),
 Page('Cards'),
 Page('Binomial Coefficients'),
 Page('AOCP/Generating Permutations and Tuples'),
 Page('Letter Coverage'),
 Page('Five Letter Words')]


linkedPages

Calling linkedPages() returns all pages that the current page links to. The method returns a PageGenerator object, similar to the site's allpages() method. As before, we pass the generator to list() to consume it and build a list from the results.

In [46]: list(page.linkedPages())
Out[46]:
[Page('AOCP/Boolean Functions'),
 Page('AOCP/Combinatorial Algorithms'),
 Page('AOCP/Infinite Series'),
 Page('Algorithm Analysis/Randomized Quick Sort'),
 Page('Algorithm Analysis/Substring Pattern Matching'),
 Page('ACOP/Generating Functions'),
 Page('AOCP/Combinatorics'),
 Page('AOCP/Fibonacci Numbers'),
 Page('AOCP/Five Letter Words'),
 Page('AOCP/Generating Permutations and Tuples'),
 Page('AOCP/Harmonic Numbers'),
 Page('AOCP/Multinomial Coefficients'),
 Page('AOCP/Multisets'),
 Page('Algorithm Analysis/Matrix Multiplication'),
 Page('Algorithm Analysis/Merge Sort'),
 Page('Algorithm complexity'),
 Page('Algorithmic Analysis of Sort Functions'),
 Page('Algorithms'),
 Page('Algorithms/Combinatorics'),
 Page('Algorithms/Combinatorics and Heuristics'),
 Page('Algorithms/Data Structures'),
 Page('Algorithms/Graphs'),
 Page('Algorithms/Optimization'),
 Page('Algorithms/Search'),
 Page('Algorithms/Sort'),
 Page('Algorithms/Strings'),
 Page('Amortization'),
 Page('Amortization/Accounting Method'),
 Page('Binary Search'),
 Page('Binary Search Modifications'),
 Page('CS'),
 Page('Cards'),
 Page('Divide and Conquer'),
 Page('Divide and Conquer/Master Theorem'),
 Page('Estimation'),
 Page('Estimation/BitsAndBytes'),
 Page('Five Letter Words'),
 Page('Flags'),
 Page('Heap Sort'),
 Page('Letter Coverage'),
 Page('Merge Sort'),
 Page('Project Euler'),
 Page('Quick Sort'),
 Page('Rubiks Cube/Permutations'),
 Page('Rubiks Cube/Tuples'),
 Page('Skiena Chapter 4 Questions'),
 Page('Theta vs Big O'),
 Page('Template:AOCPFlag'),
 Page('Template:AlgorithmsFlag'),
 Category('Category:AOCP')]

In [47]: type(page.linkedPages())
Out[47]: pywikibot.data.api.PageGenerator

extlinks

Asking for the external links on a given page will return a plain generator:

In [48]: type(page.extlinks())
Out[48]: generator

In [49]: list(page.extlinks())
Out[49]:
['http://charlesreid1.com/w/index.php?title=Template:AOCPFlag&action=edit',
 'http://charlesreid1.com/w/index.php?title=Template:AlgorithmsFlag&action=edit',
 'https://charlesreid1.com:3000/cs/study-plan']

getReferences

Not sure how this differs from backlinks; the results are almost identical. In this example the only difference is Page('Binomial Coefficients'), which appears in backlinks but not in getReferences.

In [50]: list(page.getReferences())
Out[50]:
[Page('Flags'),
 Page('Algorithms/Combinatorics and Heuristics'),
 Page('Algorithms/Combinatorics'),
 Page('AOCP/Multisets'),
 Page('AOCP/Permutations'),
 Page('Template:AOCPFlag'),
 Page('AOCP/Multinomial Coefficients'),
 Page('AOCP/Harmonic Numbers'),
 Page('AOCP/Fibonacci Numbers'),
 Page('ACOP/Generating Functions'),
 Page('AOCP'),
 Page('AOCP/Generating Functions'),
 Page('Generating Functions'),
 Page('AOCP/Combinatorics'),
 Page('Cards'),
 Page('AOCP/Generating Permutations and Tuples'),
 Page('Letter Coverage'),
 Page('Five Letter Words')]

In [51]: type(page.getReferences())
Out[51]: itertools.islice

In [52]: type(page.backlinks())
Out[52]: itertools.chain
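
Rather than eyeballing the two lists, you can compare them as sets of titles. This is a minimal sketch: the helper name title_difference and the Titled stand-in class are made up for this example, and it assumes only that each item has a title() method.

```python
def title_difference(first, second):
    """Return the titles found in `first` but not in `second`, sorted."""
    second_titles = {page.title() for page in second}
    return sorted({page.title() for page in first} - second_titles)


class Titled:
    """Hypothetical stand-in for a Page; only title() is needed."""

    def __init__(self, title):
        self._title = title

    def title(self):
        return self._title


# A shortened version of the two result lists shown above.
fake_backlinks = [Titled("Flags"), Titled("Cards"), Titled("Binomial Coefficients")]
fake_references = [Titled("Flags"), Titled("Cards")]

print(title_difference(fake_backlinks, fake_references))
# ['Binomial Coefficients']
```

On a live page the call would be title_difference(page.backlinks(), page.getReferences()); swapping the arguments shows anything that is only in getReferences.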

If we ask for some help, we can see the difference between these two methods:

In [54]: help(page.getReferences)
Help on method getReferences in module pywikibot.page:

getReferences(follow_redirects=True, withTemplateInclusion=True, onlyTemplateInclusion=False, redirectsOnly=False, namespaces=None, total=None, content=False, step=NotImplemented) method of pywikibot.page.Page instance
    Return an iterator all pages that refer to or embed the page.

    If you need a full list of referring pages, use
    C{pages = list(s.getReferences())}

    @param follow_redirects: if True, also iterate pages that link to a
        redirect pointing to the page.
    @param withTemplateInclusion: if True, also iterate pages where self
        is used as a template.
    @param onlyTemplateInclusion: if True, only iterate pages where self
        is used as a template.
    @param redirectsOnly: if True, only iterate redirects to self.
    @param namespaces: only iterate pages in these namespaces
    @param total: iterate no more than this number of pages in total
    @param content: if True, retrieve the content of the current version
        of each referring page (default False)


In [55]: help(page.backlinks)
Help on method backlinks in module pywikibot.page:

backlinks(followRedirects=True, filterRedirects=None, namespaces=None, total=None, content=False, step=NotImplemented) method of pywikibot.page.Page instance
    Return an iterator for pages that link to this page.

    @param followRedirects: if True, also iterate pages that link to a
        redirect pointing to the page.
    @param filterRedirects: if True, only iterate redirects; if False,
        omit redirects; if None, do not filter
    @param namespaces: only iterate pages in these namespaces
    @param total: iterate no more than this number of pages in total
    @param content: if True, retrieve the content of the current version
        of each referring page (default False)

Flags