Difference between revisions of "User:BenSbot/Code1"

From JoCopedia
Jump to navigation Jump to search
m (updated with bug fix included)
(Updated code)
Line 1: Line 1:
 
Here is the code for the first function I was programmed to do. If you have any questions ask in the discussion page.
 
Here is the code for the first function I was programmed to do. If you have any questions ask in the discussion page.
  
<pre style="
+
<pre>
white-space: -moz-pre-wrap;
 
white-space: -pre-wrap;
 
white-space: -o-pre-wrap;
 
white-space: pre-wrap;
 
word-wrap: break-word;
 
">
 
 
import wikipedia
 
import wikipedia
 
import catlib
 
import catlib
Line 36: Line 30:
 
             linkslist[lx] = (x,1)
 
             linkslist[lx] = (x,1)
  
text = "The following is a list of the songs [[Jonathan Coulton]] has played in concert.  This list has been compiled from the setlists currently available here on JoCopedia in the [[:Category:Shows|Shows]] section, by an awesome bot designed by user [[User:BenS|BenS]].  Keep in mind that not all setlists are currently available to JoCopedia, and not all setlists are 100%.  But this is a pretty good indicator.  This list is current as of " + str(datetime.date.today()) + "\n\n"
+
bls = wikipedia.Page(site, u"User:BenSbot/Code1/Blacklist").get()
 +
blacklist = q.findall(bls)
 +
for link in blacklist:
 +
    if link.lower() in linkslist:
 +
        del linkslist[link.lower()]
 +
 
 +
 
 +
text = "The following is a list of the songs [[Jonathan Coulton]] has played \
 +
in concert.  This list has been compiled from the setlists currently available \
 +
here on JoCopedia in the [[:Category:Shows|Shows]] section, by an awesome bot \
 +
designed by user [[User:BenS|BenS]].  Keep in mind that not all setlists are \
 +
currently available to JoCopedia, and not all setlists are 100%.  But this is a\
 +
pretty good indicator.  This list is current as of " + \
 +
str(datetime.date.today()) + "\n\n"
 +
 
  
 
items = linkslist.values()
 
items = linkslist.values()
Line 65: Line 73:
 
### Run regular expression r on it replacing all positive results with "]]" (the intention of this is to prevent links given a name other than the page name from breaking the code)
 
### Run regular expression r on it replacing all positive results with "]]" (the intention of this is to prevent links given a name other than the page name from breaking the code)
 
### Add the items to a library (contains link and count, if lowercase link not present new entry added, if is present 1 is added to existing entries count)(this is the most complex bit of code in there as I has made it non-case sensitive)
 
### Add the items to a library (contains link and count, if lowercase link not present new entry added, if is present 1 is added to existing entries count)(this is the most complex bit of code in there as I has made it non-case sensitive)
 +
# Get page [[User:BenSbot/Code1/Blacklist]] and run a regular expression on it to produce a list of known non-songs found in songslist.
 +
# Check to see if the known non songs are in songslist, and if so remove them.
 
# Define variable "text" with the introductory paragraph MitchO wrote, affixing todays date and two line breaks on the end.
 
# Define variable "text" with the introductory paragraph MitchO wrote, affixing todays date and two line breaks on the end.
 
# The next bit of code sorts by count then by link so entries with the same count are in alphabetical order, then it affixes them to the end of "text" in the correct format
 
# The next bit of code sorts by count then by link so entries with the same count are in alphabetical order, then it affixes them to the end of "text" in the correct format

Revision as of 12:13, 21 July 2009

Here is the code for the first function I was programmed to do. If you have any questions ask in the discussion page.

import wikipedia
import catlib
import pagegenerators
import re
import datetime


text = ""
site = wikipedia.getSite()
linkslist = {}
p = re.compile('^[^#\\n].*$' , re.M)
q = re.compile('\\[\\[[^\\]]*\\]\\]')
r = re.compile('[ \t]{0,2}\\|.*')

showscat = catlib.Category(site,'Category:Shows')
showslist = list(pagegenerators.CategorizedPageGenerator(showscat))
for b in showslist: 
    text = p.sub("",b.get())
    links = q.findall(text)
    for x in links:
        x = r.sub("]]",x)
        lx = x.lower()
        if linkslist.has_key(lx):
            v = linkslist[lx][1] + 1
            linkslist[lx] = (linkslist[lx][0], v)
        else:
            linkslist[lx] = (x,1)

bls = wikipedia.Page(site, u"User:BenSbot/Code1/Blacklist").get()
blacklist = q.findall(bls)
for link in blacklist:
    if link.lower() in linkslist:
        del linkslist[link.lower()]


text = "The following is a list of the songs [[Jonathan Coulton]] has played \
in concert.  This list has been compiled from the setlists currently available \
here on JoCopedia in the [[:Category:Shows|Shows]] section, by an awesome bot \
designed by user [[User:BenS|BenS]].  Keep in mind that not all setlists are \
currently available to JoCopedia, and not all setlists are 100%.  But this is a\
pretty good indicator.  This list is current as of " + \
str(datetime.date.today()) + "\n\n"


items = linkslist.values()

items.sort(lambda x,y: cmp(y[1], x[1]) or cmp(x[0], y[0]))
for l, c in items:
    text = text + ("*" + l + ": " + repr(c) + "\n")

text = text + "\n" + "[[Category:Show Statistics]]"

page = wikipedia.Page(site, u"SongStats")
page.put(text, u"Song statistics")

print "fin"

Explanation

Here is a step by step explanation of what the above means:

  1. Imports python functionality (both python's own and pywikipedia's)
  2. Define some variables, mostly as empty to add to later
  3. Define three regular expressions: the first looks for lines that do not start with a hash (p); the second looks for anything contained in double square brackets with no square brackets within the double square brackets (q) and the third looks for anything after "|" (with two optional spaces before) inclusive (r)
  4. Turn the pages in the category shows into a list
  5. Go through the list one by one and
    1. Collect the text from the page (in wiki code) (b.get())
    2. Run regular expression p on it removing all positive results (removing anything but lines that start with a hash(a numbered list))
    3. Run regular expression q on it putting all internal links into a list
    4. Go through the list of links one by one and
      1. Run regular expression r on it replacing all positive results with "]]" (the intention of this is to prevent links given a name other than the page name from breaking the code)
      2. Add the items to a library (contains link and count, if lowercase link not present new entry added, if is present 1 is added to existing entries count)(this is the most complex bit of code in there as I has made it non-case sensitive)
  6. Get page User:BenSbot/Code1/Blacklist and run a regular expression on it to produce a list of known non-songs found in songslist.
  7. Check to see if the known non songs are in songslist, and if so remove them.
  8. Define variable "text" with the introductory paragraph MitchO wrote, affixing todays date and two line breaks on the end.
  9. The next bit of code sorts by count then by link so entries with the same count are in alphabetical order, then it affixes them to the end of "text" in the correct format
  10. The category is added to the end of "text"
  11. Text is put on the wiki as the page "SongStats"
  12. The python module prints fin so I know it is done