BigBrotherBot v1.9.0
System Development Information for the BigBrotherBot project.

b3::lib::feedparser Namespace Reference

Classes

class  ThingsNobodyCaresAboutButMe
class  CharacterEncodingOverride
class  CharacterEncodingUnknown
class  NonXMLContentType
class  UndeclaredNamespace
class  FeedParserDict
class  _FeedParserMixin
class  _StrictFeedParser
class  _BaseHTMLProcessor
class  _LooseFeedParser
class  _RelativeURIResolver
class  _HTMLSanitizer
class  _FeedURLHandler

Functions

def _xmlescape
def dict
def zopeCompatibilityHack
def _ebcdic_to_ascii
def _urljoin
def _resolveRelativeURIs
def _sanitizeHTML
def _open_resource
def registerDateHandler
def _parse_date_iso8601
def _parse_date_onblog
def _parse_date_nate
def _parse_date_mssql
def _parse_date_greek
def _parse_date_hungarian
def _parse_date_w3dtf
def _parse_date_rfc822
def _parse_date
def _getCharacterEncoding
def _toUTF8
def _stripDoctype
def parse

Variables

string __version__ = "4.1"
string __license__
string __author__ = "Mark Pilgrim <http://diveintomark.org/>"
list __contributors__
int _debug = 0
string USER_AGENT = "UniversalFeedParser/%s +http://feedparser.org/"
string ACCEPT_HEADER = "application/atom+xml,application/rdf+xml,application/rss+xml,application/x-netcdf,application/xml;q=0.9,text/xml;q=0.2,*/*;q=0.1"
list PREFERRED_XML_PARSERS = ["drv_libxml2"]
int TIDY_MARKUP = 0
list PREFERRED_TIDY_INTERFACES = ["uTidy", "mxTidy"]
 gzip = None
 zlib = None
int _XML_AVAILABLE = 1
 base64 = binascii = None
 chardet = None
dictionary SUPPORTED_VERSIONS
 UserDict = dict
 _ebcdic_to_ascii_map = None
tuple _urifixer = re.compile('^([A-Za-z][A-Za-z0-9+-.]*://)(/*)(.*?)')
list _date_handlers = []
list _iso8601_tmpl
list _iso8601_re
list _iso8601_matches = [re.compile(regex).match for regex in _iso8601_re]
string _korean_year = u'\ub144'
string _korean_month = u'\uc6d4'
string _korean_day = u'\uc77c'
string _korean_am = u'\uc624\uc804'
string _korean_pm = u'\uc624\ud6c4'
 _korean_onblog_date_re = \
 _korean_nate_date_re = \
 _mssql_date_re = \
 _greek_months = \
 _greek_wdays = \
 _greek_date_format_re = \
 _hungarian_months = \
 _hungarian_date_format_re = \
dictionary _additional_timezones = {'AT': -400, 'ET': -500, 'CT': -600, 'MT': -700, 'PT': -800}
list urls = sys.argv[1:]
tuple result = parse(url)

Function Documentation

def b3::lib::feedparser::_ebcdic_to_ascii (   s) [private]
def b3::lib::feedparser::_getCharacterEncoding (   http_headers,
  xml_data 
) [private]
Get the character encoding of the XML document

http_headers is a dictionary
xml_data is a raw string (not Unicode)

This is so much trickier than it sounds, it's not even funny.
According to RFC 3023 ('XML Media Types'), if the HTTP Content-Type
is application/xml, application/*+xml,
application/xml-external-parsed-entity, or application/xml-dtd,
the encoding given in the charset parameter of the HTTP Content-Type
takes precedence over the encoding given in the XML prefix within the
document, and defaults to 'utf-8' if neither is specified.  But, if
the HTTP Content-Type is text/xml, text/*+xml, or
text/xml-external-parsed-entity, the encoding given in the XML prefix
within the document is ALWAYS IGNORED and only the encoding given in
the charset parameter of the HTTP Content-Type header should be
respected, and it defaults to 'us-ascii' if not specified.

Furthermore, discussion on the atom-syntax mailing list with the
author of RFC 3023 leads me to the conclusion that any document
served with a Content-Type of text/* and no charset parameter
must be treated as us-ascii.  (We now do this.)  And also that it
must always be flagged as non-well-formed.  (We now do this too.)

If Content-Type is unspecified (input was local file or non-HTTP source)
or unrecognized (server just got it totally wrong), then go by the
encoding given in the XML prefix of the document and default to
'iso-8859-1' as per the HTTP specification (RFC 2616).

Then, assuming we didn't find a character encoding in the HTTP headers
(and the HTTP Content-type allowed us to look in the body), we need
to sniff the first few bytes of the XML data and try to determine
whether the encoding is ASCII-compatible.  Section F of the XML
specification shows the way here:
http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info

If the sniffed encoding is not ASCII-compatible, we need to make it
ASCII compatible so that we can sniff further into the XML declaration
to find the encoding attribute, which will tell us the true encoding.

Of course, none of this guarantees that we will be able to parse the
feed in the declared character encoding (assuming it was declared
correctly, which many are not).  CJKCodecs and iconv_codec help a lot;
you should definitely install them if you can.
http://cjkpython.i18n.org/
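
The precedence rules described above can be summarized in a few lines. This is only a sketch of the decision table, not the module's implementation: choose_encoding and its three arguments are hypothetical names, the byte sniffing step is left out, and the caller is assumed to have already extracted the charset parameter and the XML declaration's encoding.

```python
# Media types where the HTTP charset wins but the XML declaration is
# still used as a fallback.
APPLICATION_TYPES = ("application/xml", "application/xml-dtd",
                     "application/xml-external-parsed-entity")
# Media types where the XML declaration is ignored entirely.
TEXT_XML_TYPES = ("text/xml", "text/xml-external-parsed-entity")


def choose_encoding(content_type, http_charset, xml_encoding):
    """Return the encoding to trust (hypothetical helper).

    content_type  -- media type from the HTTP header, or None
    http_charset  -- charset parameter of Content-Type, or None
    xml_encoding  -- encoding from the XML declaration, or None
    """
    ctype = (content_type or "").strip().lower()
    is_application_xml = ctype in APPLICATION_TYPES or (
        ctype.startswith("application/") and ctype.endswith("+xml"))
    is_text_xml = ctype in TEXT_XML_TYPES or (
        ctype.startswith("text/") and ctype.endswith("+xml"))
    if is_application_xml:
        return http_charset or xml_encoding or "utf-8"
    if is_text_xml or ctype.startswith("text/"):
        # The XML declaration is never consulted for text/* types.
        return http_charset or "us-ascii"
    # Content-Type missing or unrecognized: trust the document itself.
    return xml_encoding or "iso-8859-1"
```

For example, choose_encoding("text/xml", None, "utf-8") yields 'us-ascii', because the declaration inside the document is ignored for text/xml.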
def b3::lib::feedparser::_open_resource (   url_file_stream_or_string,
  etag,
  modified,
  agent,
  referrer,
  handlers 
) [private]
URL, filename, or string --> stream

This function lets you define parsers that take any input source
(URL, pathname to local or network file, or actual data as a string)
and deal with it in a uniform manner.  Returned object is guaranteed
to have all the basic stdio read methods (read, readline, readlines).
Just .close() the object when you're done with it.

If the etag argument is supplied, it will be used as the value of an
If-None-Match request header.

If the modified argument is supplied, it must be a tuple of 9 integers
as returned by gmtime() in the standard Python time module. This MUST
be in GMT (Greenwich Mean Time). The formatted date/time will be used
as the value of an If-Modified-Since request header.

If the agent argument is supplied, it will be used as the value of a
User-Agent request header.

If the referrer argument is supplied, it will be used as the value of a
Referer[sic] request header.

If handlers is supplied, it is a list of handlers used to build a
urllib2 opener.
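
The header handling described above might look roughly like this in Python 3, where urllib2 has become urllib.request. This is a sketch under assumptions, not the module's code: build_request and build_opener are hypothetical helpers, and email.utils.formatdate is used here as one way to format the date.

```python
import calendar
import email.utils
import urllib.request


def build_request(url, etag=None, modified=None, agent=None, referrer=None):
    """Build a Request carrying the optional headers described above."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    if modified:
        # modified is a 9-tuple in GMT, as returned by time.gmtime().
        # formatdate() emits English day and month names regardless of locale.
        timestamp = calendar.timegm(modified)
        request.add_header("If-Modified-Since",
                           email.utils.formatdate(timestamp, usegmt=True))
    if agent:
        request.add_header("User-Agent", agent)
    if referrer:
        request.add_header("Referer", referrer)
    return request


def build_opener(handlers=()):
    """Build an opener from a list of urllib handler instances or classes."""
    return urllib.request.build_opener(*handlers)
```

Building the request does not touch the network. The request is only sent when it is passed to the opener's open() method.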
def b3::lib::feedparser::_parse_date (   dateString) [private]
Parses a variety of date formats into a 9-tuple in GMT
def b3::lib::feedparser::_parse_date_greek (   dateString) [private]
Parse a string according to a Greek 8-bit date format.
def b3::lib::feedparser::_parse_date_hungarian (   dateString) [private]
Parse a string according to a Hungarian 8-bit date format.
def b3::lib::feedparser::_parse_date_iso8601 (   dateString) [private]
Parse a variety of ISO-8601-compatible formats like 20040105
def b3::lib::feedparser::_parse_date_mssql (   dateString) [private]
Parse a string according to the MS SQL date format
def b3::lib::feedparser::_parse_date_nate (   dateString) [private]
Parse a string according to the Nate 8-bit date format
def b3::lib::feedparser::_parse_date_onblog (   dateString) [private]
Parse a string according to the OnBlog 8-bit date format
def b3::lib::feedparser::_parse_date_rfc822 (   dateString) [private]
Parse an RFC822, RFC1123, RFC2822, or asctime-style date
def b3::lib::feedparser::_parse_date_w3dtf (   dateString) [private]
def b3::lib::feedparser::_resolveRelativeURIs (   htmlSource,
  baseURI,
  encoding 
) [private]
def b3::lib::feedparser::_sanitizeHTML (   htmlSource,
  encoding 
) [private]
def b3::lib::feedparser::_stripDoctype (   data) [private]
Strips DOCTYPE from XML document, returns (rss_version, stripped_data)

rss_version may be 'rss091n' or None
stripped_data is the same XML document, minus the DOCTYPE
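
A minimal sketch of the behaviour described above, assuming the Netscape RSS 0.91 variant is recognized by the word "netscape" appearing in the DOCTYPE. The name strip_doctype and the regular expression are illustrative. A DOCTYPE with an internal subset containing '>' is not handled here.

```python
import re

# Matches a simple DOCTYPE declaration with no internal subset.
DOCTYPE_RE = re.compile(r"<!DOCTYPE([^>]*?)>", re.MULTILINE)


def strip_doctype(data):
    """Return (rss_version, data without its DOCTYPE declaration)."""
    match = DOCTYPE_RE.search(data)
    doctype = match.group(1) if match else ""
    # Assumption: only the Netscape DTD identifies a version this way.
    version = "rss091n" if "netscape" in doctype.lower() else None
    return version, DOCTYPE_RE.sub("", data)
```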
def b3::lib::feedparser::_toUTF8 (   data,
  encoding 
) [private]
Changes an XML data stream on the fly to specify a new encoding

data is a raw sequence of bytes (not Unicode) that is presumed to be in the given encoding already
encoding is a string recognized by encodings.aliases
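
The idea can be sketched as follows: decode the bytes, replace (or add) the XML declaration so that it names utf-8, and encode again. This is an illustration under assumptions, not the module's code. The name to_utf8 is hypothetical and byte order marks are not handled.

```python
import re

# An XML declaration at the very start of the document.
XML_DECL_RE = re.compile(r"^<\?xml\s[^>]*?>")
NEW_DECL = "<?xml version='1.0' encoding='utf-8'?>"


def to_utf8(data, encoding):
    """Re-encode XML bytes as UTF-8 and make the declaration agree."""
    text = data.decode(encoding)
    if XML_DECL_RE.match(text):
        text = XML_DECL_RE.sub(NEW_DECL, text, count=1)
    else:
        # No declaration present: add one so the parser sees utf-8.
        text = NEW_DECL + "\n" + text
    return text.encode("utf-8")
```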
def b3::lib::feedparser::_urljoin (   base,
  uri 
) [private]
def b3::lib::feedparser::_xmlescape (   data) [private]
def b3::lib::feedparser::dict (   aList)
def b3::lib::feedparser::parse (   url_file_stream_or_string,
  etag = None,
  modified = None,
  agent = None,
  referrer = None,
  handlers = [] 
)
Parse a feed from a URL, file, stream, or string
def b3::lib::feedparser::registerDateHandler (   func)
Register a date handler function (takes string, returns 9-tuple date in GMT)
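
The handler mechanism can be illustrated with a small self-contained registry. This is a sketch, not the module's code: the names below are hypothetical, and it assumes that newly registered handlers are tried first and that a handler signals failure by raising or returning None.

```python
import time

_handlers = []


def register_date_handler(func):
    """Put func at the front so the newest handler is tried first."""
    _handlers.insert(0, func)


def parse_date(date_string):
    """Return the first valid 9-tuple produced by a handler, else None."""
    for handler in _handlers:
        try:
            result = handler(date_string)
        except Exception:
            continue  # this handler does not understand the format
        if result is not None and len(result) == 9:
            return tuple(result)
    return None


def parse_compact_date(date_string):
    """Hypothetical handler for dates like 20040105 (assumed to be GMT)."""
    return time.strptime(date_string, "%Y%m%d")


register_date_handler(parse_compact_date)
```

A handler only needs to accept a string and return a 9-tuple, so support for a site-specific format can be added without touching the existing parsers.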
def b3::lib::feedparser::zopeCompatibilityHack ( )

Variable Documentation

string b3::lib::feedparser::__author__ = "Mark Pilgrim <http://diveintomark.org/>"
list b3::lib::feedparser::__contributors__
Initial value:
["Jason Diamond <http://injektilo.org/>",
 "John Beimler <http://john.beimler.org/>",
 "Fazal Majid <http://www.majid.info/mylos/weblog/>",
 "Aaron Swartz <http://aaronsw.com/>",
 "Kevin Marks <http://epeus.blogspot.com/>"]
string b3::lib::feedparser::__license__
Initial value:
"""Copyright (c) 2002-2006, Mark Pilgrim, All rights reserved.

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice,
  this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 'AS IS'
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE."""
dictionary b3::lib::feedparser::_additional_timezones = {'AT': -400, 'ET': -500, 'CT': -600, 'MT': -700, 'PT': -800}
list b3::lib::feedparser::_iso8601_matches = [re.compile(regex).match for regex in _iso8601_re]
list b3::lib::feedparser::_iso8601_re
Initial value:
[
    tmpl.replace(
    'YYYY', r'(?P<year>\d{4})').replace(
    'YY', r'(?P<year>\d\d)').replace(
    'MM', r'(?P<month>[01]\d)').replace(
    'DD', r'(?P<day>[0123]\d)').replace(
    'OOO', r'(?P<ordinal>[0123]\d\d)').replace(
    'CC', r'(?P<century>\d\d$)')
    + r'(T?(?P<hour>\d{2}):(?P<minute>\d{2})'
    + r'(:(?P<second>\d{2}))?'
    + r'(?P<tz>[+-](?P<tzhour>\d{2})(:(?P<tzmin>\d{2}))?|Z)?)?'
    for tmpl in _iso8601_tmpl]
list b3::lib::feedparser::_iso8601_tmpl
Initial value:
['YYYY-?MM-?DD', 'YYYY-MM', 'YYYY-?OOO',
 'YY-?MM-?DD', 'YY-?OOO', 'YYYY',
 '-YY-?MM', '-OOO', '-YY',
 '--MM-?DD', '--MM',
 '---DD',
 'CC', '']
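
To see how a date template becomes a regular expression, the substitutions listed in this section can be applied to a single template and matched against sample dates. The helper name expand is an assumption made for this illustration. The replacement strings are the ones shown in the initial values here.

```python
import re


def expand(tmpl):
    """Turn one date template into a regular expression string."""
    date_part = (tmpl
                 .replace('YYYY', r'(?P<year>\d{4})')
                 .replace('YY', r'(?P<year>\d\d)')
                 .replace('MM', r'(?P<month>[01]\d)')
                 .replace('DD', r'(?P<day>[0123]\d)')
                 .replace('OOO', r'(?P<ordinal>[0123]\d\d)')
                 .replace('CC', r'(?P<century>\d\d$)'))
    # Optional time of day and time zone, appended to every template.
    time_part = (r'(T?(?P<hour>\d{2}):(?P<minute>\d{2})'
                 r'(:(?P<second>\d{2}))?'
                 r'(?P<tz>[+-](?P<tzhour>\d{2})(:(?P<tzmin>\d{2}))?|Z)?)?')
    return date_part + time_part


match = re.compile(expand('YYYY-?MM-?DD')).match('2004-01-05T12:30:00Z')
fields = match.groupdict()
```

Because the hyphens are optional in this template, the compact form 20040105 matches as well. In that case the time groups are simply None.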
string b3::lib::feedparser::_korean_am = u'\uc624\uc804'
string b3::lib::feedparser::_korean_pm = u'\uc624\ud6c4'
tuple b3::lib::feedparser::_urifixer = re.compile('^([A-Za-z][A-Za-z0-9+-.]*://)(/*)(.*?)')
string b3::lib::feedparser::ACCEPT_HEADER = "application/atom+xml,application/rdf+xml,application/rss+xml,application/x-netcdf,application/xml;q=0.9,text/xml;q=0.2,*/*;q=0.1"
tuple b3::lib::feedparser::result = parse(url)
dictionary b3::lib::feedparser::SUPPORTED_VERSIONS
Initial value:
{'': 'unknown',
 'rss090': 'RSS 0.90',
 'rss091n': 'RSS 0.91 (Netscape)',
 'rss091u': 'RSS 0.91 (Userland)',
 'rss092': 'RSS 0.92',
 'rss093': 'RSS 0.93',
 'rss094': 'RSS 0.94',
 'rss20': 'RSS 2.0',
 'rss10': 'RSS 1.0',
 'rss': 'RSS (unknown version)',
 'atom01': 'Atom 0.1',
 'atom02': 'Atom 0.2',
 'atom03': 'Atom 0.3',
 'atom10': 'Atom 1.0',
 'atom': 'Atom (unknown version)',
 'cdf': 'CDF',
 'hotrss': 'Hot RSS'
 }
list b3::lib::feedparser::urls = sys.argv[1:]
string b3::lib::feedparser::USER_AGENT = "UniversalFeedParser/%s +http://feedparser.org/"