
Python POST Request Not Returning HTML, Requesting JavaScript Be Enabled

I'm trying to sign in to my Wells Fargo account and scrape my transaction history so that I can use it to track my finances. I am able to do the scraping part once I can get to the page, but my POST request doesn't return the HTML I expect; instead it returns a page asking for JavaScript to be enabled.

Solution 1:

I know that a great deal of time has passed on this, but I can give some closure here. What you're seeing is bot-defeat code sold by the good fellows at F5 Networks, Inc., designed to prevent naive webcrawlers and scrapers from being able to access sites that use it.

Briefly, this is obfuscated JavaScript which calculates a value through a series of iterative steps that exercise various browser-specific JavaScript capabilities, and which makes use of some rather rude JavaScript language behavior. That value is sent back to Wells Fargo as cookies and as part of the web forms required for navigation. Just using a headless browser is not going to cut it - there are a few tricks in the calculation designed specifically to counter headless browsers and the JavaScript engines that work with them. Missing any of the tricks will not cause any sort of failure; instead, it will just throw off the end result in a way that makes it difficult to tell what you missed.

It is, in theory, possible to decipher the code and emulate all the calculations in the language of your choice; I know of a successful countermeasure written by a data aggregation company, but the code is not open for public perusal. Alternatively, you could figure out what you need to execute it as-is in a JS interpreter. I don't remember all the details, but it's easier than it looks: you don't need to reverse engineer the whole thing, you just need to run it in the right environment. That means a dummy window object, plus stubs for whatever else the code looks for in your environment, such as navigator.userAgent, and possibly a few other things.
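
As a very rough sketch of that "run it in the right environment" idea - assuming the py_mini_racer package (an embeddable V8 engine; any JS engine you can drive from Python would do), and with the stubbed properties, file name, and final expression all being illustrative placeholders rather than anything the real script uses:

# Sketch only: evaluate an obfuscated challenge script inside a stubbed,
# browser-like environment. pip install py-mini-racer.
from py_mini_racer import MiniRacer

ctx = MiniRacer()

# Stub just enough of a browser for the script to run without throwing.
# Real bot-defeat scripts probe many more properties; add stubs as you
# discover what the code reads (a wrong result, not an error, is the
# usual symptom of a missing stub).
ctx.eval("""
    var window = this;
    var navigator = { userAgent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64)" };
    var document = { cookie: "", referrer: "" };
    var screen = { width: 1920, height: 1080, colorDepth: 24 };
""")

# The obfuscated challenge script, copied out of the login page's HTML.
with open("challenge.js") as f:
    ctx.eval(f.read())

# Read back whatever the script computed (here: anything it wrote to the
# fake document.cookie), then send it along with your requests session.
print(ctx.eval("document.cookie"))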

For practical purposes, it's probably not worth it to write a countermeasure. Ask to be whitelisted if you're an organization.

If you are interested in the challenge, here is a (perhaps obvious) starting point: the long string of gibberish in the eval((ie9rgb4=function (){var m='function () ... .slice ... portion is the ciphered code, and the for loop immediately following it performs the character transformations. Replicate the operation done in that loop (see the sketch below) to undo the first level of obfuscation. Log on to the site through your normal browser with a debugger active, observe the requests and cookies that are sent to get an idea of the final goal you're working toward, and try to correlate that with the JS code you see.
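
Purely as an illustration of what "replicate the operation" means - the real transformation has to be read off the page's for loop, and the offset arithmetic below is a made-up stand-in - the Python equivalent of such a loop might look like:

def decipher(ciphered, key=3):
    # Stand-in transformation: the page's for loop defines what actually
    # happens to each character code; copy that logic here instead.
    out = []
    for i, ch in enumerate(ciphered):
        out.append(chr(ord(ch) - ((key + i) % 7)))
    return ''.join(out)

# Paste the long quoted string from inside the eval(...) into this file.
with open('ciphered_blob.txt') as f:
    first_level = decipher(f.read())
print(first_level[:200])  # should start to look like readable JavaScript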

You may also find the following mapping of values useful at some point:

{"$$$", "7"},
{"$$$$", "f"},
{"$$$_", "e"},
{"$$_", "6"},
{"$$_$", "d"},
{"$$__", "c"},
{"$_", "constructor"},
{"$_$", "5"},
{"$_$$", "b"},
{"$_$_", "a"},
{"$__", "4"},
{"$__$", "9"},
{"$___", "8"},
{"_", "u"},
{"_$", "o"},
{"_$$", "3"},
{"_$_", "2"},
{"__", "t"},
{"__$", "1"},
{"___", "0"}

Solution 2:

This can be handled with Splash (a JavaScript rendering service, as an alternative to Selenium). Since I use Scrapy, I use scrapy-splash. In my Scrapy spider, Splash alone is not enough: the Splash request has to be helped along by a Lua script that grabs the cookies from the page, or it will still be blocked by the F5 security mechanism. After getting the cookies, re-request the page using the generated cookies, and done!

In Scrapy, the code looks something like this:

# Assumes scrapy-splash is installed and enabled in settings.py (see the
# sketch below) and that a Splash instance is running.
import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['...']  # the page protected by the F5 check

    def start_requests(self):
        # First pass: render the page in Splash and collect the cookies
        # generated by the page's JavaScript.
        lua_script = '''
        function main(splash)
          local url = splash.args.url
          assert(splash:go(url))
          assert(splash:wait(2))
          return {
            html = splash:html(),
            cookies = splash:get_cookies(),
          }
        end
        '''
        yield SplashRequest(self.start_urls[0], self.parse,
                endpoint='execute',
                args={'wait': 1, 'lua_source': lua_script})

    def parse(self, response):
        # Second pass: re-request the page, initialising Splash with the
        # cookies collected on the first pass (available via response.data).
        lua_script = '''
        function main(splash)
          splash:init_cookies(splash.args.cookies)
          local url = splash.args.url
          assert(splash:go(url))
          assert(splash:wait(2))
          return {
            html = splash:html(),
          }
        end
        '''
        yield SplashRequest(self.start_urls[0], self.parse_result,
                endpoint='execute',
                args={'wait': 1, 'lua_source': lua_script,
                      'cookies': response.data['cookies']},
                dont_filter=True)

    def parse_result(self, response):
        # Do your scrapy parsing thing here
        pass
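
For completeness: this assumes a Splash instance is actually running (for example via Docker on localhost:8050) and that scrapy-splash is wired into settings.py roughly as its README describes:

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'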

Solution 3:

Some websites that make use of JavaScript can't be scraped just by downloading the HTML and passing it to an HTML parser, because the content is simply not there. Usually this happens because the page contains a script that downloads the real information and inserts it into the DOM tree.

In these cases it's not enough to download the page; you need a web browser engine with JavaScript support that you can control from Python.

Here is a list of projects you could use for this, covering several programming languages: https://github.com/dhamaniasad/HeadlessBrowsers. I have worked with Selenium and it works fine, but I am not sure about its support for Python 3.5.
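
A minimal sketch with Selenium and headless Chrome (assuming the selenium package and a matching ChromeDriver are installed; the URL is a placeholder). Keep in mind the caveat from Solution 1: this works for ordinary JS-rendered pages, but scripts like the one in question are built to detect headless browsers, so it may not be enough here.

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/page-that-needs-javascript")
    html = driver.page_source  # DOM after the page's scripts have run
finally:
    driver.quit()

# hand `html` to your usual parser, e.g. BeautifulSoup(html, "html.parser")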
