[HAF] Scrape/Parse this page - get $100 instantly + job offer (Python/C#/Node.js)

yakuzaemme

I'm looking for another talented developer to join the team, and thought I'd make the test challenge public, both so others can learn and to make it a bit of fun as well.

The test is very straightforward:
1. Parse business data from URL listed below
2. No headless browser, strictly DOM parsing
3. First to publicly (in this thread) post code + result wins $100 + job offer invitation

URL: view-source:https://www.google.com/maps?cid=6798692991630522225

Business data:
  • Name (Gillinge Mc-skola)
  • Street (Gillinge 55)
  • Postal code / Zip code (186 91)
  • City (Vallentuna)
  • Website (gillingemcskola.se)
  • Phone number (08-32 10 01)
  • Opening Hours (Monday => 13-16, Tuesday => 13-16 etc)
To be extremely clear:
- Do not use Selenium or similar. It is strictly DOM parsing (what you can see when you do view-source)
- The data is embedded in nested JSON arrays
- Search for 'Monday' within the page source to find the business object
- Obviously, the code should work for other business ids (so don't hardcode the values)
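
One way to follow the 'Monday' hint, once the page's JSON has been decoded into nested lists, is a recursive search that reports the index path to the match. This is only a minimal sketch: `find_path` is a hypothetical helper, and `sample` is a toy structure, not Google's real layout.

```python
from typing import Any, List, Optional


def find_path(node: Any, needle: str, path: Optional[List[int]] = None) -> Optional[List[int]]:
    """Depth-first search through nested lists.

    Returns the index path to the first string equal to `needle`,
    or None if no such string exists.
    """
    path = path or []
    if isinstance(node, str):
        return path if node == needle else None
    if isinstance(node, list):
        for i, child in enumerate(node):
            found = find_path(child, needle, path + [i])
            if found is not None:
                return found
    return None


# Toy stand-in for the decoded page data (assumption, NOT the real structure).
sample = [None, ["Gillinge Mc-skola", [None, [["Monday", ["13-16"]], ["Tuesday", ["13-16"]]]]]]
print(find_path(sample, "Monday"))  # -> [1, 1, 1, 0, 0]
```

Dropping the last few indices of the returned path walks back up to the enclosing business object, which is where the other fields would be expected to live.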

The winner will be the one that posts complete working code (in Python, C#, Node.js) + results as JSON.
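
For illustration, the JSON result for the example listing could look roughly like what the sketch below prints. The key names are an assumption (the post does not prescribe them), and only the two opening-hour entries quoted above are included.

```python
import json

# Values are the ones listed in the post; key names are hypothetical.
result = {
    "name": "Gillinge Mc-skola",
    "street": "Gillinge 55",
    "zip": "186 91",
    "city": "Vallentuna",
    "website": "gillingemcskola.se",
    "phone": "08-32 10 01",
    # Only the days quoted in the post; the rest are elided there.
    "openHours": {"Monday": "13-16", "Tuesday": "13-16"},
}

print(json.dumps(result, ensure_ascii=False, indent=2))
```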

You will get $100 transferred by PayPal / Payoneer / Bitcoin / Bank transfer + a job offer for continuous work + show off to the rest of BHW members ;)

Edit: And if not clear already, the budget is $100 (the prize).

Good luck!
 
Sounds interesting,
I can do it in node.js
Can you share more details about the job offer?
 
The job offer will be separate - you'll have to complete this HAF-task beforehand. This is to save both of our time to ensure that whoever gets the job offer is skilled enough.
 
Some more place IDs that you can use to test against:
10000019793623989616
10000040780233714605
10000086672503462587
10000592203216891301
10000963258458223218
10001142239793900479
1000116714141165284
10001269367349967376
10001443843767080547
10001443859171381242
10001448750397738234
10001511234001906756
1000156145038737246
10001565075152990459
10001697825982007962
10001975972171018625
1000221358549931801
10002496500171802795
10002508701108081043
10002668132963746796
10002840312897399938
10003104957344365081
10003112282200718213
10003240790534523167
10003309393667921689
100034021075390429
10003784888068078304
10003896028101446449
10003980495181432806
10004030020329006703
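
A small sketch of turning these IDs into test URLs. Note that several of them are larger than a signed 64-bit integer can hold, so it is probably safest to treat them as strings throughout. `build_url` is a hypothetical helper; the URL prefix is the one from the first post.

```python
BASE_URL = "https://www.google.com/maps?cid="


def build_url(cid: str) -> str:
    """Return the Maps URL for a CID, checking that it is all digits."""
    cid = cid.strip()
    if not cid.isdigit():
        raise ValueError(f"not a numeric CID: {cid!r}")
    return BASE_URL + cid


# Three IDs taken from the list above, kept as strings on purpose.
test_cids = ["10000019793623989616", "1000116714141165284", "100034021075390429"]
urls = [build_url(c) for c in test_cids]

for url in urls:
    print(url)
```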
 
Here is solution in python3:
Anyway, the solution doesn't work :D

It's not pretty; it's a quick and dirty hack to get things done ASAP. There is probably a better way to do it, but since you wanted it fast I went for the quickest way. Tested it on 5 random URLs from Maps.
 
I will have a look in the morning (10pm Sweden).

I took a quick peek: you don't have to handle the url2-5 cases, only URL type #1.
If you are able to clean up the code based on that, please do so; it will help me go through it line by line :p

Edit:
I ran it through a random ID (from the list above) and the code broke (self.get_phone())
https://i.gyazo.com/28aa0f56f300ad4ebdc3606f1a38e179.png
With another ID, it wasn't able to fetch website even though the listing had a website
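
Failures like these usually come from listings where a field is simply missing. A defensive accessor for nested lists is one way to avoid the crash. The sketch below is a generic illustration, not the submitted code: `dig` is a hypothetical helper and the sample listing is made up.

```python
from typing import Any, Sequence


def dig(data: Any, path: Sequence[int], default: Any = "") -> Any:
    """Follow `path` through nested lists.

    Returns `default` as soon as a level is missing, is not a list,
    or the final value is None.
    """
    current = data
    for index in path:
        if not isinstance(current, list) or index < 0 or index >= len(current):
            return default
        current = current[index]
    return default if current is None else current


# Made-up listing: index 1 (the "phone" slot here) is empty.
listing = ["Some Cafe", None, ["Main St 1", "123 45"]]

print(dig(listing, [2, 0]))      # street is present
print(dig(listing, [1, 0, 1]))   # phone slot is None, falls back to ""
print(dig(listing, [7]))         # index out of range, falls back to ""
```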
 
Updated version, tested on all google ids in this thread:
https://repl.it/repls/IrritatingSimpleDatabase#main.py

It's a mess written with one hand, but it works :P
 
After discussing with @satyr85 over Skype, I will deem his solution accepted.
We both agree it wasn't the best/most pretty solution, but as the code works and gets the job done, he is the winner.

Congratulations! :)

For reference, here is my solution in C#:
79dc07d80c62d1775e38d8c7a170d779.png



Usage:
f4a6de2320a6ef84f60666e63ce6aa71.png


If anyone else wishes to post their solution for sport, feel free!
 
This looks pretty interesting to be honest. I wish I had seen this thread earlier lol. Congrats to @satyr85. :)
 
I got as far as trying to get the address. There is a URL in the text response containing parts of the address, such as the postcode. I got the first part of the address earlier in the code, so I'm extracting all the URLs to match up the URL with the address.

Borrowed the regex, and meta identifier to save time, but it appears not.. :D
Gotta increase my skills as well.
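
The code for this approach is not preserved in the thread, so the following is only a guess at the general idea: pull every URL out of the response text and keep the ones that mention an address fragment already extracted. `urls_mentioning`, the regex, and the sample text are all assumptions.

```python
import re
from typing import List
from urllib.parse import unquote_plus

# Rough URL matcher: stops at whitespace, quotes, angle brackets and backslashes.
URL_RE = re.compile(r"https?://[^\s\"'<>\\]+")


def urls_mentioning(text: str, fragment: str) -> List[str]:
    """Return every URL in `text` whose decoded form contains `fragment`."""
    wanted = fragment.lower()
    return [u for u in URL_RE.findall(text) if wanted in unquote_plus(u).lower()]


# Made-up response text containing two URLs.
page = (
    '["https://example.com/a?q=Gillinge+55,+186+91+Vallentuna",'
    '"https://example.com/b?q=something+else"]'
)

print(urls_mentioning(page, "Gillinge 55"))
```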

Interesting approach. Similar to @satyr85.
Any specific reason why you wanted to parse the URL over the JSON?
 
Well, I was trying to extract the data in a different way as I found the JSON to be harder to deal with. It had the details I needed for the address.
Maybe I would have used regex to get the phone number / website name based on the location, definitely a possibility.
 
I know it is late, but this is the PHP version of the same. :D

Just did it to prove that PHP isn't inferior or anything. :p

Code:
<?php

class googleListingScraper
{
    public $cid;
    private $gMapUrl = 'https://www.google.com/maps?cid=';
    private $businessData;
    public function getName(): string
    {
        return $this->businessData[11];
    }
    public function getStreet(): string
    {
        return $this->businessData[2][0];
    }
    public function getCity(): string
    {
        return $this->businessData[82][3];
    }
    public function getZip(): string
    {
        return $this->businessData[2][1];
    }
    public function getPhone(): string
    {
        return !empty($this->businessData[178]) ? $this->businessData[178][0][1][0][0] : "";
    }
    public function getWebsite(): string
    {
        return !empty($this->businessData[7]) ? $this->businessData[7][1] : "";
    }
    public function getOpenHours(): array
    {
        $hours = [];
        if (!empty($this->businessData[34])) {
            foreach ($this->businessData[34][1] as $key => $value) {
                $hours[] = [
                    $value[0],
                    $value[1][0]
                ];
            }
        }
        return $hours;
    }
    private function parseAppOptionsObject($js): void
    {
        preg_match('~window\.APP_INITIALIZATION\_STATE\s*\=\s*(.*?\s*?)\s*;\s*window\.APP_FLAGS~smi', $js, $match);
        $this->businessData =  json_decode(mb_convert_encoding($match[1], "UTF-8", "auto"));
        $correctedData = json_decode(ltrim($this->businessData[3][6], ")]}' "));
        $this->businessData = $correctedData[6];
    }
    public function parseAppOptions(): void
    {
        $html = file_get_contents($this->gMapUrl . $this->cid);
        $dom = new \DOMDocument("1.0", "utf-8");
        $dom->preserveWhiteSpace = false;
        $dom->formatOutput = false;
        $dom->loadHTML($html);
        $scripts = $dom->getElementsByTagName('script');
        foreach ($scripts as $script) {
            if (strpos($script->nodeValue, 'window.APP_INITIALIZATION_STATE') !== false) {
                $this->parseAppOptionsObject($script->nodeValue);
                return;
            }
        }
    }
    public static function run(string $cid): array
    {
        $instance = new static();
        $instance->cid = $cid;
        $instance->parseAppOptions();
        return
            [
                'name' => $instance->getName(),
                'street' => $instance->getStreet(),
                'zip' => $instance->getZip(),
                'city' => $instance->getCity(),
                'phone' => $instance->getPhone(),
                'website' => $instance->getWebsite(),
                'openHours' => $instance->getOpenHours(),
            ];
    }
}
echo json_encode(googleListingScraper::run('6798692991630522225')); // cid passed as a string: many CIDs exceed PHP_INT_MAX

Last line is the call..

Pretty much 2 hr job... so there can be bugs...
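
For anyone who prefers Python, here is a rough port of the same idea. It is a sketch under the same assumptions as the PHP above: the array indices (11, 2, 82, 178, 7, 34) and the `[3][6]` / `[6]` hops are copied from that code and may no longer match the live page, and the function names are hypothetical.

```python
import json
import re
import urllib.request
from typing import Any, Dict, List

# Same idea as the PHP regex: capture the value assigned to APP_INITIALIZATION_STATE.
STATE_RE = re.compile(
    r"window\.APP_INITIALIZATION_STATE\s*=\s*(.*?)\s*;\s*window\.APP_FLAGS", re.S
)


def fetch_source(cid: str) -> str:
    """Download the raw page source (no browser involved)."""
    with urllib.request.urlopen("https://www.google.com/maps?cid=" + cid) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract_business(html: str) -> List[Any]:
    """Decode the outer state, then the inner JSON string it carries."""
    match = STATE_RE.search(html)
    if match is None:
        raise ValueError("APP_INITIALIZATION_STATE not found")
    state = json.loads(match.group(1))
    # The inner payload is a string prefixed with )]}' which must be stripped.
    inner = state[3][6].lstrip(")]}' \n")
    return json.loads(inner)[6]


def to_record(biz: List[Any]) -> Dict[str, Any]:
    """Map the positional array to named fields (indices from the PHP post)."""
    hours = biz[34][1] if biz[34] else []
    return {
        "name": biz[11],
        "street": biz[2][0],
        "zip": biz[2][1],
        "city": biz[82][3],
        "phone": biz[178][0][1][0][0] if biz[178] else "",
        "website": biz[7][1] if biz[7] else "",
        "openHours": [[day[0], day[1][0]] for day in hours],
    }
```

Usage would be along the lines of `to_record(extract_business(fetch_source("6798692991630522225")))`, with the CID kept as a string.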
 
Seeeeexy! I like your way of writing PHP - much better than my version would be for PHP!

Here is a vanilla JavaScript entry I got via mail (disqualified as he wasn't a member of BHW)
Very straightforward :D

46895a584a0ad5660604a1bc5362d40b.png
 
Thanks! Those 5 chars got me though. I made some naming mistakes, like parseAppOptionsObject, which should really have been named something related to Initialization. Plus I had set businessData too soon in line 50. Nothing that can't be improved. ;)
 