
How do I? a = b + c in Python.

Discussion in 'General Programming Chat' started by elavmunretea, Jan 22, 2017.

  1. elavmunretea (Elite Member)

    Hi there,

    For my first Python project, I made a web scraper (like a sitemap).

    I then decided to modify it into a range of Instagram bots. The one I'm currently setting up scrapes people who posted with a specific hashtag.

    I have it fully working, but the user has to enter the full URL every time, whereas I only want them to enter the hashtag.

    The code I currently have looks like this:
    Code:
    tag = input("Please enter a Tag:")
    url = "https://www.instagram.com/explore/tags/" + tag
    but it doesn't work. I have tried a lot of things, like:
    Code:
    tag = input("Please enter a Tag:")
    tagurl = "https://www.instagram.com/explore/tags/"
    url = ['tagurl', 'tag']   # this makes a two-element list of the strings 'tagurl' and 'tag', not a URL
    And a bunch of other stuff, changing the spacing and so on, but I can't get it to work.
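    A minimal sketch of the concatenation, assuming Python 2 (which the later code in this thread is, given the print statements and the urlparse/mechanize imports). Under Python 2, input() evaluates whatever the user types, so raw_input() is usually the call you want here:
    Code:
    # Sketch, assuming Python 2: raw_input() returns the typed text as a string,
    # and .strip() removes any stray leading/trailing whitespace.
    tag = raw_input("Please enter a Tag: ").strip()
    url = "https://www.instagram.com/explore/tags/" + tag
    print url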

    I would really appreciate the help on this one. I could even pay if you really want.
     
    • Thanks x 1
  2. bartosimpsonio (Jr. VIP, Premium Member)

    The return from input probably includes the newline character. Are you cleaning up the return so it has no invisible stuff after it?
     
  3. elavmunretea (Elite Member)

    Something like tag.strip() ?
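    For reference, str.strip() with no arguments removes leading and trailing whitespace, including any trailing newline:
    Code:
    # str.strip() drops leading/trailing whitespace, newlines included
    tag = "cats\n"
    print repr(tag.strip())   # prints 'cats'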
     
  4. bartosimpsonio (Jr. VIP, Premium Member)

  5. elavmunretea (Elite Member)

    It's really weird.

    Before, it was giving me an error, but now
    Code:
    tag = input("Please enter a Tag:")
    url = "https://www.instagram.com/explore/tags/" + tag
    works fine. Maybe there was an error with saving the output to a file or something, I don't know.

    Anyway, thanks for the help. All that's left to do is a simple GUI now :)
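    If it helps, a bare-bones Tkinter window with a Stop button might be enough to start from. This is just a sketch; stop_scrape is a placeholder name, not part of the script above:
    Code:
    # Hypothetical GUI sketch using Tkinter from the standard library (Python 2 spelling).
    import Tkinter as tk

    def stop_scrape():
        # in the real script this would end the crawl and kick off the cleanup step
        print "Stop clicked"

    root = tk.Tk()
    root.title("Instagram Scraper")
    tk.Button(root, text="Stop", command=stop_scrape).pack(padx=20, pady=20)
    root.mainloop()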
     
    • Thanks x 1
  6. bartosimpsonio (Jr. VIP, Premium Member)

    Nice. You using Scrapy?
     
  7. elavmunretea (Elite Member)

    Just something from a YouTube tutorial I watched on scraping OBLs (outbound links). The code looks like this:
    Code:
    # Python 2 code (note the print statements and the urlparse module)
    import urllib
    from bs4 import BeautifulSoup
    import urlparse
    import mechanize
    import csv

    tag = input("Please enter a Tag:")
    url = "https://www.instagram.com/explore/tags/" + tag
    #url = "https://www.instagram.com/explore/tags/example"
    br = mechanize.Browser()
    urls = [url]      # crawl queue
    visited = [url]   # everything already seen, to avoid revisiting
    while len(urls) > 0:
        try:
            br.open(urls[0])
            urls.pop(0)
            for link in br.links():
                # resolve relative links and rebuild a clean host + path URL
                newurl = urlparse.urljoin(link.base_url, link.url)
                b1 = urlparse.urlparse(newurl).hostname
                b2 = urlparse.urlparse(newurl).path
                newurl = "http://" + b1 + b2

                # only follow unseen links on the same host
                if newurl not in visited and urlparse.urlparse(url).hostname in newurl:
                    urls.append(newurl)
                    visited.append(newurl)
                    print newurl

        except:
            print "error"
            urls.pop(0)

    Then I remove all lines containing /p/ (picture links), /explore/ (other hashtags), /locations/, "error" (written by the error handler in the script), and other non-user URLs like:
    http://www.instagram.com/accounts/
    http://www.instagram.com/about/
    http://www.instagram.com/press/
    http://www.instagram.com/developer/
    http://www.instagram.com/legal/privacy/
    http://www.instagram.com/legal/terms/
    http://www.instagram.com/about/directory/
    http://www.instagram.com/download/instagram/

    I'm setting it up to change dynamically depending on whether you want to scrape followers, likers, posters to a hashtag, posters to a location, etc.

    It's a good first main project, as there's scope to add other useful things like CSV output, proxy rotation, multi-threading, etc.
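    As a rough illustration of the CSV idea only (not the poster's code), the csv module that's already imported could dump the visited list once the crawl loop finishes; the filename and header are arbitrary placeholders:
    Code:
    # Sketch: write the collected URLs to a CSV file after the crawl above ends.
    # "visited" is the list built by the crawler; 'wb' is the right mode for Python 2's csv.
    with open(tag + '.csv', 'wb') as f:
        writer = csv.writer(f)
        writer.writerow(['url'])          # header row
        for u in visited:
            writer.writerow([u])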
     
    • Thanks x 1
  8. MoneyEagle (Regular Member)

    Great going man!
     
  9. bartosimpsonio (Jr. VIP, Premium Member)

    Pretty cool stuff. Will play with this later ;)
     
  10. elavmunretea (Elite Member)

    Thanks.
    Yeah it's really interesting!

    It's 3am here, so I don't have time to add a GUI today.

    Here's the code so far:

    Code:
    import urllib
    from bs4 import BeautifulSoup
    import urlparse
    import mechanize
    import csv

    print("What would you like to scrape from?")
    print("a) Hashtag")
    print("b) Other")
    answer = input("Make your choice: ")

    if answer == "a":
        urlb = input("Please enter a Tag:")
        urla = "https://www.instagram.com/explore/tags/"
    else:
        print("Coming Soon")
        raise SystemExit   # exit here, otherwise urla/urlb would be undefined below

    url = urla + urlb
    file = open(urlb + '.txt', 'w')   # output file named after the tag
    br = mechanize.Browser()
    urls = [url]      # crawl queue
    visited = [url]   # everything already seen
    while len(urls) > 0:
        try:
            br.open(urls[0])
            urls.pop(0)
            for link in br.links():
                newurl = urlparse.urljoin(link.base_url, link.url)
                b1 = urlparse.urlparse(newurl).hostname
                b2 = urlparse.urlparse(newurl).path
                newurl = "http://" + b1 + b2

                if newurl not in visited and urlparse.urlparse(url).hostname in newurl:
                    urls.append(newurl)
                    visited.append(newurl)
                    file.write(newurl + "\n")
        except:
            file.write("error\n")
            urls.pop(0)

    file.close()

    print "Complete"

    I added file output with a custom name dependent on the tag. I was going to do locations as well, but unfortunately each location has a number (i.e. ig.com/location/1234/newyork) and I haven't found a way to get that easily.

    I set up this script, which will be activated when the user clicks "Stop" on the GUI (which I'll make tomorrow):


    Code:
    bad_words = ['error', '/p/', 'explore', 'locations', 'accounts', 'about', 'press', 'developer', 'legal', 'download']

    with open(urlb + '.txt') as oldfile, open(urlb + 'final.txt', 'w') as newfile:
        for line in oldfile:
            if not any(bad_word in line for bad_word in bad_words):
                newfile.write(line)

    Then I will remove the "https://instagram.com/" from the start and the "/" from the end.
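    A rough sketch of that last cleanup step (note the crawler above actually writes http://www.instagram.com/... URLs, so that is the prefix assumed here):
    Code:
    # Sketch: reduce each filtered line to a bare username.
    # The crawler writes URLs like http://www.instagram.com/username/
    with open(urlb + 'final.txt') as f:
        usernames = [line.strip()
                         .replace("http://www.instagram.com/", "")
                         .rstrip("/")
                     for line in f]
    print usernames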
     
    • Thanks x 1
  11. ѕмarтgυy (Newbie)

    Better to do it like this:
    Code:
    tag = input("Please enter a Tag:")
    url = "https://www.instagram.com/explore/tags/{}".format(tag)
     
    • Thanks x 3
  12. gman777 (Jr. VIP)

    So you use BeautifulSoup to scrape data. Nice, man. This really motivated me to get back into Python.

    The concepts used in your program don't seem difficult to understand.

    I was planning to create a bot that scrapes the first 10 Google results for a particular keyword and checks whether the keyword appears in the title, description, permalink, etc., plus other factors, and then turns those values into my own metric... yeah, that would be cool.
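    A rough sketch of the scoring part of that idea (fetching the actual Google results is left out; the weights and the example result are placeholders):
    Code:
    # Sketch of the keyword-scoring idea only, in the same Python 2 style as the thread.
    # Assume each result is a (title, description, url) tuple obtained elsewhere.
    def score_result(keyword, title, description, url):
        keyword = keyword.lower()
        score = 0
        if keyword in title.lower():
            score += 3        # weights are arbitrary placeholders
        if keyword in description.lower():
            score += 2
        if keyword in url.lower():
            score += 1
        return score

    results = [("Example keyword title", "Example description", "http://example.com/keyword")]
    for title, desc, link in results:
        print link, score_result("keyword", title, desc, link)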

    Anyway, thanks...
     
    • Thanks x 1