
Parsing Datafeeds

Discussion in 'Blogging' started by Tireswing, May 9, 2010.

  1. Tireswing

    Tireswing Newbie

    Joined:
    Mar 22, 2009
    Messages:
    9
    Likes Received:
    0
    Hey guys,

    Been playing with some datafeeds for a few new projects. Basically, I'd like to be able to match products from across vendors. The problem is, obviously, identifying product x at vendor y and at vendor z. At first, I was hopeful that the feeds would contain a manufacturer ID number or something, but most of the products I'm looking at don't have that.

    Is my only option to parse the titles and do my best to match them up?

    Anybody have any experience doing this and might be willing to share a few tips or tricks?
     
  2. netfish

    netfish Junior Member

    Joined:
    Mar 5, 2010
    Messages:
    106
    Likes Received:
    33
    Occupation:
    Software Engineering: Javascript, CSS, HTML5, PHP,
    Location:
    Baltimore, MD
    What format is the Datafeed in? RSS?

    There are libraries that handle that kind of work for you programmatically, e.g.:

    Code:
    http://simplepie.org/
     
    Last edited: May 9, 2010
  3. C.S.A.

    C.S.A. Junior Member

    Joined:
    Mar 29, 2010
    Messages:
    150
    Likes Received:
    242
    Do your best to match them programmatically, and then clean up your database to the required level of correctness using Amazon's Mechanical Turk. This is EXACTLY the job it was created for.

    Let me know if you need any help with getting that going, or if you really don't want to deal with it, we could probably work out something using my existing frameworks for interfacing with the service.
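
    The "match programmatically, then have people clean up the rest" idea above could be sketched roughly like this. It is only an illustration: the thresholds, the function names, and the use of Python's difflib are assumptions, not anything taken from the feeds or from Mechanical Turk itself. Pairs that land in the "review" bucket are the ones you would hand to human reviewers.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds; tune them against a hand-labelled sample.
AUTO_ACCEPT = 0.90
NEEDS_REVIEW = 0.60


def normalize(title):
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(title.lower().split())


def similarity(a, b):
    """Return a 0..1 similarity ratio between two titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def triage(title, candidates):
    """Classify the best candidate as 'match', 'review', or 'new'."""
    best, best_score = None, 0.0
    for candidate in candidates:
        score = similarity(title, candidate)
        if score > best_score:
            best, best_score = candidate, score
    if best_score >= AUTO_ACCEPT:
        return "match", best
    if best_score >= NEEDS_REVIEW:
        return "review", best  # queue this pair for a human to confirm
    return "new", None
```

    Only the middle bucket costs money to review, so the two thresholds control how much manual work you pay for.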
     
  4. Tireswing

    Tireswing Newbie

    Joined:
    Mar 22, 2009
    Messages:
    9
    Likes Received:
    0
    Okay, so it really seems like the only way is to match them up. The niche that I'm focusing on doesn't have a lot of changes year-to-year in terms of products, so I'm wondering what the best approach will be to creating a parser.

    At first I was thinking that I could check each product as I parse it from the feed against all the products that are currently stored, but I realize that would create a ridiculous number of MySQL queries with lots of overhead. Is there a better way to do this?
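
    One possible way around the query-per-product problem is to load the stored products once, index them in memory by a normalized key, and look each feed item up in that index. A minimal sketch, assuming titles are the only thing available to match on; match_key, build_index and split_feed are made-up names, and the key (sorted lowercase words) is just one crude choice:

```python
import re


def match_key(title):
    """Reduce a title to a crude key: lowercase alphanumeric words, sorted."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return " ".join(sorted(words))


def build_index(stored_products):
    """Map match key -> product id for every product already stored.

    stored_products is an iterable of (product_id, title) pairs, e.g. the
    result of a single SELECT run once before parsing a feed.
    """
    index = {}
    for product_id, title in stored_products:
        index.setdefault(match_key(title), product_id)
    return index


def split_feed(feed_titles, index):
    """Separate feed titles into known matches and unmatched leftovers."""
    matched, unmatched = [], []
    for title in feed_titles:
        product_id = index.get(match_key(title))
        if product_id is None:
            unmatched.append(title)
        else:
            matched.append((title, product_id))
    return matched, unmatched
```

    That is one query to load the index plus dictionary lookups, instead of one query per feed item. The unmatched leftovers still need a fuzzier pass.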
     
  5. Tireswing

    Tireswing Newbie

    Joined:
    Mar 22, 2009
    Messages:
    9
    Likes Received:
    0
    About 95% of the products I'm looking at will be unmatchable based on precise criteria. What I'm going to have to do is create a parser that breaks down the title of each item in the feed into the actual item name (hopefully) and the other, non-standardized descriptors.

    If you see my above post, I have a few ideas on how to do it, but I really don't know what will be most efficient or precise.

    The good news is that the feeds I'm using are pretty much internally consistent. Feeds from merchant x always have the title formatted one way and feeds from merchant y have them formatted another way.
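
    Since each merchant formats its titles consistently, one parsing rule per merchant may be enough to separate the item name from the descriptors. A rough sketch follows; the two title layouts and the merchant keys are invented for illustration, so the real patterns would have to be written by looking at the actual feeds:

```python
import re

# Hypothetical title layouts, one pattern per merchant.
# merchant_x: "Acme Widget 3000 - Blue, Large"
# merchant_y: "[Blue/Large] Acme Widget 3000"
TITLE_PATTERNS = {
    "merchant_x": re.compile(r"^(?P<name>.+?)\s+-\s+(?P<extras>.+)$"),
    "merchant_y": re.compile(r"^\[(?P<extras>[^\]]*)\]\s*(?P<name>.+)$"),
}


def parse_title(merchant, title):
    """Split a title into (item name, list of descriptors)."""
    pattern = TITLE_PATTERNS.get(merchant)
    match = pattern.match(title.strip()) if pattern else None
    if match is None:
        # Unknown merchant or unexpected layout: keep the whole title as the name.
        return title.strip(), []
    extras = [part.strip() for part in re.split(r"[,/]", match.group("extras"))]
    return match.group("name").strip(), [part for part in extras if part]
```

    Once every merchant's titles are reduced to a bare item name, the names can be compared across vendors without the descriptors getting in the way.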
     
  6. netfish

    netfish Junior Member

    Joined:
    Mar 5, 2010
    Messages:
    106
    Likes Received:
    33
    Occupation:
    Software Engineering: Javascript, CSS, HTML5, PHP,
    Location:
    Baltimore, MD
    Separate them by passing them through intermediate data structures and crunch them up however you want; it'll help you divide and conquer the problem.
     
  7. C.S.A.

    C.S.A. Junior Member

    Joined:
    Mar 29, 2010
    Messages:
    150
    Likes Received:
    242
    I don't know how big your databases are, but you may want to investigate using something like Sphinx to help you do the comparisons.

    http://www.sphinxsearch.com/

    It's a WHOLE lot better than using SQL LIKE queries.
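
    To give a feel for why a word index tends to beat LIKE scans, here is a toy inverted index in plain Python. This is not Sphinx's API and says nothing about how Sphinx works internally; the function names and the min_shared cutoff are assumptions made for the sketch:

```python
import re
from collections import defaultdict


def tokenize(title):
    """Return the set of lowercase alphanumeric words in a title."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def build_inverted_index(products):
    """Map each token to the set of product ids whose title contains it."""
    index = defaultdict(set)
    for product_id, title in products.items():
        for token in tokenize(title):
            index[token].add(product_id)
    return index


def candidates(query, index, min_shared=2):
    """Return ids sharing at least min_shared tokens with the query, best first."""
    counts = defaultdict(int)
    for token in tokenize(query):
        for product_id in index.get(token, ()):
            counts[product_id] += 1
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [product_id for product_id, shared in ranked if shared >= min_shared]
```

    A lookup only touches products that share at least one word with the query, whereas a LIKE '%...%' pattern generally has to scan every row.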
     
  8. Tireswing

    Tireswing Newbie

    Joined:
    Mar 22, 2009
    Messages:
    9
    Likes Received:
    0
    Hmm.. So with something like this, I could dump all my products into my database and then go back and match them up based on search criteria.

    While I'm at it, I have another question.

    Right now I basically have three tables for this project.

    Products -- table full of the products. Contains pictures, descriptions, manufacturer ID (if there is one) and SKU. For the SKU I append the name of the vendor onto the front of it to make it unique.

    Vendors -- Name, ID number, Logo, FeedURL

    Prices -- SKU, Vendor ID, retail price, sale price, manufacturer ID.

    I had originally done things this way under the assumption that, as I parsed the feeds, I would determine which items were duplicates and simply attach multiple prices to that item. So, when I queried that table for the price of AwesomeThingy, I'd get price entries from every vendor that had it.

    My question is: (1) does this make sense, and (2) if I'm going to go back through and try to match up items, what is the best way to accomplish this?

    Thanks for your help guys.
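
    For reference, the three-table layout described above might look something like the sketch below. It uses SQLite through Python's sqlite3 module purely so the example is self-contained (the project itself is on MySQL), and the column names, sample rows, and the prices_for helper are all made up. One assumed change from the description: the price row carries a product_id that is filled in once a match is found, and (vendor_id, sku) serves as the unique key instead of prefixing the vendor name onto the SKU.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vendors (id INTEGER PRIMARY KEY, name TEXT, logo TEXT, feed_url TEXT);
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, manufacturer_id TEXT);
-- One row per vendor listing; product_id is filled in once a match is made.
CREATE TABLE prices (
    vendor_id INTEGER REFERENCES vendors(id),
    sku TEXT,
    product_id INTEGER REFERENCES products(id),
    retail_price REAL,
    sale_price REAL,
    PRIMARY KEY (vendor_id, sku)
);
""")
conn.executemany("INSERT INTO vendors (id, name) VALUES (?, ?)",
                 [(1, "VendorA"), (2, "VendorB")])
conn.execute("INSERT INTO products (id, name) VALUES (1, 'AwesomeThingy')")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?, ?, ?)",
    [(1, "A-100", 1, 19.99, 17.99), (2, "ZX9", 1, 21.50, 21.50)],
)


def prices_for(product_name):
    """Return (vendor name, sale price) rows for one product, cheapest first."""
    return conn.execute(
        """SELECT v.name, p.sale_price
           FROM prices p
           JOIN vendors v ON v.id = p.vendor_id
           JOIN products pr ON pr.id = p.product_id
           WHERE pr.name = ?
           ORDER BY p.sale_price""",
        (product_name,),
    ).fetchall()
```

    With this shape, matching two listings later is a matter of pointing both price rows at the same product_id; nothing has to be re-imported.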
     
  9. macpaulos

    macpaulos Regular Member

    Joined:
    Oct 14, 2009
    Messages:
    295
    Likes Received:
    53
    Very interested in how you got on with this project. Did you find a solution or did you end up having to manually edit everything?
     
  10. MisterGemini

    MisterGemini Senior Member

    Joined:
    May 25, 2010
    Messages:
    1,113
    Likes Received:
    714
    Occupation:
    Observe & Report
    Location:
    Alternate Universe
    You might want to consider looking at something like a comparison site script to help you sort that out.

    The handful that are out there have a built in datafeed mapper that would make it possible to match up like you want.

    You could load your DFs into something like that, then simply extract the data from MySQL to generate a new DF. :)

    Umm.. everyone has their favorites, but I believe you can get a free trial of DataFeed Studio to get this done.

    Otherwise, hope you figured something out by now.
     
  11. safety101

    safety101 Newbie

    Joined:
    May 25, 2010
    Messages:
    24
    Likes Received:
    0
    It is RSS.