Cryptocurrency analysis and predictions using AI and big data

Discussion in 'CryptoCurrency' started by healzer, Jan 3, 2018.

  1. healzer

    healzer Jr. VIP Jr. VIP

    Joined:
    Jun 26, 2011
    Messages:
    2,950
    Likes Received:
    2,934
    Gender:
    Male
    Location:
    Somewhere in Europe
    There are plenty of websites that allow you to access historical data through API or download a JSON/CSV file.
However, all my data is collected in real time; I did not use any historical data other than what I collected/scraped myself.

    I'll definitely check it out, thanks for letting me know about this!
     
  2. healzer

    healzer Jr. VIP Jr. VIP

    Joined:
    Jun 26, 2011
    Messages:
    2,950
    Likes Received:
    2,934
    Gender:
    Male
    Location:
    Somewhere in Europe
    ========
======== Jan 9, 2018
    ========

For the past few days I have been trying to isolate a bug in my Apache Spark setup.
After several hours of running, it simply crashes, without any error codes or clues.
I think it may be an out-of-memory exception: some piece of code keeps accumulating memory and the GC never cleans it up.
I've also upgraded to Spark v2.2.1 (I had 2.2.0) and it seems to run faster, but it still crashes (so it's most likely due to something in my own code).

    Apart from the above, I have spent quite some time messing with the machine learning part to make predictions.
    I have made several improvements to increase its accuracy and my own understanding of the algorithm.

    Here is a chart that is generated by Keras (machine learning framework):
    [​IMG]
Green line: real BTC price ($ USD). The range [0, 59] was used to train the system.
Red line: prediction of the BTC price ($ USD). I let it start predicting from x=51 (using data from the green line). We can clearly see that the prediction algorithm follows the price's trend well until x=60; from there on it uses a mix of its own predictions and some historical data to make new predictions (= wild guesses).
Yellow line: real hype %
Purple line: prediction of the hype %; here it follows the real hype's trend until x=59, and from then on it makes wild guesses.

    Luckily we can validate/compare the wild guesses to the reality:
    [​IMG]
    x=56 on the previous graph is the peak of "Jan 01, 02:00".

When we compare the wild guesses to historical data, we see that the price did indeed decrease.
But it only started to go up again much later (at Jan 01 11:00), while our algorithm predicted the increase much sooner.

Same with the hype: it did go up, but the prediction showed it reaching a certain peak and then going down, whereas in reality there was a very big spike which came later than predicted.


    We can conclude that short-term predictions are pretty accurate, but long-term ones (more than 3 hours for instance) are not.
    Either way I think this is some real solid progress already !! :D

    I hope I will fix that Apache Spark bug very soon so I can start gathering more data and then expand to many more coins.
    Thanks for reading! :)
     
  3. mindmaster

    mindmaster Jr. VIP Jr. VIP

    Joined:
    Sep 16, 2010
    Messages:
    3,687
    Likes Received:
    1,695
    Home Page:
    If you will ever release the bot and need beta testers, I am in.
    :)
     
    • Thanks Thanks x 1
  4. healzer

    healzer Jr. VIP Jr. VIP

    Joined:
    Jun 26, 2011
    Messages:
    2,950
    Likes Received:
    2,934
    Gender:
    Male
    Location:
    Somewhere in Europe
    Thanks mate, I appreciate your kindness!
     
  5. healzer

    healzer Jr. VIP Jr. VIP

    Joined:
    Jun 26, 2011
    Messages:
    2,950
    Likes Received:
    2,934
    Gender:
    Male
    Location:
    Somewhere in Europe
    =========
========= Jan 10, 2018
    =========

    Hi guys :)

    Got some good news for you today.
    Yesterday night I was looking at my social sentiment data and noticed an anomaly:
The sentiment is designed to stay within the interval [-100, 100], but here it was -1300.
When I spotted this I went on to fix it. At first I thought the data was incorrect, but the data was just fine, phew...
Fortunately the problem was in my own code for displaying the data on the graph.
I wasn't able to spot this error before because it only occurs in specific cases (this was one of them).
    [​IMG]
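To make bugs like this impossible at the data level, a sentiment score can be built so it can't leave [-100, 100] by construction. A minimal sketch (hypothetical formula, not necessarily how my scoring works):

```python
# Hypothetical sketch (not my actual code): a net sentiment score that is
# guaranteed to stay within [-100, 100], no matter the input counts.
def net_sentiment(pos_count, neg_count):
    """Return net sentiment in [-100, 100] from raw pos/neg mention counts."""
    total = pos_count + neg_count
    if total == 0:
        return 0.0  # no mentions -> neutral
    # (pos - neg) / total is in [-1, 1], so scaling by 100 keeps the bound
    return 100.0 * (pos_count - neg_count) / total

print(net_sentiment(30, 10))  # 50.0
print(net_sentiment(0, 5))    # -100.0
```

A value like -1300 can then only ever come from the display layer, which narrows down where to look.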

Having patched the bug, I made a nice discovery...
Now the sentiment trend looked completely different. Apparently this bug had been causing problems the entire time, but I couldn't see it.
    Have a look at this new version of the chart:
    [​IMG]

It looks way better now, doesn't it? :)
The above is data at one-minute intervals, which isn't very reliable at such a short timescale -- however, it did appear to follow the rises and drops in the BTC price.
    Here is another region with 30min intervals:

    [​IMG]
    A few interesting regions:
    • from 16:00 to 19:00 the price was dropping, and we clearly see the sentiment go down as well.
    • at 21:00 the sentiment reached a local peak and afterwards the price had a nice bump as well.
• We see the reverse scenario at 00:00 - 00:30.
• The sentiment peak at 02:30 is another reverse scenario (the price was actually going down, so the sentiment should've gone down as well).
I assume that the "reverse scenarios" are due to imperfections in how our algorithm calculates sentiment (it still uses a very basic/naive sentiment analysis method).
Even though the analysis method is basic, it does allow us to anticipate certain scenarios in some cases.
    The next step is to keep collecting more data and then add it to my Keras machine learning code and see whether it improves prediction accuracy.

I still haven't been able to find out why Apache Spark keeps failing/hanging, but I'm getting one step closer every time it fails :)
    Cheers!
     
    • Thanks Thanks x 2
  6. ZenFa

    ZenFa Newbie

    Joined:
    May 15, 2016
    Messages:
    1
    Likes Received:
    1
    This is a very interesting project. Looking forward to seeing how you'll improve it even further, but I have to say: your outcomes are already very impressive. :)
     
    • Thanks Thanks x 1
  7. healzer

    healzer Jr. VIP Jr. VIP

    Joined:
    Jun 26, 2011
    Messages:
    2,950
    Likes Received:
    2,934
    Gender:
    Male
    Location:
    Somewhere in Europe
    =========
========= Jan 11, 2018
    =========

Yet another all-nighter... what a day yesterday/today was :)
For over a week I was pulling my hair out trying to figure out the Apache Spark bug.
I was finally able to figure out how to fix it! :D
Usually Spark crashed after about 4 hours, but not today!
    [​IMG]

    ===== fixing Apache Spark crashes

    (If you don't care about Spark, you can skip the following block of text)
I still don't know the root cause (I wasn't able to obtain any error logs or debugging info).
But in case anyone ever finds this post and has a similar issue with PySpark (version 2.2.0 or 2.2.1), here's a short summary of the problem and how to fix it:
After several hours my Spark job just got stuck on a certain stage, showing 2/4 -- it always got stuck at the same number. Not sure if it's related, but I use exactly four "foreachRDD" functions to process RDDs from a Kafka topic through directStream. At first I thought the JVM was running out of memory, or that there was a memory leak in my code, or a nasty bug in PySpark itself. After ruling out every scenario, and almost getting my IP banned by Google from all the searching... in the end, he who seeketh shall find.
When I first installed Apache Spark, I followed a guide, but as usual it was quite outdated. I had installed (through pip install) a fairly recent version (2.2.0) and then used the spark-submit command to launch the Spark driver. What I did not pay attention to was the "spark-streaming-kafka" package: its version must match your installed Kafka/Zookeeper version and your Spark version. By now I had already upgraded Spark to 2.2.1, so my spark-submit command looks like this (pay attention to the version numbers):
    Code:
    spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.1  yoursparkfile.py
Here, 2.2.1 is your Spark version.
Then you have to search the Maven website: https://search.maven.org/#search|ga|1|spark-streaming-kafka
and you will see "spark-streaming-kafka-0-8-assembly_2.11". We need this one because the "...-0-10-..." variant is not available for PySpark (yet).
My mistake was that both the Spark version and the Maven artifact version were outdated (I had them set to 2.0.0, like wtf). Unfortunately this is not documented very well on Spark's website, and even the many installation tutorials you find through Google don't explain this part.

    ===== Predicting Bitcoin's price using machine learning

    I've been busy toying with Keras machine learning all night until this morning.
Previously our system only took two parameters: historical hype and price.
I have now adjusted the code to accept an "unlimited" number of parameters on the fly.
So I have added positive and negative sentiment counts -- now it has 4 parameters to make predictions with.
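For the curious, the "unlimited parameters" change basically means each timestep becomes a vector instead of a single number. A tiny sketch with made-up values (the feature names are illustrative, not my exact inputs):

```python
# Sketch: with N features, each timestep becomes a vector of N values
# instead of a single number. All numbers below are made up.
price     = [100, 101, 103]
hype      = [5, 7, 6]
pos_count = [12, 15, 9]
neg_count = [3, 2, 8]

# zip the parallel series into one row of 4 features per timestep
samples = [list(row) for row in zip(price, hype, pos_count, neg_count)]
print(samples[0])                      # [100, 5, 12, 3]
print(len(samples), len(samples[0]))   # 3 timesteps x 4 features
```

Adding a fifth feature is then just one more series in the `zip`, which is what makes the parameter count "unlimited".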
    And since I had over 18 hours of consistent data (thanks to Spark), I have some nice graphs to show you:
    [​IMG]
    (higher resolution image: https://i.imgur.com/PxLvUBZ.png )
*) this is a 10-minute interval chart (the x-labels on the image are incorrect) -- every point is exactly 10 min apart.
The x-axis's date and time starts at exactly 2018-01-10, 09:00am (EST timezone).

The slightly faded lines (the grayish and orange ones) represent, from top to bottom: the BTC price, the social hype, and the pos/neg trend.
The latter two don't really matter here, since the Y-scale is too large and I don't intend to predict pos/neg sentiment anyway.
    But the hype and price are of interest to me.
    I've drawn a green box to indicate which areas were generated/predicted by the machine.
    And as you can see they represent a theoretical extension of the real data.
    The system tells us that the hype will keep fluctuating up & down, while the price will steadily go up.

However, this particular prediction is essentially random; I chose it because I would like to see the price increase.
In reality there are four core parameters I have to play with, and each one can lead to a completely different prediction.
These are: the number of hidden neurons, the batch size, the number of epochs, and the window size.
My next task is to study how to find the optimal parameters; then I can make more realistic/accurate predictions.
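The usual starting point for finding those optimal parameters is a plain grid search: train one model per combination and keep the one with the lowest validation error. Here's a sketch of just the enumeration part; the candidate values are made up, and `train_and_score` is a dummy stand-in for the actual Keras training run:

```python
from itertools import product

# Candidate values are made up for illustration
grid = {
    "hidden_neurons": [32, 64, 128],
    "batch_size": [16, 32],
    "epochs": [50, 100],
    "window_size": [1, 10, 20],
}

def train_and_score(params):
    # Placeholder: in reality this would build and train the Keras RNN
    # with these parameters and return its validation loss.
    return sum(params.values())  # dummy score so the sketch runs

names = list(grid)
combos = [dict(zip(names, values)) for values in product(*grid.values())]
best = min(combos, key=train_and_score)
print(len(combos))  # 3*2*2*3 = 36 combinations to try
```

With 36 combinations and each training run taking minutes, this is slow but brainless; smarter searches (random search, Bayesian optimization) only matter once the grid gets too big.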

    Thanks for reading! :D
     
    • Thanks Thanks x 1
  8. fredfredricks

    fredfredricks Registered Member

    Joined:
    Jan 2, 2017
    Messages:
    77
    Likes Received:
    11
    Best programmer in the world. Cheers
     
    • Thanks Thanks x 2
  9. Bel1616

    Bel1616 Jr. VIP Jr. VIP

    Joined:
    Jan 6, 2017
    Messages:
    240
    Likes Received:
    50
    Gender:
    Male
    I'll wait for the release, if you do
     
    • Thanks Thanks x 1
  10. healzer

    healzer Jr. VIP Jr. VIP

    Joined:
    Jun 26, 2011
    Messages:
    2,950
    Likes Received:
    2,934
    Gender:
    Male
    Location:
    Somewhere in Europe
    =========
========= Jan 12, 2018
    =========

Yesterday I spent a few hours playing with the parameters of the neural network (an RNN).
Primarily I tested how the outcome differs with the window size and with the number of features/parameters (e.g. social signals, sentiments, ...) used to predict the price.
    Have a look at these graphs:

*) My RNN makes predictions based on sequences, e.g. given data from the past 3 hours it predicts t+1.
*) "Window size" means how many time intervals (hours) to look back (= the length of the sequence).

Window size = 1, and one feature (only the price):
    [​IMG]

Window size = 1, and 2 features (social mentions and price):
    [​IMG]

From the two charts above we see that the predicted values are very smooth, so a window size of 1 is not a good prediction model. The reason is that each prediction has a certain error, and each prediction is fed back in to compute the next one -- so the error compounds (error upon error).
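For reference, "window size" just controls how the series is sliced into training pairs. A pure-Python sketch (illustrative, not my actual preprocessing code):

```python
# Sketch of turning a price series into (window, next_value) training pairs,
# which is what "window size" controls. The real pipeline feeds arrays like
# these to the Keras RNN.
def make_windows(series, window_size):
    """Return (inputs, targets): each input is `window_size` consecutive
    points, each target is the point that follows that window."""
    inputs, targets = [], []
    for i in range(len(series) - window_size):
        inputs.append(series[i:i + window_size])
        targets.append(series[i + window_size])
    return inputs, targets

prices = [10, 11, 13, 12, 14, 15]  # made-up toy series
X, y = make_windows(prices, window_size=3)
print(X[0], y[0])  # [10, 11, 13] 12
print(len(X))      # 3 training pairs from 6 points
```

With `window_size=1` every input is a single point, so once real data runs out the model only ever sees its own previous guess, which is exactly why those curves go smooth.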

    Let us use a window size of 20 and only one feature (the price):
    [​IMG]
This looks way more realistic, doesn't it?

    And now with 2 features: price and social mentions (window remains 20):
    [​IMG]
    This looks even better.

    Let us try using 4 features: price, social mentions, negative sentiment count, positive sentiment count (window = 20):
    [​IMG]
* this graph has one extra known data point (the graphs above do not); that's my mistake, because the system always uses the most recent data. It has no impact on what follows.

    Now, how did these predictions compare to the reality?
    After a few hours I had the real price so I could compare it against the predictions.
    Here is the real price's graph:
    [​IMG]
    (I have hidden all other graphs because there were gaps due to Spark crashes).

    Now I have used a bit of Photoshop to put the real data on top of the predictions.

    Here is the chart with 2 features:
    [​IMG]

    And here's the one with 4 features:
    [​IMG]

We see that the prediction using 2 features resembles the real price better (in shape) -- however, the amplitude of its peaks is lower than in the real data.
Nonetheless this is good news: the prediction has two interesting regions, a peak that matches the real data and, afterwards, a drop similar to the real one.

I thought the graph with 4 features would represent the real data even better, but not really.
It was a bunch of random rises & drops that don't follow the price at all. Maybe I need to tweak some other parameters to make it predict better.
    As I was writing this post I received my package:
    [​IMG]
I think I'll learn a lot of useful stuff from this book (hopefully).


    =========== Apache Spark, again....

I thought I had fixed the Spark issue, but after 23 hours and 42 minutes it crashed on me again.
I've gotten in touch with a few Spark experts, but no answers yet.

At 6am this morning I ordered a new server from DigitalOcean (insta-setup, thank god), and an hour later I had a fresh setup of my environment up and running.
I've also followed different tutorials/methods for setting up Spark and its various components. Maybe this setup won't cause any problems; time will tell.

    Have a great day!! :D
     
    Last edited: Jan 12, 2018
  11. cnick79

    cnick79 Senior Member

    Joined:
    Jun 10, 2010
    Messages:
    830
    Likes Received:
    421
    Location:
    Wandering
    This is great Work! I’m surprised the prediction using the 4 parameters didn’t yield better results. Something doesn’t seem right about that. I’m sure you will figure it out!
     
    • Thanks Thanks x 1
  12. Lothric

    Lothric Regular Member

    Joined:
    Apr 25, 2017
    Messages:
    204
    Likes Received:
    51
how will you handle the emotional part of the market?
and there's no way you could scrape billions of data points and effectively process them in real time; you'd have to build an extremely robust data-fetching system covering thousands of sites before making any AI
real-time data from social media, google, youtube, even telegram.. don't you think it's already too much? plus you can't scrape google that easily, as they use botguard which does bytecode browser fingerprinting
there's no way you could accurately predict the price; it's way too heavy a project for you.. or even for small companies.. maybe a company like DeepMind could do it
i'd rather focus on making something like a newswire
     
    Last edited: Jan 12, 2018
  13. ttmschine

    ttmschine Power Member

    Joined:
    Mar 27, 2013
    Messages:
    631
    Likes Received:
    359
    I think it's great that you're putting so much effort into an intelligent-ish idea, but your whole premise is flawed I'm afraid.

First, all "technical analysis" is retrospective, it has to be by its very nature, which means that you're always going to be behind the curve of what's really happening.

    You're hoping that what's happened before will be a precursor to what happens again in the future, but in the real world it just doesn't happen like that - you're not dealing with machines or equations that always give the same response and output, you're dealing with markets driven by people who are a big bag of hormones, emotions, fears, worries, greed, needs, etc.

    It's unpredictable.

    Second is that no matter how much data you process you can't predict 2 planes flying into the WTC one morning, or the POTUS being assassinated, or a major accident, or incident, a Tsunami, whatever it might be.

And like it or not, these events have profound consequences and affect markets dramatically.

Plus, with cryptos, think about the unknown unknowns (to coin a phrase) that could affect them.

It's unregulated so the banks can't touch it, yada, yada, yada, so it's safe - but all "they" need to do is go for a big chunk of the infrastructure - something you might never have thought of, like the US Govt putting huge pressure on China to shut down the miners - then the whole thing grinds to a halt, since the actual processes that make the thing (kind of) work are eliminated.

Remember that China holds trillions of dollars of US debt, so if cryptos did affect real-world currency values the Chinese would be as mashed up as the USA - it is in no government's interest to let that happen, so they will collude.

Plus of course one bad apple spoils the whole batch - one shitcoin goes pop and it may well undermine confidence in the whole blockchain phenomenon, or the "pump & dump" gangs are arrested and charged with fraud, which implies to Joe Public that the whole thing is a fraud, and etc, etc, etc.

Any one of multiple things could happen, many of which we can't even conceive of, and the whole house of cards collapses - and "technical analysis", which is essentially what you're doing, can't measure, predict, or prepare you for that.
     
    • Thanks Thanks x 1
  14. Bel1616

    Bel1616 Jr. VIP Jr. VIP

    Joined:
    Jan 6, 2017
    Messages:
    240
    Likes Received:
    50
    Gender:
    Male
Have you seen any screener that is already doing this? Because with this thread you could make a lot.
     
  15. healzer

    healzer Jr. VIP Jr. VIP

    Joined:
    Jun 26, 2011
    Messages:
    2,950
    Likes Received:
    2,934
    Gender:
    Male
    Location:
    Somewhere in Europe
    @Lothric @ttmschine
    Hi guys,
    Thank you for your honest perspective on this project, I appreciate you joining in on the discussion :)
    There is so much I would like to say, but I'll keep it very short and brief.

The whole purpose of financial analysis and making predictions is not to make perfectly accurate predictions.
The goal is to increase your chances of winning; this is what broker firms, professional investors, etc. do.
They all use data and math/stats to go from a 50/50 chance to 55/45. Gambling is a pure 50-50 scheme (unless it is rigged).
But investing is not, except for pump & dump ICOs.

    If you read through the comments of other people, you'll see there are some really nice inputs from smart people.
E.g. each coin is in a certain stage -- so the idea is to categorize coins by stage, because as a coin goes from stage A to B, it goes up/down in value.
Bitcoin today is a far cry from what it was in 2014. And altcoins are a far cry from where Bitcoin is now, but some are where Bitcoin was in 2014.
The whole idea is to detect changes in "stage". E.g. a year ago Ripple was barely talked about, whereas recently it has been a hot subject.

Everyone knows that perfectly accurate predictions are impossible; that is not my goal, nor should anyone try to achieve that.
However, detecting a viral tweet from Bill Gates saying he's going to invest in ETH/Ripple/whatever will have a major influence on that coin's price.
Finally, maybe you have heard of "back testing": it allows us to carefully analyze our short-term predictions and see how many times we would have made a profit/loss, and if that is above 51% then we have done more than a great job.
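To make the back-testing idea concrete, here's a toy sketch that counts how often the predicted direction (up/down) matched the real one; both series are made up:

```python
def backtest_hit_rate(predicted, actual):
    """Fraction of intervals where the predicted direction (up/down)
    matched the real direction of the price."""
    hits = 0
    total = len(actual) - 1
    for i in range(total):
        pred_up = predicted[i + 1] > predicted[i]
        real_up = actual[i + 1] > actual[i]
        if pred_up == real_up:
            hits += 1
    return hits / total

# Made-up example series, not real model output
predicted = [100, 102, 101, 103, 104]
actual    = [100, 101, 102, 104, 103]
print(backtest_hit_rate(predicted, actual))  # 0.5 -> no edge over a coin flip
```

A hit rate consistently above ~0.51 on held-out data (before fees/slippage) is what "more than a great job" would mean here.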

As for scraping: I am not going to scrape every dark corner of the internet to find every single piece of content about a certain coin.
You don't need to analyze the entire ocean to detect radioactive material; if something is "off", it leaves traces everywhere.
We don't need to capture every single tweet about BTC to show how popular the topic was today versus yesterday. Even if we capture only 1% of the tweet volume, we will see which day had more virality.

    @Bel1616
I did look at various coin screeners, but none do what I am doing.
As others have mentioned, there is "Solume", which is a very basic version: they do sentiment analysis on Reddit posts and Tweets, but that's about it.

    =========
========= Jan 13, 2018
    =========
    Today was less productive.
We detected a leak in our roof, so I spent several hours patching that up.

    Yesterday I stayed up very late because Spark crashed again, but I detected something interesting that might be "the bug". I'll keep you updated tomorrow :)

    Have a great weekend all!
     
    • Thanks Thanks x 4
  16. healzer

    healzer Jr. VIP Jr. VIP

    Joined:
    Jun 26, 2011
    Messages:
    2,950
    Likes Received:
    2,934
    Gender:
    Male
    Location:
    Somewhere in Europe
    =========
========= Jan 14, 2018
    =========

Yesterday I spent almost all day debugging Apache Spark.
I set up two brand-new servers/droplets and installed everything from scratch.
I have also read that PySpark does not work well with Python 3.6, so maybe that's the cause -- I kept it on v3.5 instead.
Right now both setups have been running for over 11 hours without a problem.
    --

I have also made a small but important change to the trendline function.
You may have noticed that in previous versions the trendlines were calculated from left to right.
This means that if (n_points % n_trendline != 0), the graph would simply append the remaining point(s) at the end.
However, the "end" (rightmost part) of a graph is its most important part, so we should definitely include it in the general trend.
To solve this, I now calculate the trendlines from right to left, so any remaining points are appended at the start:

    [​IMG]
* trendlines based on 5 points.
* notice the very first point (at the far left), which is appended on its own because 26 % 5 == 1
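The change boils down to chunking the series from the right, so the leftover points end up at the start instead of the end. A minimal sketch (hypothetical helper, not my actual charting code):

```python
def chunks_right_to_left(points, size):
    """Split points into chunks of `size`, walking from the right,
    so any remainder (len(points) % size) ends up as the first chunk."""
    chunks = []
    end = len(points)
    while end > 0:
        start = max(0, end - size)
        chunks.insert(0, points[start:end])
        end = start
    return chunks

points = list(range(26))  # 26 points, and 26 % 5 == 1
result = chunks_right_to_left(points, 5)
print(len(result[0]))   # 1 -> the lone leftover point sits at the far left
print(len(result[-1]))  # 5 -> the most recent points form a full chunk
```

This guarantees the rightmost (most recent) points always form complete trendline segments, which is the whole point of the change.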

    Extra.
Since the very beginning I have only been showing graphs for Bitcoin (BTC).
However, early on I also added ETH to the pipeline, so here's a live view of Ethereum's price:
    [​IMG]
* the date/time is GMT+1
* the social hype looks like the exact inverse of the price's trend (8-point trendline)

    Have a great day all!
     
    • Thanks Thanks x 1
    Last edited: Jan 15, 2018
  17. bartosimpsonio

    bartosimpsonio Jr. VIP Jr. VIP

    Joined:
    Mar 21, 2013
    Messages:
    14,483
    Likes Received:
    12,947
    Occupation:
    MACHIN LURNIN
    Location:
    TUVALU
    Home Page:
    This is probably one of the best crypto discussions on the entire internets. Great work OP.
     
    • Thanks Thanks x 2
  18. KHer0

    KHer0 Jr. VIP Jr. VIP

    Joined:
    Mar 22, 2011
    Messages:
    1,630
    Likes Received:
    1,516
    Gender:
    Male
    Wow, I am sure a whale company will notice this thread and buy the program for a lot of zeros.

Side tip: try Vultr instead of DO. It's a lot cheaper and faster. You will thank me later :)
     
    • Thanks Thanks x 1
  19. healzer

    healzer Jr. VIP Jr. VIP

    Joined:
    Jun 26, 2011
    Messages:
    2,950
    Likes Received:
    2,934
    Gender:
    Male
    Location:
    Somewhere in Europe
    Thanks :)
I have been reading a lot of DO vs. Linode vs. Vultr discussions, etc...
For now I will stick with DO since I've been a loyal customer for many years.
Their support team has always helped me when I needed it and has never disappointed me.
But for scaling I may use a combination of Vultr (which is cheaper) for raw data processing and DO for hosting.
I haven't done any testing myself to say whether DO is faster than Vultr or vice versa, but I might look into that soon :)
     
  20. DigitalAdvanced

    DigitalAdvanced Registered Member

    Joined:
    Jan 8, 2015
    Messages:
    99
    Likes Received:
    53
You obviously put some serious time and energy into this project. I totally believe you are on the right track in terms of gathering data and information based on what people are discussing across various sources. The same is true when working with the stock market. The thing with crypto is that it is at a point where it's not a new idea. At first it had its risks and few people knew about it, but now everyone knows. So what is the progressive psychology of the buying and selling patterns for this type of concept? That is what I am wondering, as it will influence where people gather their information, what information they are being fed, how their social networks influence their decisions towards cryptocurrencies, and which currencies stand the test of time and which don't. Bitcoin will eventually become the AOL of crypto, I am sure, but who will replace it and why, and what social influencing factors will dictate which currencies become the standards?
     
    • Thanks Thanks x 1