Poll of the Day > Dear python developers of PotD

AwesomeTurtwig
11/25/18 2:58:46 AM
#1:


For the love of god, how the hell do I scrape this website?

https://www.streetinsider.com/dividend_history.php?q=DNKN

I've tried many different methods, but every time the site knows I'm a damn bot.

I tried changing the user agent. Nah, didn't work. Tried using a headless browser. NOPE.

Can someone crack the code and figure out how to snag the content off this site?

I might have a reward for you if you can figure it out.

Robots for reference.

https://www.streetinsider.com/robots.txt
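
For anyone who wants to check a robots.txt from Python rather than by eye, here is a minimal sketch using the standard library's `urllib.robotparser`. The rules in `SAMPLE_ROBOTS` and the `my-scraper` user agent are made up for illustration; they are not the actual contents of the file linked above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; the real robots.txt may differ.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(robots_text, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical URLs on example.com, checked against the sample rules above.
page_ok = is_allowed(SAMPLE_ROBOTS, "my-scraper",
                     "https://example.com/dividend_history.php?q=DNKN")
private_ok = is_allowed(SAMPLE_ROBOTS, "my-scraper",
                        "https://example.com/private/report")
```

To check a live site instead, `RobotFileParser` also has `set_url()` and `read()`, which download the file before you call `can_fetch()`.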
---
Yellow
11/25/18 3:38:24 AM
#2:


Are you doing a neural network sort of deal?

Do they allow proxies? Why not write a program that actually hijacks your own "trusted" browser, then open multiple instances of that hijacked browser connected directly to individual proxies? If you find a good proxy you should be able to scrape at a good rate.

If you still fail at that point you should be able to spoof certain inputs to make your robot seem more "human" anyway.

That's my "I don't use Python so I don't know what you're doing wrong" approach. The only scraper I wrote was in VB, and I didn't fight sites that didn't want me there. :P
---
AwesomeTurtwig
11/25/18 3:55:09 AM
#3:


It looks like it sets a key in the user's cookies. This is not trivial to parse. Finding another database might be easier.
---
Yellow
11/25/18 4:06:27 AM
#4:


AwesomeTurtwig posted...
It looks like it sets a key in the user's cookies. This is not trivial to parse. Finding another database might be easier.

South Africa
https://www.kaggle.com/geoffnoble/sadividends

1.4 MB
https://www.kaggle.com/jonnylangefeld/explore-dividends

Can I ask why you need these dividends? I'm assuming you're trying to use the data to help predict stock price changes?
---
AwesomeTurtwig
11/25/18 4:07:02 AM
#5:


It's easier to scrape this.

https://www.nasdaq.com/symbol/dnkn/dividend-history

No div yield, or dollar amount. But you can calculate it with the data in that table at the very least.
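
A rough sketch of that calculation in Python. It assumes each table row gives an ex-dividend date and a per-share cash amount, and that the share price comes from somewhere else. The function name and every number below are made up for illustration; none of it is real DNKN data.

```python
from datetime import date, timedelta

def trailing_dividend_yield(payments, price, as_of):
    """Sum the dividends with ex-dates in the 365 days up to as_of, divided by price.

    payments is a list of (ex_date, cash_amount_per_share) tuples.
    """
    window_start = as_of - timedelta(days=365)
    total = sum(amount for ex_date, amount in payments
                if window_start < ex_date <= as_of)
    return total / price

# Made-up example values for illustration only.
example_payments = [
    (date(2017, 6, 1), 0.20),    # older than one year, so it is ignored
    (date(2018, 3, 1), 0.25),
    (date(2018, 6, 1), 0.25),
    (date(2018, 9, 1), 0.25),
    (date(2018, 11, 20), 0.25),
]
example_yield = trailing_dividend_yield(example_payments, 50.0, date(2018, 11, 25))
```

This is the trailing twelve-month definition of yield. Some sites instead annualize the most recent payment, so the result may not match a published figure.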
---
Yellow
11/25/18 4:08:27 AM
#6:


Why do you need the data?
---
AwesomeTurtwig
11/25/18 4:10:09 AM
#7:


Yellow posted...
Can I ask why you need these dividends? I'm assuming you're trying to use the data to help predict stock price changes?

Homework for my datamining course.
---
Yellow
11/25/18 4:21:54 AM
#8:


My inner evil mastermind cringes at the idea of giving your mined financial data set to your professor for free.

https://www.youtube.com/watch?v=Wf66LjGuoRw

---
Sahuagin
11/25/18 4:47:49 AM
#9:


hmmm, yes, if I do an HTTP GET request from code, I get back a "we think you're a bot" page.

a suggestion here is to use "selenium web driver" (hmmm, though that's probably what you mean by "headless browser")

https://security.stackexchange.com/questions/71869/bot-detection-via-browser-fingerprinting

maybe you could do it with a userscript?

for example, the div that holds the main table has id "content", so you can pull out its html with $("#content").html();

you could then use ajax to write the html to a server, and then process it however you like

(or you could do your processing in javascript, pull data out of the html and send that back to the server)
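
If the processing ends up on the Python side, a minimal sketch of pulling table rows out of the captured html with only the standard library could look like this. The class and function names are hypothetical, and `SAMPLE_HTML` is a made-up fragment, not the site's actual markup.

```python
from html.parser import HTMLParser

class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None outside a row
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None

def extract_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows

# Made-up fragment for illustration; the real page's columns may differ.
SAMPLE_HTML = """
<div id="content"><table>
<tr><th>Ex-Date</th><th>Amount</th></tr>
<tr><td>1/2/18</td><td> $0.10 </td></tr>
</table></div>
"""
sample_rows = extract_rows(SAMPLE_HTML)
```

A third-party parser such as BeautifulSoup would be shorter, but the standard-library version avoids an extra install.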
---
Yellow
11/25/18 4:56:44 AM
#10:


Fight bots with bots. Make a neural network to generate human input. Simple enough...?

Sahuagin posted...
"selenium web driver"

I was wondering if something like that existed. This is why you're my favorite poster.
---
Judgmenl
11/25/18 7:17:31 AM
#11:


Use a headless browser:
https://duo.com/decipher/driving-headless-chrome-with-python
---
Judge, Nostalgia is a hell of a drug.
You're a regular Jack Kerouac
AwesomeTurtwig
11/25/18 9:29:12 AM
#12:


Sahuagin posted...
a suggestion here is to use "selenium web driver" (hmmm, though that's probably what you mean by "headless browser")

Yeah, I've tried Selenium. But it doesn't work.
---
Sahuagin
11/25/18 12:14:12 PM
#13:


some info here about why selenium can still be detected:

https://stackoverflow.com/questions/33225947/can-a-website-detect-when-you-are-using-selenium-with-chromedriver

for the userscript solution, you should be able to run a local webserver and use http post to send any data you've extracted from the page. not something I've done before so don't know for sure it'd work.
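
A minimal sketch of the receiving end in Python, using only the standard library. The handler name, the port, and keeping the posted bodies in an in-memory list are all assumptions made for illustration. Whether the browser will actually let a userscript post to it from that page is untested here.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # bodies posted so far, kept in memory for simplicity

class CollectHandler(BaseHTTPRequestHandler):
    """Accept POSTs from the userscript and remember each request body."""

    def _send_cors_headers(self):
        # Permissive CORS so a script running on another origin can post here.
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Access-Control-Allow-Headers", "Content-Type")

    def do_OPTIONS(self):
        # Answer the preflight request some browsers send before the POST.
        self.send_response(204)
        self._send_cors_headers()
        self.end_headers()

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8", errors="replace")
        received.append(body)
        self.send_response(200)
        self._send_cors_headers()
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence the default per-request logging

def make_server(port=8000):
    """Create a server bound to localhost only; port 0 picks a free port."""
    return HTTPServer(("127.0.0.1", port), CollectHandler)

# To run it: make_server().serve_forever()
```

Binding to 127.0.0.1 keeps the server unreachable from other machines, which seems like the safer default for a throwaway collector.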
---