crawling comments on a community?
Hi!
I'm trying to crawl and parse comments on a community for a fandom event (http://hs-worldcup.dreamwidth.org, if you're curious). I've run into a bunch of issues, including a lack of API documentation, and have talked to DW Support a couple of times, but I feel like I am further from successfully doing anything than when I started. Before I say anything else, here is What I Am Really Trying To Do:
- take an individual community post (example: http://hs-worldcup.dreamwidth.org/3493.html#comments)
- download all of the comments with some sort of threading information; the data I need in particular is the comment subject, comment author, comment content, and whether or not it's a reply and, if so, to which comment
- parse out that data, transform it, and add it to a database (which I don't think is super relevant to this question, but I can go into more detail if necessary)
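To make the target concrete, here is a minimal sketch of the record shape and storage step described in the list above. Everything in it is an assumption for illustration: the `Comment` fields, the numeric ids, the table layout, and the helper names are hypothetical, and it says nothing about how the comments are actually fetched or parsed from Dreamwidth.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class Comment:
    """One comment, with just enough information to rebuild the thread."""
    comment_id: int            # hypothetical numeric id
    parent_id: Optional[int]   # None for a top-level comment, else the parent's id
    author: str
    subject: str
    content: str


def create_schema(conn):
    """Create the (hypothetical) comments table if it does not exist yet."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS comments (
            comment_id INTEGER PRIMARY KEY,
            parent_id  INTEGER REFERENCES comments(comment_id),
            author     TEXT NOT NULL,
            subject    TEXT NOT NULL,
            content    TEXT NOT NULL
        )
        """
    )


def save_comments(conn, comments):
    """Insert comments, overwriting any row that has the same comment_id."""
    conn.executemany(
        "INSERT OR REPLACE INTO comments VALUES (?, ?, ?, ?, ?)",
        [(c.comment_id, c.parent_id, c.author, c.subject, c.content)
         for c in comments],
    )
    conn.commit()


def replies_to(conn, comment_id):
    """Return the ids of the direct replies to a comment, in id order."""
    rows = conn.execute(
        "SELECT comment_id FROM comments WHERE parent_id = ? ORDER BY comment_id",
        (comment_id,),
    )
    return [row[0] for row in rows]


def thread_depth(conn, comment_id):
    """Count how many ancestors a comment has (0 for a top-level comment)."""
    query = "SELECT parent_id FROM comments WHERE comment_id = ?"
    depth = 0
    row = conn.execute(query, (comment_id,)).fetchone()
    while row is not None and row[0] is not None:
        depth += 1
        row = conn.execute(query, (row[0],)).fetchone()
    return depth


# Made-up sample data standing in for parsed comments.
conn = sqlite3.connect(":memory:")
create_schema(conn)
save_comments(conn, [
    Comment(1, None, "alice", "Team sign-ups", "Top-level comment."),
    Comment(2, 1, "bob", "Re: Team sign-ups", "A reply to comment 1."),
    Comment(3, 2, "alice", "", "A reply to the reply."),
])
```

Storing only a parent id per comment is enough to answer "is this a reply, and to what?" and to rebuild the full thread later, whichever way the comments end up being downloaded.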
If that is what I should do, how do I get around the adult content warning? Is there a flag I can pass with the URL or something? Do I need to do something more complicated than just using curl to grab the pages? Is there something I can pass to say "just give me one piece of HTML with all 5000 comments on it; it will probably be easier for both of us"?
Thank you for any suggestions or advice you might have.
no subject
Check recent entries in this comm -- we're working on deprecating the existing API and replacing it with one that was, uh, conceived in the 21st century.
no subject
Well, my point is that with a bit of luck, the existing APIs will be deprecated within the next year or so and people who want to do things like what you're doing will be able to use the new API. So, while documentation is always a good thing, documentation of the XML-RPC API might be a bit of wasted effort!
Which is not to say "don't do it", just that it may be made obsolete fairly quickly.
no subject
I'm not aware of anybody actively working on documenting this better/at all. It's relatively little used; admittedly, possibly because it's mostly undocumented.
We're always happy to have people work on the docs. We've generally been using the wiki for documentation like this, and if you're interested in helping, I can point you places (or rather, I can ask some awesome folks to step in and help point you, since they are awesome and know the current taxonomy etc).