[CITASA] seeking suggestions about research methodology for online forums
Caroline Haythornthwaite
haythorn at illinois.edu
Fri Feb 26 05:30:21 CST 2010
I forwarded Tim Hale's question to Anatoliy Gruzd at Dalhousie, but others may also be interested in his "TextAnalytics" system (see http://anatoliygruzd.com/home/?page_id=27). This provides a home for text analysis and network analysis of threaded discussions.
/Caroline
---- Original message ----
>Date: Thu, 25 Feb 2010 19:34:05 -0800
>From: Christine Morton <christine at christinemorton.com>
>Subject: Re: [CITASA] seeking suggestions about research methodology for online forums
>To: "Thomas M. Lento" <thomas.lento at gmail.com>, Tim Hale <timhale at uab.edu>
>Cc: CITASA at list.citasa.org
>
> Wow. Thank you for the question, Tim, and for
> sharing the outline of your research. It sounds
> very interesting and relevant. Would you be
> interested in sharing more details with me offlist?
> And thanks, Tom, for your thoughtful and thorough
> reply. I've learned a lot.
> Regards,
> Christine
>
> On 2/25/10 6:54 PM, "Thomas M. Lento"
> <thomas.lento at gmail.com> wrote:
>
> There are all kinds of supervised and
> semi-supervised content analysis approaches that
> might be applicable - their effectiveness depends
> on the nature of your data, the level of detail
> you require, and your ability to design training
> data. For example, if you have a superset of
> health information topics and keywords associated
> with them you could probably do a simple keyword
> analysis to give you an idea of which posts are
> talking about which topic. The advantage is that
> it's pretty easy to do even if the data is stored as
> text blobs in your MySQL database. However, if you
> need to do serious contextual analysis or you
> don't already know the scope and range of possible
> health topics and keywords that's a more difficult
> problem. You'll probably need to try some
> different machine learning approaches and see what
> works best for your needs. You can do literature
> searches for papers published at ICWSM, KDD, or
> WWW for examples of some approaches. There are
> other conferences that will be applicable, but
> those are good places to start. Follow the
> citation trail from the relevant pieces and you
> should get some idea of where to look for useful
> references.
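
The simple keyword analysis described above might look something like the following. This is only a minimal sketch: the topic names and keyword lists in TOPIC_KEYWORDS are hypothetical placeholders for a real codebook, and posts are assumed to arrive as (post_id, text) pairs pulled from the database.

```python
import re
from collections import defaultdict

# Hypothetical topic -> keyword mapping; replace with your own codebook.
TOPIC_KEYWORDS = {
    "diabetes": ["insulin", "blood sugar", "glucose"],
    "mental_health": ["anxiety", "depression", "therapy"],
}

def compile_patterns(topic_keywords):
    """Build one case-insensitive, whole-word regex per topic."""
    patterns = {}
    for topic, keywords in topic_keywords.items():
        alternatives = "|".join(re.escape(k) for k in keywords)
        patterns[topic] = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return patterns

def tag_posts(posts, topic_keywords=TOPIC_KEYWORDS):
    """Return {post_id: sorted topics whose keywords appear in the text}.

    Posts that match no topic are left out of the result.
    """
    patterns = compile_patterns(topic_keywords)
    tags = defaultdict(list)
    for post_id, text in posts:
        for topic, pattern in patterns.items():
            if pattern.search(text):
                tags[post_id].append(topic)
    return {post_id: sorted(topics) for post_id, topics in tags.items()}

if __name__ == "__main__":
    sample = [
        (1, "My blood sugar was high again this morning."),
        (2, "Therapy helped my anxiety more than I expected."),
        (3, "Anyone have a good soup recipe?"),
    ]
    print(tag_posts(sample))
```

Matching whole words keeps short keywords from firing inside longer, unrelated words, but it will miss misspellings and plurals, so the keyword lists need some care.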
>
> Network analysis of these things can be
> pretty straightforward, moderately complicated, or
> totally impossible. It all depends on how your
> data is structured and what type of information is
> available. If you've got a standard relational
> database structure designed to drive the content
> in the online community then at best you're in for
> some work to reformat the data tables into
> something useful for actual analysis. What you
> hope for are tables with various combinations of
> userid, postid, time stamp, user_data, and
> post_data. You'll probably need to join several
> tables to get the actual data files you need, and
> although MySQL is not optimized for joins it
> should still be manageable.
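
As a rough illustration of the kind of join described above, here is a sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. The table and column names (users, posts, joined, threadid, posted_at) are assumptions for the example; a real forum schema will differ.

```python
import sqlite3

def build_demo_db():
    """Create a tiny in-memory database with a hypothetical forum schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (userid INTEGER PRIMARY KEY, joined TEXT);
        CREATE TABLE posts (postid INTEGER PRIMARY KEY, userid INTEGER,
                            threadid INTEGER, posted_at TEXT, body TEXT);
        INSERT INTO users VALUES (1, '2009-01-05'), (2, '2009-03-10');
        INSERT INTO posts VALUES
            (10, 1, 100, '2010-02-01 09:00:00', 'first post'),
            (11, 2, 100, '2010-02-01 09:30:00', 'a reply'),
            (12, 1, 101, '2010-02-02 14:00:00', 'new thread');
    """)
    return conn

# One row per post, with the poster's user data attached.
JOIN_SQL = """
    SELECT p.postid, p.threadid, p.userid, u.joined, p.posted_at
    FROM posts AS p
    JOIN users AS u ON u.userid = p.userid
    ORDER BY p.posted_at
"""

def posts_with_user_data(conn):
    """Run the join and return the rows as a list of tuples."""
    return conn.execute(JOIN_SQL).fetchall()

if __name__ == "__main__":
    for row in posts_with_user_data(build_demo_db()):
        print(row)
```

The SELECT statement itself is plain SQL, so the same join should carry over to MySQL once the table and column names are adjusted.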
>
> Assuming you're new to this type of analysis, my
> advice is to spend a lot of time familiarizing
> yourself with the underlying data. Do some basic
> distributions and see how much noise you've got in
> the system, find out the most efficient routes to
> generating the output you need, and discover what
> information is available in which table and how
> those tables are keyed and indexed. Make sure
> you're on the lookout for garbage data - a lot of
> the great data you get from various online sources
> is basically bad, either because of system errors
> (rare and usually easy to find) or "bad" users
> (common and not so easy to find - there's a whole
> literature on spam detection algorithms out
> there). You'll need to make decisions about your
> error tolerance and what types of behaviors you
> wish to ignore, and you can only do that
> effectively if you understand your data.
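
One basic distribution worth looking at early is posts per user. The sketch below assumes the posts have already been pulled out as (postid, userid) pairs. The function names and the threshold used to flag unusually heavy posters are illustrative choices, not a spam detection method.

```python
from collections import Counter

def posts_per_user(post_rows):
    """Count posts for each user from (postid, userid) pairs."""
    return Counter(userid for _postid, userid in post_rows)

def distribution_of_counts(per_user):
    """Return {k: number of users who wrote exactly k posts}, sorted by k."""
    return dict(sorted(Counter(per_user.values()).items()))

def flag_heavy_posters(per_user, threshold):
    """List users with at least `threshold` posts for manual inspection."""
    return sorted(user for user, n in per_user.items() if n >= threshold)

if __name__ == "__main__":
    rows = [(1, "a"), (2, "a"), (3, "b"), (4, "a"), (5, "c")]
    per_user = posts_per_user(rows)
    print(distribution_of_counts(per_user))
    print(flag_heavy_posters(per_user, threshold=3))
```

Accounts that sit far out in the tail are not necessarily "bad" users, but they are a reasonable place to start reading posts by hand.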
>
> I don't know how much experience you have with
> database queries, but assuming you can only handle
> moderately complex queries my advice is to use the
> database as a source for your final dataset and
> then conduct your analysis in some other tool. My
> typical approach to this situation is to write
> database queries that produce flat text files with
> one row per observation (typically per user, or
> per user/time_period combination, but this
> obviously depends on your research question) with
> one column for each metric. Then I load the data
> into R or Stata or whatever else and build models.
> You can do a fair amount of work in MySQL, but
> this is typically slower and more difficult than
> exporting and using an actual statistical package.
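
The flat-file approach above (one row per user, one column per metric) can be sketched as follows. The input is assumed to be (userid, threadid, posted_at) rows with timestamps in a sortable "YYYY-MM-DD HH:MM:SS" form, and the particular metrics are only examples.

```python
import csv
import io

FIELDS = ["userid", "n_posts", "n_threads", "first_post", "last_post"]

def user_metrics(post_rows):
    """Aggregate (userid, threadid, posted_at) rows into one dict per user."""
    acc = {}
    for userid, threadid, posted_at in post_rows:
        m = acc.setdefault(userid, {"n_posts": 0, "threads": set(),
                                    "first": posted_at, "last": posted_at})
        m["n_posts"] += 1
        m["threads"].add(threadid)
        # ISO-style timestamp strings compare correctly as plain strings.
        m["first"] = min(m["first"], posted_at)
        m["last"] = max(m["last"], posted_at)
    return [
        {"userid": uid, "n_posts": m["n_posts"], "n_threads": len(m["threads"]),
         "first_post": m["first"], "last_post": m["last"]}
        for uid, m in sorted(acc.items())
    ]

def write_flat_file(metrics_rows, handle):
    """Write a tab-delimited file with a header row."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(metrics_rows)

if __name__ == "__main__":
    rows = [
        (1, 100, "2010-02-01 09:00:00"),
        (2, 100, "2010-02-01 09:30:00"),
        (1, 101, "2010-02-02 14:00:00"),
    ]
    buf = io.StringIO()
    write_flat_file(user_metrics(rows), buf)
    print(buf.getvalue())
```

A tab-delimited file with a header row loads directly into R or Stata; in practice you would pass an open file instead of the StringIO buffer.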
>
> The drawback of exporting data is that you're
> limited to whatever your stats package can hold in
> memory. If your data set is large (hundreds of
> thousands or millions of observations) then you
> need a fair amount of memory to run any kind of
> complex model. If you're dealing with 10s or 100s
> of millions of observations in your model then
> things get really interesting - I suggest
> sampling, but there are other more difficult
> options.
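
One way to do the sampling suggested above without first loading everything into memory is reservoir sampling, which draws a uniform random sample of k rows in a single pass over a stream. This is a generic sketch of that technique (the classic "Algorithm R"), not something specific to any stats package.

```python
import random

def reservoir_sample(rows, k, seed=None):
    """Return a uniform random sample of up to k items from an iterable.

    Only k items are held in memory at a time, so `rows` can be a
    database cursor or a file that is too large to load whole.
    """
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample

if __name__ == "__main__":
    print(reservoir_sample(range(1_000_000), k=5, seed=42))
```

If the analysis is per user rather than per post, it usually makes more sense to sample user ids and then keep all rows for the sampled users.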
>
> If you don't know MySQL at all, you need to learn
> it. There are a plethora of books on MySQL out
> there - I like O'Reilly for reference and SAMS for
> instruction, so if you get something from one of
> those publishers you should be ok. You will also
> want to learn how to do some scripting in Python
> or Perl. For the rank beginner, I recommend going
> with Python and learning by working through the
> chapters and exercises in How To Think Like a
> Computer Scientist (free online at
> http://www.greenteapress.com/thinkpython/thinkCSpy/html/).
> If you know how to program in general then
> diveintopython.org (http://diveintopython.org) is
> your best bet. I'm not a Perl guy, but I'm sure
> someone can point you to resources.
>
> For the actual network analysis, I'd first look
> into NodeXL since it's easy to use, and if that
> doesn't meet your needs I'd go with something like
> igraph (see http://igraph.sourceforge.net/ ),
> which works with both R and Python.
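
Before loading anything into a network tool, the forum data has to be turned into an edge list. The sketch below makes a strong simplifying assumption: every reply in a thread is treated as a tie from the replier to the person who started the thread. Real forums may record explicit reply-to links, which would be a better basis. Input rows are assumed to be (threadid, userid, posted_at).

```python
from collections import Counter, defaultdict

def reply_edges(post_rows):
    """Build weighted directed edges {(replier, starter): n_replies}.

    Assumption: the earliest post in a thread identifies the starter,
    and every later post by someone else is a reply to the starter.
    """
    threads = defaultdict(list)
    for threadid, userid, posted_at in post_rows:
        threads[threadid].append((posted_at, userid))
    edges = Counter()
    for posts in threads.values():
        posts.sort()  # chronological order within the thread
        starter = posts[0][1]
        for _posted_at, replier in posts[1:]:
            if replier != starter:  # ignore self-replies
                edges[(replier, starter)] += 1
    return edges

def weighted_degrees(edges):
    """Return (in_degree, out_degree) Counters, summing edge weights."""
    in_deg, out_deg = Counter(), Counter()
    for (src, dst), weight in edges.items():
        out_deg[src] += weight
        in_deg[dst] += weight
    return in_deg, out_deg

if __name__ == "__main__":
    rows = [
        (100, "a", "2010-02-01 09:00"),
        (100, "b", "2010-02-01 09:30"),
        (100, "c", "2010-02-01 10:00"),
        (101, "b", "2010-02-02 08:00"),
        (101, "a", "2010-02-02 09:00"),
    ]
    edges = reply_edges(rows)
    print(dict(edges))
    print(weighted_degrees(edges))
```

The resulting (source, target, weight) triples can be written out as a plain edge list, which is a format network analysis tools generally accept.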
>
> Best of luck.
>
> -Tom
>
> On Thu, Feb 25, 2010 at 6:14 PM, Tim Hale
> <timhale at uab.edu> wrote:
>
> Hi everyone,
>
> I am working with others on a project that will
> examine health communication among members of an
> online community who post to an online forum. We
> have three primary goals: (1) to conduct a
> content analysis to understand the types of
> health information that are communicated; (2)
> to identify the context of the discussions,
> including understanding the characteristics of
> the individuals who initiate and disseminate
> health information; and (3) to conduct a social
> network analysis to examine the larger
> structures of health information sharing among
> community members.
>
> Although this type of research could be
> conducted by manually collecting posts from the
> online forum, coding them for content, and
> creating a data set for social network
> analysis, we are interested in other
> approaches that make better use of the forum
> database files. We have the cooperation of the
> website owner and administrator to access the
> MySQL database.
>
> I am seeking advice from anyone with experience
> working on similar research questions involving
> online forums, especially in making use of the
> original forum database files. All
> recommendations, suggestions, and pointers to
> articles, books, and appropriate tools are
> welcome and greatly appreciated.
>
> Thank you,
> Tim Hale
>
> ------------------------------------------------------------
> Timothy M. Hale, MA
> University of Alabama at Birmingham
> Department of Sociology
> Heritage Hall 460E
> 1401 University Boulevard
> Birmingham, AL 35294-1152
> 205.222.8108 (cell)
> timhale at uab.edu
>
> _______________________________________________
> CITASA mailing list
> CITASA at list.citasa.org
> http://list.citasa.org/mailman/listinfo/citasa_list.citasa.org
>
> -------------------------------------------------
>
> CMQCC: Transforming Maternity Care
>
> -------------------------------------------------
>
> Christine H. Morton, PhD
> Program Manager/Research Sociologist
> California Maternal Quality Care Collaborative
>
> Stanford University
> Medical School Office Building
> 251 Campus Drive
> Palo Alto, CA 94305-5415
> p. 650-725-6108  f. 650-721-5751
> d. 650-721-2187  c. 650-995-4550
>
> cmorton at stanford.edu www.cmqcc.org
--------------------------------------
Caroline Haythornthwaite
Leverhulme Visiting Professor, Institute of Education, University of London (2009-10)
Professor, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, 501 East Daniel St., Champaign IL 61820 (haythorn at illinois.edu)