[CITASA] seeking suggestions about research methodology for online forums

Caroline Haythornthwaite haythorn at illinois.edu
Fri Feb 26 05:30:21 CST 2010


I forwarded Tim Hale's question to Anatoliy Gruzd at Dalhousie, but others may also be interested in his "TextAnalytics" system (see http://anatoliygruzd.com/home/?page_id=27). This provides a home for text analysis and network analysis of threaded discussions.

/Caroline

---- Original message ----
>Date: Thu, 25 Feb 2010 19:34:05 -0800
>From: Christine Morton <christine at christinemorton.com>  
>Subject: Re: [CITASA] seeking suggestions about research methodology for online forums  
>To: "Thomas M. Lento" <thomas.lento at gmail.com>, Tim Hale <timhale at uab.edu>
>Cc: CITASA at list.citasa.org
>
>   Wow.  Thank you for the question, Tim, and for
>   sharing the outline of your research.  It sounds
>   very interesting and relevant.  Would you be
>   interested in sharing more details with me offlist?
>   And thanks, Tom, for your thoughtful and thorough
>   reply. I've learned a lot.
>   Regards,
>   Christine
>
>   On 2/25/10 6:54 PM, "Thomas M. Lento"
>   <thomas.lento at gmail.com> wrote:
>
>     There are all kinds of supervised and
>     semi-supervised content analysis approaches that
>     might be applicable - their effectiveness depends
>     on the nature of your data, the level of detail
>     you require, and your ability to design training
>     data. For example, if you have a superset of
>     health information topics and keywords associated
>     with them you could probably do a simple keyword
>     analysis to give you an idea of which posts are
>     talking about which topic. The advantage is that
>     it's pretty easy to do even if the data is stored as
>     text blobs in your MySQL database. However, if you
>     need to do serious contextual analysis or you
>     don't already know the scope and range of possible
>     health topics and keywords that's a more difficult
>     problem. You'll probably need to try some
>     different machine learning approaches and see what
>     works best for your needs. You can do literature
>     searches for papers published at ICWSM, KDD, or
>     WWW for examples of some approaches. There are
>     other conferences that will be applicable, but
>     those are good places to start. Follow the
>     citation trail from the relevant pieces and you
>     should get some idea of where to look for useful
>     references.
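
A minimal sketch of the simple keyword analysis described above. The topic names, keyword sets, and sample posts are hypothetical; a real project would substitute its own list of health topics and read the post bodies from the forum database.

```python
import re

# Hypothetical topic -> keyword map; replace with the project's own codebook.
TOPIC_KEYWORDS = {
    "diabetes": {"insulin", "glucose", "diabetes"},
    "pregnancy": {"pregnant", "trimester", "midwife"},
}


def tag_topics(text, topic_keywords=TOPIC_KEYWORDS):
    """Return the sorted list of topics with at least one keyword in the text."""
    # Lowercase and split into word tokens so matching is case-insensitive.
    words = set(re.findall(r"[a-z']+", text.lower()))
    return sorted(topic for topic, kws in topic_keywords.items() if words & kws)


# Hypothetical posts keyed by postid, standing in for text blobs from the database.
posts = {
    1: "My glucose numbers improved after changing insulin timing.",
    2: "Any midwife recommendations for the third trimester?",
    3: "Thanks everyone, see you next week.",
}

tagged = {postid: tag_topics(body) for postid, body in posts.items()}
for postid, topics in tagged.items():
    print(postid, topics)
```

Exact word matching like this misses misspellings, plurals, and context, which is the limitation noted above for anything beyond a first pass.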
>
>     Network analysis of these things can be
>     pretty straightforward, moderately complicated, or
>     totally impossible. It all depends on how your
>     data is structured and what type of information is
>     available. If you've got a standard relational
>     database structure designed to drive the content
>     in the online community then at best you're in for
>     some work to reformat the data tables into
>     something useful for actual analysis. What you
>     hope for are tables with various combinations of
>     userid, postid, time stamp, user_data, and
>     post_data. You'll probably need to join several
>     tables to get the actual data files you need, and
>     although MySQL is not optimized for joins it
>     should still be manageable.
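
To illustrate the kind of join described above, here is a small sketch. It assumes two hypothetical tables, users and posts, with made-up column names; a real forum schema will differ. Python's built-in sqlite3 module stands in for MySQL so the example runs without a server, and the same JOIN syntax is valid in MySQL.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the forum's MySQL database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (userid INTEGER PRIMARY KEY, joined TEXT);
    CREATE TABLE posts (postid INTEGER PRIMARY KEY, userid INTEGER,
                        posted_at TEXT, body TEXT);
    INSERT INTO users VALUES (1, '2009-01-05'), (2, '2009-03-12');
    INSERT INTO posts VALUES
        (10, 1, '2010-02-01 09:00:00', 'first post'),
        (11, 2, '2010-02-01 09:30:00', 'a reply'),
        (12, 1, '2010-02-02 08:15:00', 'follow-up');
""")

# One row per post, with the author's user-level data attached.
rows = conn.execute("""
    SELECT p.postid, p.userid, u.joined, p.posted_at, p.body
    FROM posts AS p
    JOIN users AS u ON u.userid = p.userid
    ORDER BY p.postid
""").fetchall()
conn.close()

for row in rows:
    print(row)
```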
>
>     Assuming you're new to this type of analysis, my
>     advice is to spend a lot of time familiarizing
>     yourself with the underlying data. Do some basic
>     distributions and see how much noise you've got in
>     the system, find out the most efficient routes to
>     generating the output you need, and discover what
>     information is available in which table and how
>     those tables are keyed and indexed. Make sure
>     you're on the lookout for garbage data - a lot of
>     the great data you get from various online sources
>     is basically bad, either because of system errors
>     (rare and usually easy to find) or "bad" users
>     (common and not so easy to find - there's a whole
>     literature on spam detection algorithms out
>     there). You'll need to make decisions about your
>     error tolerance and what types of behaviors you
>     wish to ignore, and you can only do that
>     effectively if you understand your data.
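
As one example of the basic distributions suggested above, this sketch counts posts per user and then counts how many users fall at each activity level. The (postid, userid) pairs are invented for illustration. A single account responsible for a very large share of posts is the kind of thing worth inspecting by hand before deciding what to exclude.

```python
from collections import Counter

# Hypothetical (postid, userid) pairs as they might come out of a posts table.
post_authors = [
    (1, "a"), (2, "a"), (3, "b"), (4, "a"),
    (5, "c"), (6, "a"), (7, "b"),
]

# Number of posts written by each user.
posts_per_user = Counter(userid for _, userid in post_authors)

# Distribution: how many users wrote exactly k posts.
users_per_count = Counter(posts_per_user.values())

# Share of all posts written by the single most active user.
total_posts = sum(posts_per_user.values())
top_user, top_count = posts_per_user.most_common(1)[0]
top_share = top_count / total_posts

print(dict(posts_per_user))
print(dict(users_per_count))
print(top_user, round(top_share, 2))
```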
>
>     I don't know how much experience you have with
>     database queries, but assuming you can only handle
>     moderately complex queries my advice is to use the
>     database as a source for your final dataset and
>     then conduct your analysis in some other tool. My
>     typical approach to this situation is to write
>     database queries that produce flat text files with
>     one row per observation (typically per user, or
>     per user/time_period combination, but this
>     obviously depends on your research question) with
>     one column for each metric. Then I load the data
>     into R or Stata or whatever else and build models.
>     You can do a fair amount of work in MySQL, but
>     this is typically slower and more difficult than
>     exporting and using an actual statistical package.
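
A sketch of the flat-file layout described above: one row per user/time_period combination and one column per metric. The input records, the choice of calendar month as the period, and the two metrics (n_posts, n_words) are assumptions for illustration. The result is tab-delimited text that R or Stata can read.

```python
import csv
import io
from collections import defaultdict

# Hypothetical query output: (userid, posted_at, word_count), one row per post.
records = [
    (1, "2010-02-01", 40),
    (1, "2010-02-03", 60),
    (2, "2010-02-01", 10),
    (1, "2010-03-09", 25),
]

# Aggregate to one observation per (userid, month).
metrics = defaultdict(lambda: {"n_posts": 0, "n_words": 0})
for userid, posted_at, word_count in records:
    key = (userid, posted_at[:7])  # "YYYY-MM" serves as the time period
    metrics[key]["n_posts"] += 1
    metrics[key]["n_words"] += word_count


def write_flat_file(handle):
    """Write a header row, then one tab-delimited row per observation."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["userid", "month", "n_posts", "n_words"])
    for (userid, month), m in sorted(metrics.items()):
        writer.writerow([userid, month, m["n_posts"], m["n_words"]])


# Written to an in-memory buffer here; pass an open file object to save to disk.
buffer = io.StringIO()
write_flat_file(buffer)
flat_text = buffer.getvalue()
print(flat_text)
```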
>
>     The drawback of exporting data is that you're
>     limited to whatever your stats package can hold in
>     memory. If your data set is large (hundreds of
>     thousands or millions of observations) then you
>     need a fair amount of memory to run any kind of
>     complex model. If you're dealing with 10s or 100s
>     of millions of observations in your model then
>     things get really interesting - I suggest
>     sampling, but there are other more difficult
>     options.
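
One possible way to do the sampling suggested above is reservoir sampling, which draws a fixed-size simple random sample in a single pass without holding the full data set in memory. This is an illustrative technique, not one named in the message above, and the function name and parameters are hypothetical.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Return k rows chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


# range() stands in for a database cursor or a file read line by line.
sample = reservoir_sample(range(100_000), k=1000, seed=42)
print(len(sample))
```

For network analysis, a random sample of individual observations can break up the structure of interest, so it may be preferable to sample whole threads or users instead.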
>
>     If you don't know MySQL at all, you need to learn
>     it. There is a plethora of books on MySQL out
>     there - I like O'Reilly for reference and SAMS for
>     instruction, so if you get something from one of
>     those publishers you should be ok. You will also
>     want to learn how to do some scripting in Python
>     or Perl. For the rank beginner, I recommend going
>     with Python and learning by working through the
>     chapters and exercises in How To Think Like a
>     Computer Scientist (free online at
>     http://www.greenteapress.com/thinkpython/thinkCSpy/html/
>     ). If you know how to program in general then
>     http://diveintopython.org is
>     your best bet. I'm not a Perl guy, but I'm sure
>     someone can point you to resources.
>
>     For the actual network analysis, I'd first look
>     into NodeXL since it's easy to use, and if that
>     doesn't meet your needs I'd go with something like
>     igraph (see http://igraph.sourceforge.net/ ),
>     which works with both R and Python.
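
Before loading anything into a network tool, the forum data has to be turned into an edge list. The sketch below builds a weighted "who replied to whom" edge list, assuming each post row carries a parent_postid column pointing at the post it answers. That column is a hypothetical; many forum schemas only record the thread, in which case a different tie definition is needed. An edge list of this form is a common input format for network analysis tools.

```python
from collections import Counter

# Hypothetical rows: (postid, userid, parent_postid); None marks a thread starter.
posts = [
    (1, "ann", None),
    (2, "bob", 1),
    (3, "cat", 1),
    (4, "ann", 2),
    (5, "bob", 4),
    (6, "bob", 5),  # bob replying to his own post
]

# Look up the author of any post by its id.
author_of = {postid: userid for postid, userid, _ in posts}

# Directed, weighted edges: (replier, original author) -> number of replies.
edge_weights = Counter()
for postid, userid, parent in posts:
    target = author_of.get(parent)  # None for thread starters or missing parents
    if target is None or target == userid:
        continue  # skip thread starters, orphaned replies, and self-replies
    edge_weights[(userid, target)] += 1

# Weighted in-degree: how many replies each user received from others.
in_degree = Counter()
for (source, target), weight in edge_weights.items():
    in_degree[target] += weight

for (source, target), weight in sorted(edge_weights.items()):
    print(source, target, weight)
```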
>
>     Best of luck.
>
>     -Tom
>
>     On Thu, Feb 25, 2010 at 6:14 PM, Tim Hale
>     <timhale at uab.edu> wrote:
>
>       Hi everyone,
>
>       I am working with others on a project that will
>       examine health communication among members of an
>       online community who post to an online forum. We
>       have three primary goals: (1) to conduct a
>       content analysis to understand the types of
>       health information that is communicated; (2)
>       to identify the context of the discussions,
>       including understanding the characteristics of
>       the individuals who initiate and disseminate
>       health information; and (3) to conduct a social
>       network analysis to examine the larger
>       structures of health information sharing among
>       community members.
>
>       Although this type of research could be
>       conducted by manually collecting posts from the
>       online forum, coding them for content, and
>       creating a data set for social network
>       analysis, we are interested in other
>       approaches that make better use of the forum
>       database files. We have the cooperation of the
>       website owner and administrator to access the
>       MySQL database.
>
>       I am seeking advice from anyone with experience
>       working on similar research questions involving
>       online forums and, especially, making use of the
>       original forum database files. All
>       recommendations, suggestions, and pointers to
>       articles, books, and appropriate tools are
>       welcome and greatly appreciated.
>
>       Thank you,
>       Tim Hale
>
>       ------------------------------------------------------------
>       Timothy M. Hale, MA
>       University of Alabama at Birmingham
>       Department of Sociology
>       Heritage Hall 460E
>       1401 University Boulevard
>       Birmingham, AL 35294-1152
>       205.222.8108 (cell)
>       timhale at uab.edu
>
>       _______________________________________________
>       CITASA mailing list
>       CITASA at list.citasa.org
>       http://list.citasa.org/mailman/listinfo/citasa_list.citasa.org
>
>    -------------------------------------------------
>
>   CMQCC:  Transforming Maternity Care 
>
>    -------------------------------------------------
>
>   Christine H. Morton, PhD
>   Program Manager/Research Sociologist
>   California Maternal Quality Care Collaborative
>
>   Stanford University
>   Medical School Office Building
>   251 Campus Drive
>   Palo Alto, CA  94305-5415
>
>   p. 650-725-6108    f. 650-721-5751
>   d. 650-721-2187    c. 650-995-4550
>
>   cmorton at stanford.edu        www.cmqcc.org
--------------------------------------
Caroline Haythornthwaite

Leverhulme Visiting Professor, Institute of Education, University of London (2009-10)

Professor, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, 501 East Daniel St., Champaign IL 61820 (haythorn at illinois.edu)



